On 01.03.2018 23:20, Andreas Dilger wrote:
> 
> On Mar 1, 2018, at 9:04 AM, Theodore Ts'o <ty...@mit.edu> wrote:
>> This doesn't seem to make sense; the PC is where we are currently
>> executing, and LR is the "Link Register" where the flow of control
>> will be returning after the current function returns, right?  Well,
>> dx_probe should *not* be returning to __wait_on_bit().  So this just
>> seems.... weird.
>>
>> Ignoring the LR register, this stack trace looks sane...  I can't see
>> which pointer could be NULL and getting dereferenced, though.  How
>> easily can you reproduce the problem?  Can you either (a) translate
>> the PC into a line number, or better yet, if you can reproduce, add a
>> series of BUG_ON's so we can see what's going on?

Ted, thank you for the suggestion. I don't have a bug-reproducer, it happens
only under some IO load and quite randomly. I've applied the BUG_ON()'s, but it
may take some time to catch the bug again.

>> +    BUG_ON(frame);
> 
> I think you mean:
>       BUG_ON(frame == NULL);
> or
>       BUG_ON(!frame);
> 
> 
>>      memset(frame_in, 0, EXT4_HTREE_LEVEL * sizeof(frame_in[0]));
>>      frame->bh = ext4_read_dirblock(dir, 0, INDEX);
>>      if (IS_ERR(frame->bh))
>>              return (struct dx_frame *) frame->bh;
>>
>> +    BUG_ON(frame->bh);
>> +    BUG_ON(frame->bh->b_data);
> 
> Same here.
> 
>       BUG_ON(frame->bh == NULL);
>       BUG_ON(frame->bh->b_data == NULL);
> 
> This is why I don't like implicit "is NULL" or "is non-zero" usage.  Lustre
> used to require "== NULL" or "!= NULL" to avoid bugs like this, but had to
> abandon that because of upstream code style.

Well spotted, thanks Andreas.

>>      root = (struct dx_root *) frame->bh->b_data;
>>      if (root->info.hash_version != DX_HASH_TEA &&
>>          root->info.hash_version != DX_HASH_HALF_MD4 &&
>>          root->info.hash_version != DX_HASH_LEGACY) {
>>
>> These are "could never" happen scenarios from looking at the code, but
>> that will help explain what is going on.
>>
>> If this is reliably only happening with mq, the only way I could see
>> that if is something is returning an error when it previously wasn't.
>> This isn't a problem we're seeing with any of our testing, though.

Reply via email to