On 2019/9/27 上午2:33, Josef Bacik wrote:
> On Thu, Sep 26, 2019 at 02:35:45PM +0800, Qu Wenruo wrote:
>> [BUG]
>> There is one BUG_ON() report where a transaction is aborted during
>> balance, then kernel BUG_ON() in merge_reloc_roots():
>>
>>   void merge_reloc_roots(struct reloc_control *rc)
>>   {
>>      ...
>>      BUG_ON(!RB_EMPTY_ROOT(&rc->reloc_root_tree.rb_root)); <<<
>>   }
>>
>> [CAUSE]
>> It's still uncertain why we can get to such situation.
>> As all __add_reloc_root() calls will also link that reloc root to
>> rc->reloc_roots, and in merge_reloc_roots() we cleanup rc->reloc_roots.
>>
>> So the root cause is still uncertain.
>>
>> [FIX]
>> But we can still hunt down all the BUG_ON() in merge_reloc_roots().
>>
>> There are 3 BUG_ON()s in it:
>> - BUG_ON() for read_fs_root() result
>> - BUG_ON() for root->reloc_root != reloc_root case
>> - BUG_ON() for the non-empty reloc_root_tree
>>
>> For the first two, just grab the return value and goto out tag for
>> cleanup.
>>
>> For the last one, make it more graceful by:
>> - grab the lock before doing read/write
>> - warn instead of panic
>> - cleanup the nodes in rc->reloc_root_tree
>>
>> Signed-off-by: Qu Wenruo <w...@suse.com>
> 
> If we're going to do these things we might as well add the ability to error
> inject and add an xfstest to verify they don't break anything.

BCC has tool to inject certain errors to memory allocation.
And I have btrfs-profiler tool to inject any overrideable function with
similiar callchain condition.

The problem is, it's not something can be provided through xfstest itself.

Furthermore, I see no reason why merge_reloc_root() failure could lead
to the final BUG_ON(), thus I'm not sure even with error injection we
could hit the final BUG_ON(), meaning even with error injection, we may
not get full coverage.

But anyway, I'll add the error injection test result in the next update.

Thanks,
Qu

>  The
> BUG_ON(root->reloc_root != reloc_root) isn't able to be tested, but you can at
> least make read_fs_root() and merge_reloc_root() and then test this patch by
> triggering errors in these paths.
> 
> If you do that I'm ok with not being able to explain the leaking nodes, 
> getting
> rid of the BUG_ON()'s is good enough reason.  Thanks,
> 
> Josef
> 

Reply via email to