On 2019/9/27 上午2:33, Josef Bacik wrote: > On Thu, Sep 26, 2019 at 02:35:45PM +0800, Qu Wenruo wrote: >> [BUG] >> There is one BUG_ON() report where a transaction is aborted during >> balance, then kernel BUG_ON() in merge_reloc_roots(): >> >> void merge_reloc_roots(struct reloc_control *rc) >> { >> ... >> BUG_ON(!RB_EMPTY_ROOT(&rc->reloc_root_tree.rb_root)); <<< >> } >> >> [CAUSE] >> It's still uncertain why we can get to such situation. >> As all __add_reloc_root() calls will also link that reloc root to >> rc->reloc_roots, and in merge_reloc_roots() we cleanup rc->reloc_roots. >> >> So the root cause is still uncertain. >> >> [FIX] >> But we can still hunt down all the BUG_ON() in merge_reloc_roots(). >> >> There are 3 BUG_ON()s in it: >> - BUG_ON() for read_fs_root() result >> - BUG_ON() for root->reloc_root != reloc_root case >> - BUG_ON() for the non-empty reloc_root_tree >> >> For the first two, just grab the return value and goto out tag for >> cleanup. >> >> For the last one, make it more graceful by: >> - grab the lock before doing read/write >> - warn instead of panic >> - cleanup the nodes in rc->reloc_root_tree >> >> Signed-off-by: Qu Wenruo <w...@suse.com> > > If we're going to do these things we might as well add the ability to error > inject and add an xfstest to verify they don't break anything.
BCC has tool to inject certain errors to memory allocation. And I have btrfs-profiler tool to inject any overrideable function with similiar callchain condition. The problem is, it's not something can be provided through xfstest itself. Furthermore, I see no reason why merge_reloc_root() failure could lead to the final BUG_ON(), thus I'm not sure even with error injection we could hit the final BUG_ON(), meaning even with error injection, we may not get full coverage. But anyway, I'll add the error injection test result in the next update. Thanks, Qu > The > BUG_ON(root->reloc_root != reloc_root) isn't able to be tested, but you can at > least make read_fs_root() and merge_reloc_root() and then test this patch by > triggering errors in these paths. > > If you do that I'm ok with not being able to explain the leaking nodes, > getting > rid of the BUG_ON()'s is good enough reason. Thanks, > > Josef >