On Fri, Apr 24, 2015 at 2:00 PM, Chris Mason <c...@fb.com> wrote:
> On 04/24/2015 02:34 AM, Filipe David Manana wrote:
>> On Thu, Apr 23, 2015 at 8:50 PM, Chris Mason <c...@fb.com> wrote:
>>> On 04/23/2015 03:43 PM, Filipe David Manana wrote:
>>>> On Thu, Apr 23, 2015 at 4:48 PM, Filipe David Manana <fdman...@gmail.com> 
>>>> wrote:
>>>>> On Thu, Apr 23, 2015 at 4:17 PM, Chris Mason <c...@fb.com> wrote:
>>>>>> On Thu, Apr 23, 2015 at 02:05:48PM +0100, Filipe David Manana wrote:
>>>>>>>>> Trying the current integration-4.1 branch, I ran into the following
>>>>>>>>> during xfstests/btrfs/049:
>>>>>>>>>
>>>>>>>>
>>>>>>>> Ugh, I must not be waiting correctly in one of the inode cache writeout
>>>>>>>> sections.  But I've run 049 a whole bunch of times without triggering,
>>>>>>>> can you get this to happen consistently?
>>>>>>>
>>>>>>> All the time so far.
>>>>>>
>>>>>> I'm testing with this now:
>>>>>>
>>>>>> commit 9f433238891b1b243c4f19d3f36eed913b270cbc
>>>>>> Author: Chris Mason <c...@fb.com>
>>>>>> Date:   Thu Apr 23 08:02:49 2015 -0700
>>>>>>
>>>>>>     Btrfs: fix inode cache writeout
>>>>>>
>>>>>>     The code to fix stalls during free spache cache IO wasn't using
>>>>>>     the correct root when waiting on the IO for inode caches.  This
>>>>>>     is only a problem when the inode cache is enabled with
>>>>>>
>>>>>>     mount -o inode_cache
>>>>>>
>>>>>>     This fixes the inode cache writeout to preserve any error values and
>>>>>>     makes sure not to override the root when inode cache writeout is 
>>>>>> done.
>>>>>>
>>>>>>     Reported-by: Filipe Manana <fdman...@suse.com>
>>>>>>     Signed-off-by: Chris Mason <c...@fb.com>
>>>>>
>>>>> Thanks, btrfs/049 now passes with that patch applied.
>>>>> Running the whole xfstests suite now.
>>>>
>>>> btrfs/066 also failed once during final fsck with:
>>>>
>>>> _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
>>>> *** fsck.btrfs output ***
>>>> checking extents
>>>> checking free space cache
>>>> There is no free space entry for 21676032-21680128
>>>> There is no free space entry for 21676032-87031808
>>>> cache appears valid but isnt 20971520
>>>
>>> Josef has a btrfs-progs patch for this.  The kernel will toss the cache.
>>>  There's a somewhat fundamental race in cache writeout this patch makes
>>> a little bigger, but it has always been there.
>>>
>>> (compare what find_free_extent can do with no trans running vs the
>>> actual cache writeback)
>>
>> There's also one list corruption I didn't get before and happened
>> while running fsstress (btrfs/078), apparently due to some race:
>
> Can you please bang on this and get a more reliable reproduction? I'll
> take a look.

Not really that easy to get a more reliable reproducer - just run
fsstress with multiple processes - it already happened twice again
after I sent the previous mail.
>From the quick look I had at this, this seems to be the change causing
the problem:

http://git.kernel.org/cgit/linux/kernel/git/mason/linux-btrfs.git/commit/?h=for-linus-4.1&id=1bbc621ef28462456131c035eaeb5567a1a2a2fe

Early in btrfs_commit_transaction(), btrfs_start_dirty_block_groups()
is called which ends up calling __btrfs_write_out_cache() for each
dirty block group, which collects all the bitmap entries from the bg's
space cache into a local list while holding the cache's ctl->tree_lock
(to serialize with concurrent allocation requests).

Then we unlock ctl->tree_lock, do other stuff and later acquire
ctl->tree_lock again and call write_bitmap_entries() to write the
bitmap entries we previously collected. However, while we were doing
the other stuff without holding that lock, allocation requests might
have happened right? - since when we call
btrfs_start_dirty_block_groups() in btrfs_commit_transaction() the
transaction state wasn't yet changed, allowing other tasks to join the
current transaction. If such other task allocates all the remaining
space from a bitmap entry we collected before (because it's still in
the space cache's rbtree), it ends up deleting it and freeing its
->bitmap member, which results in an invalid memory access (and the
warning on the list corruption) when we later call
write_bitmap_entries() in __btrfs_write_out_cache() - which is what
the second part of the trace I sent says:


[25590.890573] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
(...)
[25590.892522] CPU: 3 PID: 7280 Comm: fsstress Tainted: G        W
  4.0.0-rc5-btrfs-next-9+ #1
[25590.892522] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
04/01/2014
[25590.892522] task: ffff88040f485bd0 ti: ffff8803f0318000 task.ti:
ffff8803f0318000
[25590.892522] RIP: 0010:[<ffffffffa04cff2c>]  [<ffffffffa04cff2c>]
write_bitmap_entries+0x66/0xbd [btrfs]
(...)
[25590.892522] Stack:
[25590.892522]  ffffffffa04d0858 ffff880041710df0 ffff880041710c00
ffff88033ba59c90
[25590.892522]  0000000000000000 ffff8803efb6ff70 ffff8803f031bd78
ffffffffa04d0864
[25590.892522]  ffff88037cae7f40 ffff88033ba599d0 ffff880226cd4000
ffff8803efb6ff00
[25590.892522] Call Trace:
[25590.892522]  [<ffffffffa04d0858>] ?
__btrfs_write_out_cache.isra.21+0x20b/0x3a1 [btrfs]
[25590.892522]  [<ffffffffa04d0864>]
__btrfs_write_out_cache.isra.21+0x217/0x3a1 [btrfs]
[25590.892522]  [<ffffffffa04d1269>] ? btrfs_write_out_cache+0x41/0xdc [btrfs]
[25590.892522]  [<ffffffffa04d12bb>] btrfs_write_out_cache+0x93/0xdc [btrfs]
[25590.892522]  [<ffffffffa04888e7>] ?
btrfs_start_dirty_block_groups+0x156/0x29b [btrfs]
[25590.892522]  [<ffffffffa0488977>]
btrfs_start_dirty_block_groups+0x1e6/0x29b [btrfs]
[25590.892522]  [<ffffffffa04970f2>]
btrfs_commit_transaction+0x130/0x9c9 [btrfs]

And write_bitmap_entries+0x66 corresponds to:

static int io_ctl_add_bitmap(struct btrfs_io_ctl *io_ctl, void *bitmap)
{
(...)
memcpy(io_ctl->cur, bitmap, PAGE_CACHE_SIZE);
(...)
}

I.e. the ->bitmap field from an entry we pass from write_bitmap_entries().
So releasing the tree_lock after collecting the bitmaps and before
writing them out seems to me what leads to the race and invalid memory
access. But I might be wrong, I haven't spent more than 5 minutes
analyzing it.


>
> -chris
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to