Hi,

just a short update on this topic:

I also tried the Ubuntu 4.0.0-rc1 PPA kernel -> the problems are still there.

Luckily kernel 4.0.0-rc2 was released yesterday:
I updated my machine to kernel 4.0.0-rc2 and the problems are gone
(test script has been running fine for about 12 hours now)
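For reference, the raid-level test matrix from my earlier mail (quoted below) can be sketched as a dry-run shell script. It only prints the commands instead of running them; the device glob, mount point, and dd size are from my setup and illustrative only:

```shell
#!/bin/sh
# Dry-run sketch of the raid-level test matrix: for each profile,
# print the mkfs/mount/dd/umount commands that were used.
# Device glob and mount point match my setup; adjust as needed.
DEVICES='/dev/cciss/c1d*'
MNT=/mnt/btrfs-test

test_matrix() {
    for LEVEL in raid0 raid1 raid5 raid6; do
        echo "mkfs.btrfs $DEVICES -m $LEVEL -d $LEVEL -f"
        echo "mount /dev/cciss/c1d0 $MNT"
        # Big streaming write; size is illustrative.
        echo "dd if=/dev/zero of=$MNT/bigfile bs=1M count=100000"
        echo "umount $MNT"
    done
}

test_matrix
```

With the echoes removed this runs the actual sequence; on my box the raid5/raid6 iterations are the ones that hang the machine.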

Bye,
    Marcel

2015-03-03 12:05 GMT+01:00 Liu Bo <bo.li....@oracle.com>:
> On Tue, Mar 03, 2015 at 08:31:10AM +0100, Marcel Ritter wrote:
>> Hi,
>>
>> yes it is reproducible.
>>
>> Just creating a new btrfs filesystem (14 disks, data/mdata raid6,
>> latest git btrfs-progs) and mounting this filesystem causes the
>> system to hang (I think I once even got it mounted, but it hung
>> shortly after, when dd started writing to it).
>>
>> I just ran some quick tests and (at least at first sight) it looks
>> like the raid5/6 code may be causing the trouble:
>>
>> I created different btrfs filesystem types, mounted them and (if possible)
>> did a big "dd" on the filesystem:
>>
>> mkfs.btrfs /dev/cciss/c1d* -m raid0 -d raid0 -f -> no problem (only short test)
>> mkfs.btrfs /dev/cciss/c1d* -m raid1 -d raid1 -f -> no problem (only short test)
>> mkfs.btrfs /dev/cciss/c1d* -m raid5 -d raid5 -f -> (almost) instant hang
>> mkfs.btrfs /dev/cciss/c1d* -m raid6 -d raid6 -f -> (almost) instant hang (standard test)
>>
>> Once the machine is up again I'll do some more testing (varying the
>> combination of data and mdata raid levels).
>
> Hmm, just FYI, raid5/6 works well on my box with 4.0.0-rc1.
>
> Thanks,
>
> -liubo
>
>>
>> Bye,
>>    Marcel
>>
>>
>> 2015-03-03 7:37 GMT+01:00 Liu Bo <bo.li....@oracle.com>:
>> > On Tue, Mar 03, 2015 at 07:02:01AM +0100, Marcel Ritter wrote:
>> >> Hi,
>> >>
>> >> yesterday I did a kernel update on my btrfs test system (Ubuntu
>> >> 14.04.2) from custom-built kernel 3.19-rc6 to 4.0.0-rc1.
>> >>
>> >> Almost instantly after starting my test script, the system got stuck
>> >> with soft lockups (the machine was running the very same test for
>> >> weeks on the old kernel without problems,
>> >> basically doing massive streaming i/o on a raid6 btrfs volume):
>> >>
>> >> I found 2 types of messages in the logs:
>> >>
>> >> one btrfs related:
>> >>
>> >> [34165.540004] INFO: rcu_sched detected stalls on CPUs/tasks: { 3 7}
>> >> (detected by 6, t=6990777 jiffies, g=67455, c=67454, q=0)
>> >> [34165.540004] Task dump for CPU 3:
>> >> [34165.540004] mount           D ffff8803ed266000     0 15156  15110 
>> >> 0x00000000
>> >> [34165.540004]  0000000000000158 0000000000000014 ffff8803ecc13718
>> >> ffff8803ecc136d8
>> >> [34165.540004]  ffffffff8106075a 0000000000000000 0000000000000002
>> >> 0000000000000000
>> >> [34165.540004]  00000000ecc13728 ffff8803eb603128 0000000000000000
>> >> 0000000000000000
>> >> [34165.540004] Call Trace:
>> >> [34165.540004]  [<ffffffff8106075a>] ? __do_page_fault+0x2fa/0x440
>> >> [34165.540004]  [<ffffffff810608d1>] ? do_page_fault+0x31/0x70
>> >> [34165.540004]  [<ffffffff81792778>] ? page_fault+0x28/0x30
>> >> [34165.540004]  [<ffffffff810ae2ce>] ? pick_next_task_fair+0x53e/0x880
>> >> [34165.540004]  [<ffffffff810ae2ce>] ? pick_next_task_fair+0x53e/0x880
>> >> [34165.540004]  [<ffffffff8109707c>] ? dequeue_task+0x5c/0x80
>> >> [34165.540004]  [<ffffffff8178b9a3>] ? __schedule+0xf3/0x960
>> >> [34165.540004]  [<ffffffff8178c247>] ? schedule+0x37/0x90
>> >> [34165.540004]  [<ffffffffa0896375>] ?
>> >> btrfs_start_ordered_extent+0xd5/0x110 [btrfs]
>> >> [34165.540004]  [<ffffffff810b3cb0>] ? prepare_to_wait_event+0x110/0x110
>> >> [34165.540004]  [<ffffffffa0896884>] ?
>> >> btrfs_wait_ordered_range+0xc4/0x120 [btrfs]
>> >> [34165.540004]  [<ffffffffa08c0c18>] ?
>> >> __btrfs_write_out_cache+0x378/0x470 [btrfs]
>> >> [34165.540004]  [<ffffffffa08c104a>] ? btrfs_write_out_cache+0x9a/0x100 
>> >> [btrfs]
>> >> [34165.540004]  [<ffffffffa086af79>] ?
>> >> btrfs_write_dirty_block_groups+0x159/0x560 [btrfs]
>> >> [34165.540004]  [<ffffffffa08f2aa6>] ? commit_cowonly_roots+0x18d/0x2a4 
>> >> [btrfs]
>> >> [34165.540004]  [<ffffffffa087bd31>] ?
>> >> btrfs_commit_transaction+0x521/0xa50 [btrfs]
>> >> [34165.540004]  [<ffffffffa08a3fbe>] ? btrfs_create_uuid_tree+0x5e/0x110 
>> >> [btrfs]
>> >> [34165.540004]  [<ffffffffa087963f>] ? open_ctree+0x1dff/0x2200 [btrfs]
>> >> [34165.540004]  [<ffffffffa084f7ce>] ? btrfs_mount+0x75e/0x8f0 [btrfs]
>> >> [34165.540004]  [<ffffffff811ecbf9>] ? mount_fs+0x39/0x180
>> >> [34165.540004]  [<ffffffff81192405>] ? __alloc_percpu+0x15/0x20
>> >> [34165.540004]  [<ffffffff812082bb>] ? vfs_kern_mount+0x6b/0x120
>> >> [34165.540004]  [<ffffffff8120afe4>] ? do_mount+0x204/0xb30
>> >> [34165.540004]  [<ffffffff8120bc0b>] ? SyS_mount+0x8b/0xe0
>> >> [34165.540004]  [<ffffffff817905ed>] ? system_call_fastpath+0x16/0x1b
>> >> [34165.540004] Task dump for CPU 7:
>> >> [34165.540004] kworker/u16:1   R  running task        0 14518      2 
>> >> 0x00000008
>> >> [34165.540004] Workqueue: btrfs-freespace-write
>> >> btrfs_freespace_write_helper [btrfs]
>> >> [34165.540004]  0000000000000200 ffff8803eac6fdf8 ffffffffa08ac242
>> >> ffff8803eac6fe48
>> >> [34165.540004]  ffffffff8108b64f 00000000f1091400 0000000000000000
>> >> ffff8803eca58000
>> >> [34165.540004]  ffff8803ea9ed3c0 ffff8803f1091418 ffff8803f1091400
>> >> ffff8803eca58000
>> >> [34165.540004] Call Trace:
>> >> [34165.540004]  [<ffffffffa08ac242>] ?
>> >> btrfs_freespace_write_helper+0x12/0x20 [btrfs]
>> >> [34165.540004]  [<ffffffff8108b64f>] ? process_one_work+0x14f/0x420
>> >> [34165.540004]  [<ffffffff8108be08>] ? worker_thread+0x118/0x510
>> >> [34165.540004]  [<ffffffff8108bcf0>] ? rescuer_thread+0x3d0/0x3d0
>> >> [34165.540004]  [<ffffffff81091212>] ? kthread+0xd2/0xf0
>> >> [34165.540004]  [<ffffffff81091140>] ? kthread_create_on_node+0x180/0x180
>> >> [34165.540004]  [<ffffffff8179053c>] ? ret_from_fork+0x7c/0xb0
>> >> [34165.540004]  [<ffffffff81091140>] ? kthread_create_on_node+0x180/0x180
>> >>
>> >>
>> >> and one general one (related to "native_flush_tlb_others"):
>> >>
>> >> [34152.604004] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s!
>> >> [rs:main Q:Reg:490]
>> >> [34152.604004] Modules linked in: btrfs(E) xor(E) radeon(E) ttm(E)
>> >> drm_kms_helper(E) kvm(E) drm(E) raid6_pq(E) i2c_algo_bit(E)
>> >> ipmi_si(E) amd64_edac_mod(E) serio_raw(E) hpilo(E) hpwdt(E)
>> >> edac_core(E) shpchp(E) k8temp(E) mac_hid(E) edac_mce_amd(E) nfsd(E)
>> >> auth_rpcgss(E) nfs_acl(E) nfs(E) lockd(E) grace(E) sunrpc(E)
>> >> fscache(E) lp(E) parport(E) hpsa(E) pata_acpi(E) hid_generic(E)
>> >> psmouse(E) usbhid(E) bnx2(E) cciss(E) hid(E) pata_amd(E)
>> >> [34152.604004] CPU: 6 PID: 490 Comm: rs:main Q:Reg Tainted: G      D W
>> >>   EL  4.0.0-rc1-custom #1
>> >> [34152.604004] Hardware name: HP ProLiant DL585 G2   , BIOS A07 05/02/2011
>> >> [34152.604004] task: ffff8803eecd9910 ti: ffff8803ecb30000 task.ti:
>> >> ffff8803ecb30000
>> >> [34152.604004] RIP: 0010:[<ffffffff810f1e3a>]  [<ffffffff810f1e3a>]
>> >> smp_call_function_many+0x20a/0x270
>> >> [34152.604004] RSP: 0018:ffff8803ecb33cf8  EFLAGS: 00000202
>> >> [34152.604004] RAX: 0000000000000000 RBX: ffffffff81cdd140 RCX: 
>> >> ffff8803ffc19700
>> >> [34152.604004] RDX: 0000000000000000 RSI: 0000000000000100 RDI: 
>> >> 0000000000000000
>> >> [34152.604004] RBP: ffff8803ecb33d38 R08: ffff8803ffd961c8 R09: 
>> >> 0000000000000004
>> >> [34152.604004] R10: 0000000000000004 R11: 0000000000000246 R12: 
>> >> 0000000000000000
>> >> [34152.604004] R13: ffff880300000040 R14: ffff8803ecb33ca0 R15: 
>> >> ffff8803ecb33ca8
>> >> [34152.604004] FS:  00007f9cf6dae700(0000) GS:ffff8803ffd80000(0000)
>> >> knlGS:0000000000000000
>> >> [34152.672920] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> >> [34152.672920] CR2: 00007f9ce80091a8 CR3: 00000003e50fe000 CR4: 
>> >> 00000000000007e0
>> >> [34152.672920] Stack:
>> >> [34152.672920]  00000001ecb33de8 0000000000016180 00007f9cf0024fff
>> >> ffff8803eb726900
>> >> [34152.672920]  ffff8803eb726bd0 00007f9cf0025000 00007f9cf0021000
>> >> 0000000000000004
>> >> [34152.672920]  ffff8803ecb33d68 ffffffff8106722e ffff8803ecb33d68
>> >> ffff8803eb726900
>> >> [34152.672920] Call Trace:
>> >> [34152.672920]  [<ffffffff8106722e>] native_flush_tlb_others+0x2e/0x30
>> >> [34152.672920]  [<ffffffff81067354>] flush_tlb_mm_range+0x64/0x170
>> >> [34152.672920]  [<ffffffff8119e66e>] tlb_flush_mmu_tlbonly+0x7e/0xe0
>> >> [34152.672920]  [<ffffffff8119eed4>] tlb_finish_mmu+0x14/0x50
>> >> [34152.672920]  [<ffffffff811a0cea>] zap_page_range+0xca/0x100
>> >> [34152.672920]  [<ffffffff811b3993>] SyS_madvise+0x363/0x790
>> >> [34152.672920]  [<ffffffff817905ed>] system_call_fastpath+0x16/0x1b
>> >> [34152.672920] Code: 9d 5c 2b 00 3b 05 7b b2 c2 00 89 c2 0f 8d 83 fe
>> >> ff ff 48 98 49 8b 4d 00 48 03 0c c5 40 b1 d1 81 f6 41 18 01 74 cb
>> >>  0f 1f 00 f3 90 <f6> 41 18 01 75 f8 eb be 0f b6 4d c4 4c 89 fa 4c 89 f6 
>> >> 44 89 ef
>> >>
>> >> So I'm not totally sure whether this is a btrfs problem, or whether
>> >> something else got broken in 4.0.0-rc1.
>> >>
>> >> Maybe someone can have a look.
>> >>
>> >> If you need more information just let me know.
>> >
>> > Is it reproducible?
>> >
> > From the btrfs stacks, it's stuck at the mount stage, so it's likely
> > to be unrelated to btrfs.
>> >
>> > Thanks,
>> >
>> > -liubo
>> >
>> >>
>> >> Bye,
>> >>    Marcel
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> >> the body of a message to majord...@vger.kernel.org
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html