3.14 stable regression don't remove from shrink list in select_collect()
I recently updated some machines to 3.14.58 and they reliably get soft lockups. Sometimes the soft lockup recovers and sometimes it does not. I've bisected this on the 3.14 stable branch and arrived at:

    c214cb82cdc744225d85899fc138251527f75fff "don't remove from shrink list in select_collect()"

Reverting this commit, plus adding back dentry_lru_del() (which was removed later) on top of 3.14.58, resolves the issue for me. I've included a patch at the bottom with the revert. So far this issue has been easy for me to reproduce, so I'm happy to try other patches for further debugging or testing. I have not yet tried the latest upstream to see if it also has the issue.

Below are the soft lockup messages:

[ 76.423941] BUG: soft lockup - CPU#10 stuck for 23s! [systemd-udevd:3613]
[ 76.538222] Modules linked in: vfat fat usb_storage 8021q mrp garp stp llc dell_rbu mpt2sas raid_class scsi_transport_sas mptctl mptbase sfc_aoe(O) ext4 jbd2 mbcache coretemp crc32c_intel aesni_intel aes_x86_64 ablk_helper cryptd glue_helper sfc(O) ptp pps_core mdio lrw joydev hwmon i2c_algo_bit bnx2 gf128mul ipmi_devintf serio_raw iTCO_wdt ses i2c_core ipmi_si ipmi_msghandler wmi enclosure iTCO_vendor_support microcode pcspkr dcdbas ioatdma ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa dca ib_mad ib_core lpc_ich mfd_core ib_addr nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc ata_generic pata_acpi ata_piix libata megaraid_sas ehci_pci ehci_hcd uhci_hcd ipv6 autofs4
[ 76.538260] CPU: 10 PID: 3613 Comm: systemd-udevd Tainted: G O 3.14.50-00038-gc214cb8 #22
[ 76.538261] Hardware name: Dell Inc. PowerEdge R610/0F0XJ6, BIOS 6.3.0 07/24/2012
[ 76.538262] task: 881806e45280 ti: 8817823d6000 task.ti: 8817823d6000
[ 76.538263] RIP: 0010:[] [] _raw_spin_trylock+0x2c/0x40
[ 76.538270] RSP: 0018:8817823d7b40 EFLAGS: 0246
[ 76.538271] RAX: dd75dd75 RBX: 8817823d7ae8 RCX: dd76dd75
[ 76.538272] RDX: dd75dd75 RSI: RDI: 88178509e638
[ 76.538273] RBP: 8817823d7b40 R08: 88178354a1e0 R09: 000180220001
[ 76.538274] R10: 811c9b96 R11: ea005e0d5280 R12: 880b7f097198
[ 76.538275] R13: 8817774d2c30 R14: 88178354a1e0 R15: 88178354a1e0
[ 76.538276] FS: 7f20b4868880() GS:88180f2a() knlGS:
[ 76.538277] CS: 0010 DS: ES: CR0: 80050033
[ 76.538278] CR2: 7f20b61e6f58 CR3: 00180652f000 CR4: 07e0
[ 76.538279] Stack:
[ 76.538280] 8817823d7b70 8116eff6 880b7ac5ef80 8817823d7bc0
[ 76.538283] 880b7ac5ef00 880b7ac5ef00 8817823d7ba8 8116f459
[ 76.538285] 880b7ac5ef80 8817823d7bc0 880b7ac5ecc0 000c
[ 76.538287] Call Trace:
[ 76.538292] [] dentry_kill+0x36/0x290
[ 76.538294] [] shrink_dentry_list+0x79/0xd0
[ 76.538296] [] check_submounts_and_drop+0x74/0xa0
[ 76.538301] [] kernfs_dop_revalidate+0x5c/0xd0
[ 76.538306] [] lookup_fast+0x26d/0x2c0
[ 76.538307] [] link_path_walk+0x1d9/0x890
[ 76.538311] [] ? kmem_cache_alloc+0x31/0x140
[ 76.538313] [] ? kernfs_name_hash+0x17/0xd0
[ 76.538315] [] ? __mutex_unlock_slowpath+0x16/0x40
[ 76.538317] [] path_lookupat+0x5b/0x770
[ 76.538318] [] ? __d_free+0x35/0x40
[ 76.538320] [] ? dentry_kill+0x215/0x290
[ 76.538321] [] ? kmem_cache_alloc+0x31/0x140
[ 76.538323] [] ? getname_flags+0x2c/0x120
[ 76.538325] [] filename_lookup.isra.50+0x26/0x60
[ 76.538327] [] user_path_at_empty+0x54/0x90
[ 76.538329] [] ? final_putname+0x22/0x50
[ 76.538330] [] ? user_path_at_empty+0x5f/0x90
[ 76.538332] [] user_path_at+0x11/0x20
[ 76.538334] [] vfs_fstatat+0x50/0xa0
[ 76.538336] [] SYSC_newlstat+0x22/0x40
[ 76.538338] [] ? SyS_readlink+0x4c/0x110
[ 76.538339] [] SyS_newlstat+0xe/0x10
[ 76.538343] [] system_call_fastpath+0x16/0x1b
[ 76.538344] Code: 66 66 66 90 55 48 89 e5 8b 17 89 d0 c1 e8 10 66 39 c2 74 0b 31 c0 5d c3 0f 1f 80 00 00 00 00 8d 8a 00 00 01 00 89 d0 f0 0f b1 0f <39> d0 75 e5 b8 01 00 00 00 5d c3 66 0f 1f 84 00 00 00 00 00 66
[ 104.426665] BUG: soft lockup - CPU#10 stuck for 23s! [systemd-udevd:3613]
[ 104.537859] Modules linked in: vfat fat usb_storage 8021q mrp garp stp llc dell_rbu mpt2sas raid_class scsi_transport_sas mptctl mptbase sfc_aoe(O) ext4 jbd2 mbcache coretemp crc32c_intel aesni_intel aes_x86_64 ablk_helper cryptd glue_helper sfc(O) ptp pps_core mdio lrw joydev hwmon i2c_algo_bit bnx2 gf128mul ipmi_devintf serio_raw iTCO_wdt ses i2c_core ipmi_si ipmi_msghandler wmi enclosure iTCO_vendor_support microcode pcspkr dcdbas ioatdma ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa dca ib_mad ib_core lpc_ich mfd_core ib_addr nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc ata_generic pata_acpi ata_piix libata megaraid_sas ehci_pci ehci_hcd uhci_hcd ipv6 autofs4
[ 104.537895] CPU: 10 PID:
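For what it's worth, the shape of the lockup above -- shrink_dentry_list() retrying dentry_kill(), which sits in _raw_spin_trylock -- is a trylock-and-retry loop that has stopped making progress. A toy userspace sketch of just that control-flow shape (this is emphatically not the fs/dcache.c code; every name here is a stand-in):

#include <stdio.h>
#include <pthread.h>

static pthread_mutex_t d_lock = PTHREAD_MUTEX_INITIALIZER;

int main(void)
{
        unsigned long spins = 0;

        /* Simulate a holder that can never release the lock (in the
         * real bug, because progress depends on the spinning CPU
         * itself). */
        pthread_mutex_lock(&d_lock);

        /* The retry loop: trylock fails, so try again, forever.  A
         * loop like this in kernel mode is what the watchdog reports
         * as a soft lockup after ~23 seconds. */
        while (pthread_mutex_trylock(&d_lock) != 0) {
                if (++spins % 100000000UL == 0)
                        printf("still spinning, %lu tries\n", spins);
        }
        return 0;
}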
Re: NFS Freezer and stuck tasks
On Fri, May 01, 2015 at 05:10:34PM -0400, Benjamin Coddington wrote:
> On Fri, 1 May 2015, Benjamin Coddington wrote:
> > On Wed, 4 Mar 2015, Shawn Bohrer wrote:
> > > Hello,
> > >
> > > We're using the Linux cgroup Freezer on some machines that use NFS and
> > > have run into what appears to be a bug where frozen tasks are blocking
> > > running tasks and preventing them from completing. On one of our
> > > machines which happens to be running an older 3.10.46 kernel we have
> > > frozen some of the tasks on the system using the cgroup Freezer. We
> > > also have a separate set of tasks which are NOT frozen which are stuck
> > > trying to open some files on NFS.
> > >
> > > Looking at the frozen tasks there are several that have the following
> > > stack:
> > >
> > > [814fd055] rpc_wait_bit_killable+0x35/0x80
> > > [814fd01d] __rpc_wait_for_completion_task+0x2d/0x30
> > > [811dce5d] nfs4_run_open_task+0x11d/0x170
> > > [811de7a3] _nfs4_open_and_get_state+0x53/0x260
> > > [811e12d1] nfs4_do_open+0x121/0x400
> > > [811e15e1] nfs4_atomic_open+0x31/0x50
> > > [811f02dc] nfs4_file_open+0xac/0x180
> > > [811479be] do_dentry_open.isra.19+0x1ee/0x280
> > > [81147b3e] finish_open+0x1e/0x30
> > > [811578d2] do_last.isra.64+0x2c2/0xc40
> > > [81158519] path_openat.isra.65+0x2c9/0x490
> > > [81158c38] do_filp_open+0x38/0x80
> > > [81148cd4] do_sys_open+0xe4/0x1c0
> > > [81148dce] SyS_open+0x1e/0x20
> > > [8153e719] system_call_fastpath+0x16/0x1b
> > > [] 0x
> > >
> > > Here it looks like we are waiting in a wait queue inside
> > > rpc_wait_bit_killable() for RPC_TASK_ACTIVE.
> > >
> > > And there is a single task with a stack that looks like the following:
> > >
> > > [8107dc05] __refrigerator+0x55/0x150
> > > [814fd086] rpc_wait_bit_killable+0x66/0x80
> > > [814fd01d] __rpc_wait_for_completion_task+0x2d/0x30
> > > [811dce5d] nfs4_run_open_task+0x11d/0x170
> > > [811de7a3] _nfs4_open_and_get_state+0x53/0x260
> > > [811e12d1] nfs4_do_open+0x121/0x400
> > > [811e15e1] nfs4_atomic_open+0x31/0x50
> > > [811f02dc] nfs4_file_open+0xac/0x180
> > > [811479be] do_dentry_open.isra.19+0x1ee/0x280
> > > [81147b3e] finish_open+0x1e/0x30
> > > [811578d2] do_last.isra.64+0x2c2/0xc40
> > > [81158519] path_openat.isra.65+0x2c9/0x490
> > > [81158c38] do_filp_open+0x38/0x80
> > > [81148cd4] do_sys_open+0xe4/0x1c0
> > > [81148dce] SyS_open+0x1e/0x20
> > > [8153e719] system_call_fastpath+0x16/0x1b
> > > [] 0x
> > >
> > > This looks similar, but the different offset into
> > > rpc_wait_bit_killable() shows that we have returned from the schedule()
> > > call in freezable_schedule() and are now blocked in __refrigerator()
> > > inside freezer_count().
> > >
> > > Similarly if you look at the tasks that are NOT frozen but are stuck
> > > opening a NFS file, they also have the following stack showing they are
> > > waiting in the wait queue for RPC_TASK_ACTIVE.
> > >
> > > [814fd055] rpc_wait_bit_killable+0x35/0x80
> > > [814fd01d] __rpc_wait_for_completion_task+0x2d/0x30
> > > [811dce5d] nfs4_run_open_task+0x11d/0x170
> > > [811de7a3] _nfs4_open_and_get_state+0x53/0x260
> > > [811e12d1] nfs4_do_open+0x121/0x400
> > > [811e15e1] nfs4_atomic_open+0x31/0x50
> > > [811f02dc] nfs4_file_open+0xac/0x180
> > > [811479be] do_dentry_open.isra.19+0x1ee/0x280
> > > [81147b3e] finish_open+0x1e/0x30
> > > [811578d2] do_last.isra.64+0x2c2/0xc40
> > > [81158519] path_openat.isra.65+0x2c9/0x490
> > > [81158c38] do_filp_open+0x38/0x80
> > > [81148cd4] do_sys_open+0xe4/0x1c0
> > > [81148dce] SyS_open+0x1e/0x20
> > > [8153e719] system_call_fastpath+0x16/0x1b
> > > [] 0x
> > >
> > > We have hit this a couple of times now and know that if we THAW all of
> > > the frozen tasks the running tasks will unwedge and finish.
> > >
> > > Additionally we have also tried thawing the single task that is frozen
> > > in __refrigerator() inside rpc_wait_bit_killable(). This usually
> > > results in a different frozen task entering the __refrigerator() state
> > > inside rpc_wait_bit_killable(). It looks like each one of those tasks
> > > must wake up another, letting it progress. Again if you thaw enough of
> > > the frozen tasks eventually everything unwedges and everything
> > > completes.
> > >
> > > I've looked through the 3.10 stable patches since 3.10.46 and don't
> > > see anything that looks like it addresses this. Does anyone have any
> > > idea what might be going on here, and what the fix might be?
> > >
> > > Thanks,
> > > Shawn
> >
> > Hi Shawn, just started looking at this myself, and as Frank Sorensen
> > points out in https://bugzilla.redhat.com/show_bug.cgi?id=1209143 the
> > problem is that a task takes the xprt lock and then ends up in the
> > refrigerator, effectively blocking other tasks from proceeding.
>
> Jeff, any suggestions on how to proceed here?

Sorry for the noise, and self
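The two offsets into rpc_wait_bit_killable() make sense next to the code. A rough paraphrase of the 3.10-era net/sunrpc/sched.c and include/linux/freezer.h (details vary by stable release, so treat this as a sketch rather than the literal source):

static int rpc_wait_bit_killable(void *word)
{
        if (fatal_signal_pending(current))
                return -ERESTARTSYS;
        freezable_schedule();   /* +0x35 is in here, still waiting for
                                 * RPC_TASK_ACTIVE to be cleared */
        return 0;
}

static inline void freezable_schedule(void)
{
        freezer_do_not_count(); /* let the freezer skip us while asleep */
        schedule();
        freezer_count();        /* calls try_to_freeze(); a frozen task
                                 * parks in __refrigerator() here (the
                                 * +0x66 offset in the caller) before it
                                 * can wake the next waiter, which is the
                                 * blocking behavior described above */
}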
Re: HugePages_Rsvd leak
On Wed, Apr 08, 2015 at 02:16:05PM -0700, Mike Kravetz wrote:
> On 04/08/2015 09:15 AM, Shawn Bohrer wrote:
> > I've noticed on a number of my systems that after shutting down my
> > application that uses huge pages that I'm left with some pages still
> > in HugePages_Rsvd. It is possible that I still have something using
> > huge pages that I'm not aware of but so far my attempts to find
> > anything using huge pages have failed. I've run some simple tests
> > using map_hugetlb.c from the kernel source and can see that pages that
> > have been reserved but not allocated still show up in
> > /proc//smaps and /proc//numa_maps. Are there any cases
> > where this is not true?
>
> Just a quick question. Are you using hugetlb filesystem(s)?

I can't say for sure that nothing is using hugetlbfs. It is mounted but
as far as I can tell on the affected system(s) it is empty.

[root@dev106 ~]# grep hugetlbfs /proc/mounts
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
[root@dev106 ~]# ls -al /dev/hugepages/
total 0
drwxr-xr-x  2 root root    0 Apr  8 16:22 .
drwxr-xr-x 16 root root 4360 Apr  8 03:53 ..
[root@dev106 ~]# lsof | grep hugepages

> If so, you might want to take a look at files residing in the
> filesystem(s). As an experiment, I had a program do a simple
> mmap() of a file in a hugetlb filesystem. The program just
> created the mapping, and did not actually fault/allocate any
> huge pages. The result was the reservation (HugePages_Rsvd)
> of sufficient huge pages to cover the mapping. When the program
> exited, the reservations remained. If I remove (unlink) the
> file the reservations will be removed.

That makes sense but I don't think it is the issue here.

Thanks,
Shawn
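A hypothetical reproducer for the experiment Mike describes could look like this (the file name under /dev/hugepages is an assumption, and this is a sketch rather than the actual program he ran):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define LENGTH (2UL*1024*1024)  /* one 2 MB huge page */

int main(void)
{
        /* hugetlbfs file: reservations attach to the file, not the task */
        int fd = open("/dev/hugepages/rsvd-test", O_CREAT | O_RDWR, 0600);
        if (fd < 0) {
                perror("open");
                exit(1);
        }
        if (ftruncate(fd, LENGTH) < 0) {
                perror("ftruncate");
                exit(1);
        }
        /* A shared mapping reserves a huge page but never faults it in */
        void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) {
                perror("mmap");
                exit(1);
        }
        /* No write to addr, so HugePages_Rsvd goes up while Free stays
         * put.  Per the description above, the reservation outlives this
         * process and is only dropped when the file is unlinked. */
        munmap(addr, LENGTH);
        close(fd);
        return 0;
}

After it exits, HugePages_Rsvd should remain elevated until the file is removed with rm /dev/hugepages/rsvd-test.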
Re: HugePages_Rsvd leak
On Wed, Apr 08, 2015 at 12:29:03PM -0700, Davidlohr Bueso wrote:
> On Wed, 2015-04-08 at 11:15 -0500, Shawn Bohrer wrote:
> > AnonHugePages:    241664 kB
> > HugePages_Total:     512
> > HugePages_Free:      512
> > HugePages_Rsvd:      384
> > HugePages_Surp:        0
> > Hugepagesize:       2048 kB
> >
> > So here I have 384 pages reserved and I can't find anything that is
> > using them.
>
> The output clearly shows all available hugepages are free. Why are you
> assuming that reserved implies allocated/in use? This is not true,
> please read one of the millions of docs out there -- you can start with:
> https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt

As that fine document states:

    HugePages_Rsvd is short for "reserved," and is the number of huge
    pages for which a commitment to allocate from the pool has been
    made, but no allocation has yet been made. Reserved huge pages
    guarantee that an application will be able to allocate a huge page
    from the pool of huge pages at fault time.

Thus in my example above, while I have 512 pages free, 384 are reserved,
and therefore if a new application comes along it can only reserve/use
the remaining 128 pages. For example:

[scratch]$ grep Huge /proc/meminfo
AnonHugePages:         0 kB
HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
[scratch]$ cat map_hugetlb.c
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

#define LENGTH (2UL*1024*1024)
#define PROTECTION (PROT_READ | PROT_WRITE)
#define ADDR (void *)(0x0UL)
#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB)

int main(void)
{
        void *addr;

        addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, 0, 0);
        if (addr == MAP_FAILED) {
                perror("mmap");
                exit(1);
        }
        getchar();
        munmap(addr, LENGTH);
        return 0;
}
[scratch]$ make map_hugetlb
cc     map_hugetlb.c   -o map_hugetlb
[scratch]$ ./map_hugetlb &
[1] 7359
[1]+  Stopped                 ./map_hugetlb
[scratch]$ grep Huge /proc/meminfo
AnonHugePages:         0 kB
HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        1
HugePages_Surp:        0
Hugepagesize:       2048 kB
[scratch]$ ./map_hugetlb
mmap: Cannot allocate memory

As you can see I still have 1 huge page free, but that one huge page is
reserved by PID 7359. If I then try to run a new map_hugetlb process,
the mmap fails because even though I have 1 page free it is reserved.
Furthermore, we can find that 7359 has that page in the following ways:

[scratch]$ sudo grep "KernelPageSize:.*2048" /proc/*/smaps
/proc/7359/smaps:KernelPageSize:     2048 kB
[scratch]$ sudo grep "VmFlags:.*ht" /proc/*/smaps
/proc/7359/smaps:VmFlags: rd wr mr mw me de ht sd
[scratch]$ sudo grep -w huge /proc/*/numa_maps
/proc/7359/numa_maps:7f323300 default file=/anon_hugepage\040(deleted) huge

Which leads back to my original question. I have machines that have a
non-zero HugePages_Rsvd count but I cannot find any processes that seem
to have those pages reserved using the three methods shown above. Is
there some other way to identify which process has those pages
reserved? Or is there possibly a leak which is failing to decrement the
reserve count?

Thanks,
Shawn
HugePages_Rsvd leak
I've noticed on a number of my systems that after shutting down my
application that uses huge pages, I'm left with some pages still in
HugePages_Rsvd. It is possible that I still have something using huge
pages that I'm not aware of, but so far my attempts to find anything
using huge pages have failed. I've run some simple tests using
map_hugetlb.c from the kernel source and can see that pages that have
been reserved but not allocated still show up in /proc//smaps and
/proc//numa_maps. Are there any cases where this is not true?

[root@dev106 ~]# grep HugePages /proc/meminfo
AnonHugePages:    241664 kB
HugePages_Total:     512
HugePages_Free:      512
HugePages_Rsvd:      384
HugePages_Surp:        0
Hugepagesize:       2048 kB
[root@dev106 ~]# grep "KernelPageSize:.*2048" /proc/*/smaps
[root@dev106 ~]# grep "VmFlags:.*ht" /proc/*/smaps
[root@dev106 ~]# grep huge /proc/*/numa_maps
[root@dev106 ~]# grep Huge /proc/meminfo
AnonHugePages:    241664 kB
HugePages_Total:     512
HugePages_Free:      512
HugePages_Rsvd:      384
HugePages_Surp:        0
Hugepagesize:       2048 kB

So here I have 384 pages reserved and I can't find anything that is
using them. This is on a machine running 3.14.33. I can possibly try
running a newer kernel if there is a belief that this has been fixed.
I'm also happy to provide more information or try some debug patches if
there are ideas on how to track this down. I'm not entirely sure how
hard this is to reproduce, but nearly every machine I've looked at is
in this state, so it must not be too hard.

Thanks,
Shawn
NFS Freezer and stuck tasks
Hello,

We're using the Linux cgroup Freezer on some machines that use NFS and
have run into what appears to be a bug where frozen tasks are blocking
running tasks and preventing them from completing. On one of our
machines, which happens to be running an older 3.10.46 kernel, we have
frozen some of the tasks on the system using the cgroup Freezer. We
also have a separate set of tasks which are NOT frozen which are stuck
trying to open some files on NFS.

Looking at the frozen tasks there are several that have the following
stack:

[814fd055] rpc_wait_bit_killable+0x35/0x80
[814fd01d] __rpc_wait_for_completion_task+0x2d/0x30
[811dce5d] nfs4_run_open_task+0x11d/0x170
[811de7a3] _nfs4_open_and_get_state+0x53/0x260
[811e12d1] nfs4_do_open+0x121/0x400
[811e15e1] nfs4_atomic_open+0x31/0x50
[811f02dc] nfs4_file_open+0xac/0x180
[811479be] do_dentry_open.isra.19+0x1ee/0x280
[81147b3e] finish_open+0x1e/0x30
[811578d2] do_last.isra.64+0x2c2/0xc40
[81158519] path_openat.isra.65+0x2c9/0x490
[81158c38] do_filp_open+0x38/0x80
[81148cd4] do_sys_open+0xe4/0x1c0
[81148dce] SyS_open+0x1e/0x20
[8153e719] system_call_fastpath+0x16/0x1b
[] 0x

Here it looks like we are waiting in a wait queue inside
rpc_wait_bit_killable() for RPC_TASK_ACTIVE.

And there is a single task with a stack that looks like the following:

[8107dc05] __refrigerator+0x55/0x150
[814fd086] rpc_wait_bit_killable+0x66/0x80
[814fd01d] __rpc_wait_for_completion_task+0x2d/0x30
[811dce5d] nfs4_run_open_task+0x11d/0x170
[811de7a3] _nfs4_open_and_get_state+0x53/0x260
[811e12d1] nfs4_do_open+0x121/0x400
[811e15e1] nfs4_atomic_open+0x31/0x50
[811f02dc] nfs4_file_open+0xac/0x180
[811479be] do_dentry_open.isra.19+0x1ee/0x280
[81147b3e] finish_open+0x1e/0x30
[811578d2] do_last.isra.64+0x2c2/0xc40
[81158519] path_openat.isra.65+0x2c9/0x490
[81158c38] do_filp_open+0x38/0x80
[81148cd4] do_sys_open+0xe4/0x1c0
[81148dce] SyS_open+0x1e/0x20
[8153e719] system_call_fastpath+0x16/0x1b
[] 0x

This looks similar, but the different offset into
rpc_wait_bit_killable() shows that we have returned from the schedule()
call in freezable_schedule() and are now blocked in __refrigerator()
inside freezer_count().

Similarly, if you look at the tasks that are NOT frozen but are stuck
opening a NFS file, they also have the following stack showing they are
waiting in the wait queue for RPC_TASK_ACTIVE.

[814fd055] rpc_wait_bit_killable+0x35/0x80
[814fd01d] __rpc_wait_for_completion_task+0x2d/0x30
[811dce5d] nfs4_run_open_task+0x11d/0x170
[811de7a3] _nfs4_open_and_get_state+0x53/0x260
[811e12d1] nfs4_do_open+0x121/0x400
[811e15e1] nfs4_atomic_open+0x31/0x50
[811f02dc] nfs4_file_open+0xac/0x180
[811479be] do_dentry_open.isra.19+0x1ee/0x280
[81147b3e] finish_open+0x1e/0x30
[811578d2] do_last.isra.64+0x2c2/0xc40
[81158519] path_openat.isra.65+0x2c9/0x490
[81158c38] do_filp_open+0x38/0x80
[81148cd4] do_sys_open+0xe4/0x1c0
[81148dce] SyS_open+0x1e/0x20
[8153e719] system_call_fastpath+0x16/0x1b
[] 0x

We have hit this a couple of times now and know that if we THAW all of
the frozen tasks, the running tasks will unwedge and finish.

Additionally, we have also tried thawing the single task that is frozen
in __refrigerator() inside rpc_wait_bit_killable(). This usually
results in a different frozen task entering the __refrigerator() state
inside rpc_wait_bit_killable(). It looks like each one of those tasks
must wake up another, letting it progress. Again, if you thaw enough of
the frozen tasks, eventually everything unwedges and everything
completes.

I've looked through the 3.10 stable patches since 3.10.46 and don't
see anything that looks like it addresses this. Does anyone have any
idea what might be going on here, and what the fix might be?

Thanks,
Shawn
Re: [PATCH v3] ib_umem_release should decrement mm->pinned_vm from ib_umem_get
On Wed, Sep 03, 2014 at 12:13:57PM -0500, Shawn Bohrer wrote:
> From: Shawn Bohrer <sboh...@rgmadvisors.com>
>
> In debugging an application that receives -ENOMEM from ib_reg_mr() I
> found that ib_umem_get() can fail because the pinned_vm count has
> wrapped, causing it to always be larger than the lock limit even with
> RLIMIT_MEMLOCK set to RLIM_INFINITY.
>
> The wrapping of pinned_vm occurs because the process that calls
> ib_reg_mr() will have its mm->pinned_vm count incremented. Later, a
> different process with a different mm_struct than the one that
> allocated the ib_umem struct ends up releasing it, which results in
> decrementing the new process's mm->pinned_vm count past zero and
> wrapping.
>
> I'm not entirely sure what circumstances cause a different process to
> release the ib_umem than the one that allocated it, but the kernel
> stack trace of the freeing process from my situation looks like the
> following:
>
> Call Trace:
> [814d64b1] dump_stack+0x19/0x1b
> [a0b522a5] ib_umem_release+0x1f5/0x200 [ib_core]
> [a0b90681] mlx4_ib_destroy_qp+0x241/0x440 [mlx4_ib]
> [a0b4d93c] ib_destroy_qp+0x12c/0x170 [ib_core]
> [a0cc7129] ib_uverbs_close+0x259/0x4e0 [ib_uverbs]
> [81141cba] __fput+0xba/0x240
> [81141e4e] fput+0xe/0x10
> [81060894] task_work_run+0xc4/0xe0
> [810029e5] do_notify_resume+0x95/0xa0
> [814e3dd0] int_signal+0x12/0x17
>
> The following patch fixes the issue by storing the pid struct of the
> process that calls ib_umem_get() so that ib_umem_release and/or
> ib_umem_account() can properly decrement the pinned_vm count of the
> correct mm_struct.
>
> Signed-off-by: Shawn Bohrer <sboh...@rgmadvisors.com>
> ---
> v3 changes:
> * Fix resource leak with put_task_struct()
> v2 changes:
> * Updated to use get_task_pid to avoid keeping a reference to the mm
>
>  drivers/infiniband/core/umem.c | 19 +++++++++++++------
>  include/rdma/ib_umem.h         |  1 +
>  2 files changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index a3a2e9c..df0c4f6 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -105,6 +105,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>  	umem->length    = size;
>  	umem->offset    = addr & ~PAGE_MASK;
>  	umem->page_size = PAGE_SIZE;
> +	umem->pid       = get_task_pid(current, PIDTYPE_PID);
>  	/*
>  	 * We ask for writable memory if any access flags other than
>  	 * "remote read" are set.  "Local write" and "remote write"
> @@ -198,6 +199,7 @@ out:
>  	if (ret < 0) {
>  		if (need_release)
>  			__ib_umem_release(context->device, umem, 0);
> +		put_pid(umem->pid);
>  		kfree(umem);
>  	} else
>  		current->mm->pinned_vm = locked;
> @@ -230,15 +232,19 @@ void ib_umem_release(struct ib_umem *umem)
>  {
>  	struct ib_ucontext *context = umem->context;
>  	struct mm_struct *mm;
> +	struct task_struct *task;
>  	unsigned long diff;
>
>  	__ib_umem_release(umem->context->device, umem, 1);
>
> -	mm = get_task_mm(current);
> -	if (!mm) {
> -		kfree(umem);
> -		return;
> -	}
> +	task = get_pid_task(umem->pid, PIDTYPE_PID);
> +	put_pid(umem->pid);
> +	if (!task)
> +		goto out;
> +	mm = get_task_mm(task);
> +	put_task_struct(task);
> +	if (!mm)
> +		goto out;
>
>  	diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT;
>
> @@ -262,9 +268,10 @@ void ib_umem_release(struct ib_umem *umem)
>  	} else
>  		down_write(&mm->mmap_sem);
>
> -	current->mm->pinned_vm -= diff;
> +	mm->pinned_vm -= diff;
>  	up_write(&mm->mmap_sem);
>  	mmput(mm);
> +out:
>  	kfree(umem);
>  }
>  EXPORT_SYMBOL(ib_umem_release);
> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> index 1ea0b65..a2bf41e 100644
> --- a/include/rdma/ib_umem.h
> +++ b/include/rdma/ib_umem.h
> @@ -47,6 +47,7 @@ struct ib_umem {
>  	int                     writable;
>  	int                     hugetlb;
>  	struct work_struct      work;
> +	struct pid             *pid;
>  	struct mm_struct       *mm;
>  	unsigned long           diff;
>  	struct sg_table sg_head;
> --
> 1.7.7.6

Hi Roland,

I haven't seen any additional review feedback, and it doesn't appear
that this patch has made its way into any of your infiniband trees yet.
Is there anything holding this up? We've been running this patch on top
of 3.10 since I originally sent this and have not encountered any
issues so far.

--
Shawn
[PATCH v3] ib_umem_release should decrement mm->pinned_vm from ib_umem_get
From: Shawn Bohrer <sboh...@rgmadvisors.com>

In debugging an application that receives -ENOMEM from ib_reg_mr() I
found that ib_umem_get() can fail because the pinned_vm count has
wrapped, causing it to always be larger than the lock limit even with
RLIMIT_MEMLOCK set to RLIM_INFINITY.

The wrapping of pinned_vm occurs because the process that calls
ib_reg_mr() will have its mm->pinned_vm count incremented. Later, a
different process with a different mm_struct than the one that
allocated the ib_umem struct ends up releasing it, which results in
decrementing the new process's mm->pinned_vm count past zero and
wrapping.

I'm not entirely sure what circumstances cause a different process to
release the ib_umem than the one that allocated it, but the kernel
stack trace of the freeing process from my situation looks like the
following:

Call Trace:
[814d64b1] dump_stack+0x19/0x1b
[a0b522a5] ib_umem_release+0x1f5/0x200 [ib_core]
[a0b90681] mlx4_ib_destroy_qp+0x241/0x440 [mlx4_ib]
[a0b4d93c] ib_destroy_qp+0x12c/0x170 [ib_core]
[a0cc7129] ib_uverbs_close+0x259/0x4e0 [ib_uverbs]
[81141cba] __fput+0xba/0x240
[81141e4e] fput+0xe/0x10
[81060894] task_work_run+0xc4/0xe0
[810029e5] do_notify_resume+0x95/0xa0
[814e3dd0] int_signal+0x12/0x17

The following patch fixes the issue by storing the pid struct of the
process that calls ib_umem_get() so that ib_umem_release and/or
ib_umem_account() can properly decrement the pinned_vm count of the
correct mm_struct.

Signed-off-by: Shawn Bohrer <sboh...@rgmadvisors.com>
---
v3 changes:
* Fix resource leak with put_task_struct()
v2 changes:
* Updated to use get_task_pid to avoid keeping a reference to the mm

 drivers/infiniband/core/umem.c | 19 +++++++++++++------
 include/rdma/ib_umem.h         |  1 +
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a3a2e9c..df0c4f6 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -105,6 +105,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	umem->length    = size;
 	umem->offset    = addr & ~PAGE_MASK;
 	umem->page_size = PAGE_SIZE;
+	umem->pid       = get_task_pid(current, PIDTYPE_PID);
 	/*
 	 * We ask for writable memory if any access flags other than
 	 * "remote read" are set.  "Local write" and "remote write"
@@ -198,6 +199,7 @@ out:
 	if (ret < 0) {
 		if (need_release)
 			__ib_umem_release(context->device, umem, 0);
+		put_pid(umem->pid);
 		kfree(umem);
 	} else
 		current->mm->pinned_vm = locked;
@@ -230,15 +232,19 @@ void ib_umem_release(struct ib_umem *umem)
 {
 	struct ib_ucontext *context = umem->context;
 	struct mm_struct *mm;
+	struct task_struct *task;
 	unsigned long diff;

 	__ib_umem_release(umem->context->device, umem, 1);

-	mm = get_task_mm(current);
-	if (!mm) {
-		kfree(umem);
-		return;
-	}
+	task = get_pid_task(umem->pid, PIDTYPE_PID);
+	put_pid(umem->pid);
+	if (!task)
+		goto out;
+	mm = get_task_mm(task);
+	put_task_struct(task);
+	if (!mm)
+		goto out;

 	diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT;

@@ -262,9 +268,10 @@ void ib_umem_release(struct ib_umem *umem)
 	} else
 		down_write(&mm->mmap_sem);

-	current->mm->pinned_vm -= diff;
+	mm->pinned_vm -= diff;
 	up_write(&mm->mmap_sem);
 	mmput(mm);
+out:
 	kfree(umem);
 }
 EXPORT_SYMBOL(ib_umem_release);
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 1ea0b65..a2bf41e 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -47,6 +47,7 @@ struct ib_umem {
 	int                     writable;
 	int                     hugetlb;
 	struct work_struct      work;
+	struct pid             *pid;
 	struct mm_struct       *mm;
 	unsigned long           diff;
 	struct sg_table sg_head;
--
1.7.7.6
Re: [PATCH] ib_umem_release should decrement mm->pinned_vm from ib_umem_get
On Thu, Aug 28, 2014 at 02:48:19PM +0300, Haggai Eran wrote:
> On 26/08/2014 00:07, Shawn Bohrer wrote:
> >>>> The following patch fixes the issue by storing the mm_struct of the
> >> >
> >> > You are doing more than just storing - you are taking a reference
> >> > to the process' mm. This can lead to a massive resource leakage.
> >> > The reason is a bit complex: The destruction flow for IB uverbs is
> >> > based upon releasing the file handle for it. Once the file handle
> >> > is released, all MRs, QPs, CQs, PDs, etc. that the process
> >> > allocated are released. For the kernel to release the file handle,
> >> > the kernel reference count to it needs to reach zero. Most IB
> >> > implementations expose some hardware registers to the application
> >> > by allowing it to mmap the uverbs device file. This mmap takes a
> >> > reference to the uverbs device file handle that the application
> >> > opened. This reference is dropped when the process mm is released
> >> > during the process destruction. Your code takes a reference to the
> >> > mm that will only be released when the parent MR/QP is released.
> >> >
> >> > Now, we have a deadlock - the mm is waiting for the MR to be
> >> > destroyed, the MR is waiting for the file handle to be destroyed,
> >> > and the file handle is waiting for the mm to be destroyed.
> >> >
> >> > The proper solution is to keep a reference to the task_pid (using
> >> > get_task_pid), and use this pid to get the task_struct and from it
> >> > the mm_struct during the destruction flow.
> >
> > I'll put together a patch using get_task_pid() and see if I can
> > test/reproduce the issue. This may take a couple of days since we
> > have to test this in production at the moment.
>
> Hi,
>
> I just wanted to point out that while working on the on demand paging
> patches we also needed to keep a reference to the task pid (to make
> sure we always handle page faults on behalf of the correct mm struct).
> You can find the relevant code in the patch titled "IB/core: Add
> support for on demand paging regions" [1].
>
> Haggai
>
> [1] https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg20552.html

Haggai,

I looked over the on demand paging patch and I'm not sure if you are
suggesting that it already fixes my issue, or that I should use it as a
reference for my code. In any case, I just sent a v2 of a patch that
appears to fix my issue.

--
Shawn
[PATCH v2] ib_umem_release should decrement mm->pinned_vm from ib_umem_get
From: Shawn Bohrer <sboh...@rgmadvisors.com> In debugging an application that receives -ENOMEM from ib_reg_mr() I found that ib_umem_get() can fail because the pinned_vm count has wrapped causing it to always be larger than the lock limit even with RLIMIT_MEMLOCK set to RLIM_INFINITY. The wrapping of pinned_vm occurs because the process that calls ib_reg_mr() will have its mm->pinned_vm count incremented. Later a different process with a different mm_struct than the one that allocated the ib_umem struct ends up releasing it which results in decrementing the new process's mm->pinned_vm count past zero and wrapping. I'm not entirely sure what circumstances cause a different process to release the ib_umem than the one that allocated it but the kernel stack trace of the freeing process from my situation looks like the following: Call Trace: [<ffffffff814d64b1>] dump_stack+0x19/0x1b [<ffffffffa0b522a5>] ib_umem_release+0x1f5/0x200 [ib_core] [<ffffffffa0b90681>] mlx4_ib_destroy_qp+0x241/0x440 [mlx4_ib] [<ffffffffa0b4d93c>] ib_destroy_qp+0x12c/0x170 [ib_core] [<ffffffffa0cc7129>] ib_uverbs_close+0x259/0x4e0 [ib_uverbs] [<ffffffff81141cba>] __fput+0xba/0x240 [<ffffffff81141e4e>] fput+0xe/0x10 [<ffffffff81060894>] task_work_run+0xc4/0xe0 [<ffffffff810029e5>] do_notify_resume+0x95/0xa0 [<ffffffff814e3dd0>] int_signal+0x12/0x17 The following patch fixes the issue by storing the pid struct of the process that calls ib_umem_get() so that ib_umem_release and/or ib_umem_account() can properly decrement the pinned_vm count of the correct mm_struct. Signed-off-by: Shawn Bohrer <sboh...@rgmadvisors.com> --- v2 changes: * Updated to use get_task_pid to avoid keeping a reference to the mm I've run this patch on our test pool for general testing for a few days and today verified that it solves the reported issue above on our production machines. drivers/infiniband/core/umem.c | 18 -- include/rdma/ib_umem.h |1 + 2 files changed, 13 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index a3a2e9c..01750d6 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -105,6 +105,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, umem->length= size; umem->offset= addr & ~PAGE_MASK; umem->page_size = PAGE_SIZE; + umem->pid = get_task_pid(current, PIDTYPE_PID); /* * We ask for writable memory if any access flags other than * "remote read" are set. 
"Local write" and "remote write" @@ -198,6 +199,7 @@ out: if (ret < 0) { if (need_release) __ib_umem_release(context->device, umem, 0); + put_pid(umem->pid); kfree(umem); } else current->mm->pinned_vm = locked; @@ -230,15 +232,18 @@ void ib_umem_release(struct ib_umem *umem) { struct ib_ucontext *context = umem->context; struct mm_struct *mm; + struct task_struct *task; unsigned long diff; __ib_umem_release(umem->context->device, umem, 1); - mm = get_task_mm(current); - if (!mm) { - kfree(umem); - return; - } + task = get_pid_task(umem->pid, PIDTYPE_PID); + put_pid(umem->pid); + if (!task) + goto out; + mm = get_task_mm(task); + if (!mm) + goto out; diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT; @@ -262,9 +267,10 @@ void ib_umem_release(struct ib_umem *umem) } else down_write(>mmap_sem); - current->mm->pinned_vm -= diff; + mm->pinned_vm -= diff; up_write(>mmap_sem); mmput(mm); +out: kfree(umem); } EXPORT_SYMBOL(ib_umem_release); diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h index 1ea0b65..a2bf41e 100644 --- a/include/rdma/ib_umem.h +++ b/include/rdma/ib_umem.h @@ -47,6 +47,7 @@ struct ib_umem { int writable; int hugetlb; struct work_struct work; + struct pid *pid; struct mm_struct *mm; unsigned long diff; struct sg_table sg_head; -- 1.7.7.6 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ib_umem_release should decrement mm->pinned_vm from ib_umem_get
On Thu, Aug 21, 2014 at 11:20:34AM +0000, Shachar Raindel wrote: > Hi, > > I'm afraid this patch, in its current form, will not work. > See below for additional comments. Thanks for the input Shachar. I've tried to answer your questions below. > > > In debugging an application that receives -ENOMEM from ib_reg_mr() I > > > found that ib_umem_get() can fail because the pinned_vm count has > > > wrapped causing it to always be larger than the lock limit even with > > > RLIMIT_MEMLOCK set to RLIM_INFINITY. > > > > > > The wrapping of pinned_vm occurs because the process that calls > > > ib_reg_mr() will have its mm->pinned_vm count incremented. Later a > > > different process with a different mm_struct than the one that allocated > > > the ib_umem struct ends up releasing it which results in decrementing > > > the new process's mm->pinned_vm count past zero and wrapping. > > > > > > I'm not entirely sure what circumstances cause a different process to > > > release the ib_umem than the one that allocated it but the kernel stack > > > trace of the freeing process from my situation looks like the following: > > > > > > Call Trace: > > > [<ffffffff814d64b1>] dump_stack+0x19/0x1b > > > [<ffffffffa0b522a5>] ib_umem_release+0x1f5/0x200 [ib_core] > > > [<ffffffffa0b90681>] mlx4_ib_destroy_qp+0x241/0x440 [mlx4_ib] > > > [<ffffffffa0b4d93c>] ib_destroy_qp+0x12c/0x170 [ib_core] > > > [<ffffffffa0cc7129>] ib_uverbs_close+0x259/0x4e0 [ib_uverbs] > > > [<ffffffff81141cba>] __fput+0xba/0x240 > > > [<ffffffff81141e4e>] fput+0xe/0x10 > > > [<ffffffff81060894>] task_work_run+0xc4/0xe0 > > > [<ffffffff810029e5>] do_notify_resume+0x95/0xa0 > > > [<ffffffff814e3dd0>] int_signal+0x12/0x17 > > > > > Can you provide the details of this issue - kernel version, > reproduction steps, etc.? It seems like the kernel code flow which > triggers this is delaying the FD release done at > http://lxr.free-electrons.com/source/fs/file_table.c#L279 . The > code there seems to have changed (starting at kernel 3.6) to avoid > releasing a file in interrupt context or from a kernel thread. How > are we ending up with releasing the uverbs device file from an > interrupt context or a kernel thread? We are seeing this on 3.10.* kernels. Unfortunately I'm not quite sure what the reproducing steps are, because we can't reliably reproduce it. Or rather we have been able to reliably reproduce the issue in certain production situations, but can't seem to reproduce it outside of production so it seems we are missing something. What I do know is that the issue often occurs when we try to replace a set of processes with a new set of processes. Both process sets will be using RC and UD QPs. When I finally discovered what the issue was, I clearly saw an ib_umem struct allocated in one of the processes that was going away get released in the context of one of the newly started processes. > > > The following patch fixes the issue by storing the mm_struct of the > You are doing more than just storing the mm_struct - you are taking > a reference to the process' mm. This can lead to a massive resource > leakage. The reason is bit complex: The destruction flow for IB > uverbs is based upon releasing the file handle for it. Once the file > handle is released, all MRs, QPs, CQs, PDs, etc. that the process > allocated are released. For the kernel to release the file handle, > the kernel reference count to it needs to reach zero. Most IB > implementations expose some hardware registers to the application by > allowing it to mmap the uverbs device file. This mmap takes a > reference to uverbs device file handle that the application opened. > This reference is dropped when the process mm is released during the > process destruction. 
> Your code takes a reference to the mm that > will only be released when the parent MR/QP is released. > > Now, we have a deadlock - the mm is waiting for the MR to be > destroyed, the MR is waiting for the file handle to be destroyed, > and the file handle is waiting for the mm to be destroyed. > > The proper solution is to keep a reference to the task_pid (using > get_task_pid), and use this pid to get the task_struct and from it > the mm_struct during the destruction flow. I'll put together a patch using get_task_pid() and see if I can test/reproduce the issue. This may take a couple of days since we have to test this in production at the moment. > > > process that calls ib_umem_get() so that ib_umem_release and/or > > > ib_umem_account() can properly decrement the pinned_vm count of the > > > correct mm_struct. > > > > > > Signed-off-by: Shawn Bohrer > > > --- > > > drivers/infiniband/core/umem.c | 17 - > > > 1 files changed, 8 insertions(+), 9 deletions(-)
Re: [PATCH] ib_umem_release should decrement mm->pinned_vm from ib_umem_get
On Tue, Aug 12, 2014 at 11:27:35AM -0500, Shawn Bohrer wrote: > From: Shawn Bohrer <sboh...@rgmadvisors.com> > > In debugging an application that receives -ENOMEM from ib_reg_mr() I > found that ib_umem_get() can fail because the pinned_vm count has > wrapped causing it to always be larger than the lock limit even with > RLIMIT_MEMLOCK set to RLIM_INFINITY. > > The wrapping of pinned_vm occurs because the process that calls > ib_reg_mr() will have its mm->pinned_vm count incremented. Later a > different process with a different mm_struct than the one that allocated > the ib_umem struct ends up releasing it which results in decrementing > the new process's mm->pinned_vm count past zero and wrapping. > > I'm not entirely sure what circumstances cause a different process to > release the ib_umem than the one that allocated it but the kernel stack > trace of the freeing process from my situation looks like the following: > > Call Trace: > [<ffffffff814d64b1>] dump_stack+0x19/0x1b > [<ffffffffa0b522a5>] ib_umem_release+0x1f5/0x200 [ib_core] > [<ffffffffa0b90681>] mlx4_ib_destroy_qp+0x241/0x440 [mlx4_ib] > [<ffffffffa0b4d93c>] ib_destroy_qp+0x12c/0x170 [ib_core] > [<ffffffffa0cc7129>] ib_uverbs_close+0x259/0x4e0 [ib_uverbs] > [<ffffffff81141cba>] __fput+0xba/0x240 > [<ffffffff81141e4e>] fput+0xe/0x10 > [<ffffffff81060894>] task_work_run+0xc4/0xe0 > [<ffffffff810029e5>] do_notify_resume+0x95/0xa0 > [<ffffffff814e3dd0>] int_signal+0x12/0x17 > > The following patch fixes the issue by storing the mm_struct of the > process that calls ib_umem_get() so that ib_umem_release and/or > ib_umem_account() can properly decrement the pinned_vm count of the > correct mm_struct. > > Signed-off-by: Shawn Bohrer <sboh...@rgmadvisors.com> > --- > drivers/infiniband/core/umem.c | 17 - > 1 files changed, 8 insertions(+), 9 deletions(-) > > diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c > index a3a2e9c..32699024 100644 > --- a/drivers/infiniband/core/umem.c > +++ b/drivers/infiniband/core/umem.c > @@ -105,6 +105,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, > umem->length= size; > umem->offset= addr & ~PAGE_MASK; > umem->page_size = PAGE_SIZE; > + umem->mm= get_task_mm(current); > /* > * We ask for writable memory if any access flags other than > * "remote read" are set. "Local write" and "remote write" > @@ -198,6 +199,7 @@ out: > if (ret < 0) { > if (need_release) > __ib_umem_release(context->device, umem, 0); > + mmput(umem->mm); > kfree(umem); > } else > current->mm->pinned_vm = locked; > @@ -229,13 +231,11 @@ static void ib_umem_account(struct work_struct *work) > void ib_umem_release(struct ib_umem *umem) > { > struct ib_ucontext *context = umem->context; > - struct mm_struct *mm; > unsigned long diff; > > __ib_umem_release(umem->context->device, umem, 1); > > - mm = get_task_mm(current); > - if (!mm) { > + if (!umem->mm) { > kfree(umem); > return; > } > @@ -251,20 +251,19 @@ void ib_umem_release(struct ib_umem *umem) > * we defer the vm_locked accounting to the system workqueue. > */ > if (context->closing) { > - if (!down_write_trylock(&mm->mmap_sem)) { > + if (!down_write_trylock(&umem->mm->mmap_sem)) { > INIT_WORK(&umem->work, ib_umem_account); > - umem->mm = mm; > umem->diff = diff; > > queue_work(ib_wq, &umem->work); > return; > } > } else > - down_write(&mm->mmap_sem); > + down_write(&umem->mm->mmap_sem); > > - current->mm->pinned_vm -= diff; > - up_write(&mm->mmap_sem); > - mmput(mm); > + umem->mm->pinned_vm -= diff; > + up_write(&umem->mm->mmap_sem); > + mmput(umem->mm); > kfree(umem); > } > EXPORT_SYMBOL(ib_umem_release); It doesn't look like this has been applied yet. Does anyone have any feedback? 
Thanks, Shawn
[PATCH] ib_umem_release should decrement mm->pinned_vm from ib_umem_get
From: Shawn Bohrer <sboh...@rgmadvisors.com> In debugging an application that receives -ENOMEM from ib_reg_mr() I found that ib_umem_get() can fail because the pinned_vm count has wrapped causing it to always be larger than the lock limit even with RLIMIT_MEMLOCK set to RLIM_INFINITY. The wrapping of pinned_vm occurs because the process that calls ib_reg_mr() will have its mm->pinned_vm count incremented. Later a different process with a different mm_struct than the one that allocated the ib_umem struct ends up releasing it which results in decrementing the new process's mm->pinned_vm count past zero and wrapping. I'm not entirely sure what circumstances cause a different process to release the ib_umem than the one that allocated it but the kernel stack trace of the freeing process from my situation looks like the following: Call Trace: [<ffffffff814d64b1>] dump_stack+0x19/0x1b [<ffffffffa0b522a5>] ib_umem_release+0x1f5/0x200 [ib_core] [<ffffffffa0b90681>] mlx4_ib_destroy_qp+0x241/0x440 [mlx4_ib] [<ffffffffa0b4d93c>] ib_destroy_qp+0x12c/0x170 [ib_core] [<ffffffffa0cc7129>] ib_uverbs_close+0x259/0x4e0 [ib_uverbs] [<ffffffff81141cba>] __fput+0xba/0x240 [<ffffffff81141e4e>] fput+0xe/0x10 [<ffffffff81060894>] task_work_run+0xc4/0xe0 [<ffffffff810029e5>] do_notify_resume+0x95/0xa0 [<ffffffff814e3dd0>] int_signal+0x12/0x17 The following patch fixes the issue by storing the mm_struct of the process that calls ib_umem_get() so that ib_umem_release and/or ib_umem_account() can properly decrement the pinned_vm count of the correct mm_struct. Signed-off-by: Shawn Bohrer <sboh...@rgmadvisors.com> --- drivers/infiniband/core/umem.c | 17 - 1 files changed, 8 insertions(+), 9 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index a3a2e9c..32699024 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -105,6 +105,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, umem->length= size; umem->offset= addr & ~PAGE_MASK; umem->page_size = PAGE_SIZE; + umem->mm= get_task_mm(current); /* * We ask for writable memory if any access flags other than * "remote read" are set. "Local write" and "remote write" @@ -198,6 +199,7 @@ out: if (ret < 0) { if (need_release) __ib_umem_release(context->device, umem, 0); + mmput(umem->mm); kfree(umem); } else current->mm->pinned_vm = locked; @@ -229,13 +231,11 @@ static void ib_umem_account(struct work_struct *work) void ib_umem_release(struct ib_umem *umem) { struct ib_ucontext *context = umem->context; - struct mm_struct *mm; unsigned long diff; __ib_umem_release(umem->context->device, umem, 1); - mm = get_task_mm(current); - if (!mm) { + if (!umem->mm) { kfree(umem); return; } @@ -251,20 +251,19 @@ void ib_umem_release(struct ib_umem *umem) * we defer the vm_locked accounting to the system workqueue. */ if (context->closing) { - if (!down_write_trylock(&mm->mmap_sem)) { + if (!down_write_trylock(&umem->mm->mmap_sem)) { INIT_WORK(&umem->work, ib_umem_account); - umem->mm = mm; umem->diff = diff; queue_work(ib_wq, &umem->work); return; } } else - down_write(&mm->mmap_sem); + down_write(&umem->mm->mmap_sem); - current->mm->pinned_vm -= diff; - up_write(&mm->mmap_sem); - mmput(mm); + umem->mm->pinned_vm -= diff; + up_write(&umem->mm->mmap_sem); + mmput(umem->mm); kfree(umem); } EXPORT_SYMBOL(ib_umem_release); -- 1.7.7.6
Re: 3.10.16 cgroup_mutex deadlock
On Tue, Nov 19, 2013 at 10:55:18AM +0800, Li Zefan wrote: > > Thanks Tejun and Hugh. Sorry for my late entry in getting around to > > testing this fix. On the surface it sounds correct however I'd like to > > test this on top of 3.10.* since that is what we'll likely be running. > > I've tried to apply Hugh's patch above on top of 3.10.19 but it > > appears there are a number of conflicts. Looking over the changes and > > my understanding of the problem I believe on 3.10 only the > > cgroup_free_fn needs to be run in a separate workqueue. Below is the > > patch I've applied on top of 3.10.19, which I'm about to start > > testing. If it looks like I botched the backport in any way please > > let me know so I can test a proper fix on top of 3.10.19. > > > > You didn't move css free_work to the dedicated wq as Tejun's patch does. > css free_work won't acquire cgroup_mutex, but when destroying a lot of > cgroups, we can have a lot of css free_work in the workqueue, so I'd > suggest you also use cgroup_destroy_wq for it. Well, I didn't move the css free_work, but I did test the patch I posted on top of 3.10.19 and I am unable to reproduce the lockup so it appears my patch was sufficient for 3.10.*. Hopefully we can get this fix applied and backported into stable. Thanks, Shawn
Re: 3.10.16 cgroup_mutex deadlock
On Sun, Nov 17, 2013 at 06:17:17PM -0800, Hugh Dickins wrote: > Sorry for the delay: I was on the point of reporting success last > night, when I tried a debug kernel: and that didn't work so well > (got spinlock bad magic report in pwq_adjust_max_active(), and > tests wouldn't run at all). > > Even the non-early cgroup_init() is called well before the > early_initcall init_workqueues(): though only the debug (lockdep > and spinlock debug) kernel appeared to have a problem with that. > > Here's the patch I ended up with successfully on a 3.11.7-based > kernel (though below I've rediffed it against 3.11.8): the > schedule_work->queue_work hunks are slightly different on 3.11 > than in your patch against current, and I did alloc_workqueue() > from a separate core_initcall. > > The interval between cgroup_init and that is a bit of a worry; > but we don't seem to have suffered from the interval between > cgroup_init and init_workqueues before (when system_wq is NULL) > - though you may have more courage than I to reorder them! > > Initially I backed out my system_highpri_wq workaround, and > verified that it was still easy to reproduce the problem with > one of our cgroup stresstests. Yes it was, then your modified > patch below convincingly fixed it. > > I ran with Johannes's patch adding extra mem_cgroup_reparent_charges: > as I'd expected, that didn't solve this issue (though it's worth > our keeping it in to rule out another source of problems). And I > checked back on dumps of failures: they indeed show the tell-tale > 256 kworkers doing cgroup_offline_fn, just as you predicted. > > Thanks! > Hugh > > --- > kernel/cgroup.c | 30 +++--- > 1 file changed, 27 insertions(+), 3 deletions(-) > > --- 3.11.8/kernel/cgroup.c 2013-11-17 17:40:54.200640692 -0800 > +++ linux/kernel/cgroup.c 2013-11-17 17:43:10.876643941 -0800 > @@ -89,6 +89,14 @@ static DEFINE_MUTEX(cgroup_mutex); > static DEFINE_MUTEX(cgroup_root_mutex); > > /* > + * cgroup destruction makes heavy use of work items and there can be a lot > + * of concurrent destructions. Use a separate workqueue so that cgroup > + * destruction work items don't end up filling up max_active of system_wq > + * which may lead to deadlock. > + */ > +static struct workqueue_struct *cgroup_destroy_wq; > + > +/* > * Generate an array of cgroup subsystem pointers. At boot time, this is > * populated with the built in subsystems, and modular subsystems are > * registered after that. 
> The mutable section of this array is protected by > @@ -890,7 +898,7 @@ static void cgroup_free_rcu(struct rcu_h > struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head); > > INIT_WORK(&cgrp->destroy_work, cgroup_free_fn); > - schedule_work(&cgrp->destroy_work); > + queue_work(cgroup_destroy_wq, &cgrp->destroy_work); > } > > static void cgroup_diput(struct dentry *dentry, struct inode *inode) > @@ -4205,7 +4213,7 @@ static void css_release(struct percpu_re > struct cgroup_subsys_state *css = > container_of(ref, struct cgroup_subsys_state, refcnt); > > - schedule_work(&css->dput_work); > + queue_work(cgroup_destroy_wq, &css->dput_work); > } > > static void init_cgroup_css(struct cgroup_subsys_state *css, > @@ -4439,7 +4447,7 @@ static void cgroup_css_killed(struct cgr > > /* percpu ref's of all css's are killed, kick off the next step */ > INIT_WORK(&cgrp->destroy_work, cgroup_offline_fn); > - schedule_work(&cgrp->destroy_work); > + queue_work(cgroup_destroy_wq, &cgrp->destroy_work); > } > > static void css_ref_killed_fn(struct percpu_ref *ref) > @@ -4967,6 +4975,22 @@ out: > return err; > } > > +static int __init cgroup_destroy_wq_init(void) > +{ > + /* > + * There isn't much point in executing destruction path in > + * parallel. Good chunk is serialized with cgroup_mutex anyway. > + * Use 1 for @max_active. > + * > + * We would prefer to do this in cgroup_init() above, but that > + * is called before init_workqueues(): so leave this until after. > + */ > + cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1); > + BUG_ON(!cgroup_destroy_wq); > + return 0; > +} > +core_initcall(cgroup_destroy_wq_init); > + > /* > * proc_cgroup_show() > * - Print task's cgroup paths into seq_file, one line for each hierarchy Thanks Tejun and Hugh. Sorry for my late entry in getting around to testing this fix. On the surface it sounds correct however I'd like to test this on top of 3.10.* since that is what we'll likely be running. I've tried to apply Hugh's patch above on top of 3.10.19 but it appears there are a number of conflicts. Looking over the changes and my understanding of the problem I believe on 3.10 only the cgroup_free_fn needs to be run in a separate workqueue. Below is the patch I've applied on top of 3.10.19, which I'm about to start testing. If it looks like I botched the backport in any way please let me know so I can test a proper fix on top of 3.10.19. --- kernel/cgroup.c | 28
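[Shawn's 3.10.19 backport is truncated in this archive (only the start of the diffstat survives), but based on his description, moving only cgroup_free_fn off system_wq, it plausibly reduced to the sketch below, adapted from the corresponding hunks of Hugh's 3.11 patch above. This is a reconstruction under those assumptions, not the actual posted patch.]

static struct workqueue_struct *cgroup_destroy_wq;

static void cgroup_free_fn(struct work_struct *work);	/* as in 3.10 */

static void cgroup_free_rcu(struct rcu_head *head)
{
	struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);

	/* was: schedule_work(&cgrp->destroy_work); */
	INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);
	queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
}

static int __init cgroup_destroy_wq_init(void)
{
	/*
	 * Destruction is serialized on cgroup_mutex anyway, so max_active
	 * of 1 is enough; done from a core_initcall because cgroup_init()
	 * runs before init_workqueues().
	 */
	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
	BUG_ON(!cgroup_destroy_wq);
	return 0;
}
core_initcall(cgroup_destroy_wq_init);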
Re: 3.10.16 cgroup_mutex deadlock
On Tue, Nov 12, 2013 at 05:55:04PM +0100, Michal Hocko wrote: > On Tue 12-11-13 09:55:30, Shawn Bohrer wrote: > > On Tue, Nov 12, 2013 at 03:31:47PM +0100, Michal Hocko wrote: > > > On Tue 12-11-13 18:17:20, Li Zefan wrote: > > > > Cc more people > > > > > > > > On 2013/11/12 6:06, Shawn Bohrer wrote: > > > > > Hello, > > > > > > > > > > This morning I had a machine running 3.10.16 go unresponsive but > > > > > before we killed it we were able to get the information below. I'm > > > > > not an expert here but it looks like most of the tasks below are > > > > > blocking waiting on the cgroup_mutex. You can see that the > > > > > resource_alloca:16502 task is holding the cgroup_mutex and that task > > > > > appears to be waiting on a lru_add_drain_all() to complete. > > > > > > Do you have sysrq+l output as well by any chance? That would tell > > > us what the current CPUs are doing. Dumping all kworker stacks > > > might be helpful as well. We know that lru_add_drain_all waits for > > > schedule_on_each_cpu to return so it is waiting for workers to finish. > > > I would be really curious why some of lru_add_drain_cpu cannot finish > > > properly. The only reason would be that some work item(s) do not get CPU > > > or somebody is holding lru_lock. > > > > In fact the sys-admin did manage to fire off a sysrq+l, I've put all > > of the info from the syslog below. I've looked it over and I'm not > > sure it reveals anything. First looking at the timestamps it appears > > we ran the sysrq+l 19.2 hours after the cgroup_mutex lockup I > > previously sent. > > I would expect sysrq+w would still show those kworkers blocked on the > same cgroup mutex? Yes, I believe so. > > I also have atop logs over that whole time period > > that show hundreds of zombie processes which to me indicates that over > > that 19.2 hours systemd remained wedged on the cgroup_mutex. Looking > > at the backtraces from the sysrq+l it appears most of the CPUs were > > idle > > Right so either we managed to sleep with the lru_lock held which sounds > a bit improbable - but who knows - or there is some other problem. I > would expect the later to be true. > > lru_add_drain executes per-cpu and preemption disabled this means that > its work item cannot be preempted so the only logical explanation seems > to be that the work item has never got scheduled. Meaning you think there would be no kworker thread for the lru_add_drain at this point? If so you might be correct. > OK. In case the issue happens again. It would be very helpful to get the > kworker and per-cpu stacks. Maybe Tejun can help with some waitqueue > debugging tricks. I set up one of my test pools with two scripts trying to reproduce the problem. One essentially puts tasks into several cpuset groups that have cpuset.memory_migrate set, then takes them back out. It also occasionally switches cpuset.mems in those groups to try to keep the memory of those tasks migrating between nodes. The second script is:

$ cat /home/hbi/cgroup_mutex_cgroup_maker.sh
#!/bin/bash

session_group=$(ps -o pid,cmd,cgroup -p $$ | grep -E 'c[0-9]+' -o)
cd /sys/fs/cgroup/systemd/user/hbi/${session_group}
pwd

while true; do
    for x in $(seq 1 1000); do
        mkdir $x
        echo $$ > ${x}/tasks
        echo $$ > tasks
        rmdir $x
    done
    sleep .1
    date
done

After running both concurrently on 40 machines for about 12 hours I've managed to reproduce the issue at least once, possibly more. One machine looked identical to this reported issue.
It has a bunch of stuck cgroup_free_fn() kworker threads and one thread in cpuset_attach waiting on lru_add_drain_all(). A sysrq+l shows all CPUs are idle except for the one triggering the sysrq+l. The sysrq+w unfortunately wrapped dmesg so we didn't get the stacks of all blocked tasks. We did however also cat /proc/<pid>/stack of all kworker threads on the system. There were 265 kworker threads that all have the following stack: [kworker/2:1] [<ffffffff810930ec>] cgroup_free_fn+0x2c/0x120 [<ffffffff81057c54>] process_one_work+0x174/0x490 [<ffffffff81058d0c>] worker_thread+0x11c/0x370 [<ffffffff8105f0b0>] kthread+0xc0/0xd0 [<ffffffff814c20dc>] ret_from_fork+0x7c/0xb0 [<ffffffffffffffff>] 0xffffffffffffffff And there were another 101 that had stacks like the following: [kworker/0:0] [<ffffffff81058daf>] worker_thread+0x1bf/0x370 [<ffffffff8105f0b0>] kthread+0xc0/0xd0 [<ffffffff814c20dc>] ret_from_fork+0x7c/0xb0 [<ffffffffffffffff>] 0xffffffffffffffff That's it. Again I'm not sure if that is helpful at all but it seems to imply that the lru_add_drain_work was not scheduled. I also managed to kill another two machines running my test. One of them we didn't get anything out of, and the other looks like I deadlocked on the css_set_lock lock. I'll follow up with the css_set_lock deadlock in another email since it doesn't look related to this one. But it does seem that I can probably reproduce this if anyone has some debugging ideas. -- Shawn
3.10.16 cgroup_mutex deadlock
Hello, This morning I had a machine running 3.10.16 go unresponsive but before we killed it we were able to get the information below. I'm not an expert here but it looks like most of the tasks below are blocking waiting on the cgroup_mutex. You can see that the resource_alloca:16502 task is holding the cgroup_mutex and that task appears to be waiting on a lru_add_drain_all() to complete. Initially I thought the deadlock might simply be that the per cpu workqueue work from lru_add_drain_all() is stuck waiting on the cgroup_free_fn to complete. However I've read Documentation/workqueue.txt and it sounds like the current workqueue has multiple kworker threads per cpu and thus this should not happen. Both the cgroup_free_fn work and lru_add_drain_all() work run on the system_wq which has max_active set to 0 so I believe multiple kworker threads should run. This also appears to be true since all of the cgroup_free_fn are running on kworker/12 thread and there are multiple blocked. Perhaps someone with more experience in the cgroup and workqueue code can look at the stacks below and identify the problem, or explain why the lru_add_drain_all() work has not completed: [694702.013850] INFO: task systemd:1 blocked for more than 120 seconds. [694702.015794] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [694702.018217] systemd D ffffffff81607820 0 1 0 0x00000000 [694702.020505] 88041dcc1d78 0086 88041dc7f100 8110ad54 [694702.023006] 0001 88041dc78000 88041dcc1fd8 88041dcc1fd8 [694702.025508] 88041dcc1fd8 88041dc78000 88041a1e8698 81a417c0 [694702.028011] Call Trace: [694702.028788] [<ffffffff8110ad54>] ? vma_merge+0x124/0x330 [694702.030468] [<ffffffff814b8eb9>] schedule+0x29/0x70 [694702.032011] [<ffffffff814b918e>] schedule_preempt_disabled+0xe/0x10 [694702.033982] [<ffffffff814b75b2>] __mutex_lock_slowpath+0x112/0x1b0 [694702.035926] [<ffffffff8112a2bd>] ? kmem_cache_alloc_trace+0x12d/0x160 [694702.037948] [<ffffffff814b742a>] mutex_lock+0x2a/0x50 [694702.039546] [<ffffffff81095b77>] proc_cgroup_show+0x67/0x1d0 [694702.041330] [<ffffffff8115925b>] seq_read+0x16b/0x3e0 [694702.042927] [<ffffffff811383d0>] vfs_read+0xb0/0x180 [694702.044498] [<ffffffff81138652>] SyS_read+0x52/0xa0 [694702.046042] [<ffffffff814c2182>] system_call_fastpath+0x16/0x1b [694702.047917] INFO: task kworker/12:1:203 blocked for more than 120 seconds. [694702.050044] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [694702.052467] kworker/12:1 D 0 203 2 0x00000000 [694702.054756] Workqueue: events cgroup_free_fn [694702.056139] 88041bc1fcf8 0046 88038e7b46a0 00030001 [694702.058642] 88041bc1fd84 88041da6e9f0 88041bc1ffd8 88041bc1ffd8 [694702.061144] 88041bc1ffd8 88041da6e9f0 0087 81a417c0 [694702.063647] Call Trace: [694702.064423] [<ffffffff814b8eb9>] schedule+0x29/0x70 [694702.065966] [<ffffffff814b918e>] schedule_preempt_disabled+0xe/0x10 [694702.067936] [<ffffffff814b75b2>] __mutex_lock_slowpath+0x112/0x1b0 [694702.069879] [<ffffffff814b742a>] mutex_lock+0x2a/0x50 [694702.071476] [<ffffffff810930ec>] cgroup_free_fn+0x2c/0x120 [694702.073209] [<ffffffff81057c54>] process_one_work+0x174/0x490 [694702.075019] [<ffffffff81058d0c>] worker_thread+0x11c/0x370 [694702.076748] [<ffffffff81058bf0>] ? manage_workers+0x2c0/0x2c0 [694702.078560] [<ffffffff8105f0b0>] kthread+0xc0/0xd0 [694702.080078] [<ffffffff8105eff0>] ? flush_kthread_worker+0xb0/0xb0 [694702.081995] [<ffffffff814c20dc>] ret_from_fork+0x7c/0xb0 [694702.083671] [<ffffffff8105eff0>] ? flush_kthread_worker+0xb0/0xb0 [694702.085595] INFO: task systemd-logind:2885 blocked for more than 120 seconds. [694702.087801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[694702.090225] systemd-logind D 81607820 0 2885 1 0x [694702.092513] 88041ac6fd88 0082 88041dd8aa60 88041d9bc1a8 [694702.095014] 88041ac6fda0 88041cac9530 88041ac6ffd8 88041ac6ffd8 [694702.097517] 88041ac6ffd8 88041cac9530 0c36 81a417c0 [694702.100019] Call Trace: [694702.100793] [] schedule+0x29/0x70 [694702.102338] [] schedule_preempt_disabled+0xe/0x10 [694702.104309] [] __mutex_lock_slowpath+0x112/0x1b0 [694702.198316] [] mutex_lock+0x2a/0x50 [694702.292456] [] cgroup_lock_live_group+0x1d/0x40 [694702.386833] [] cgroup_mkdir+0xa8/0x4b0 [694702.480679] [] vfs_mkdir+0x84/0xd0 [694702.574124] [] SyS_mkdirat+0x5e/0xe0 [694702.666986] [] SyS_mkdir+0x19/0x20 [694702.758969] [] system_call_fastpath+0x16/0x1b [694702.848295] INFO: task kworker/12:2:11512 blocked for more than 120 seconds. [694702.935749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [694703.023603] kworker/12:2D 816079c0 0 11512 2 0x [694703.109993] Workqueue: events cgroup_free_fn [694703.193213] 88041b9dfcf8 0046 88041da6e9f0 ea00106fd240 [694703.278353] 88041f803c00 8803824254c0 88041b9dffd8 88041b9dffd8 [694703.363757] 88041b9dffd8 8803824254c0 001f17887bb1 81a417c0 [694703.448550] Call Trace: [694703.531773] []
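When a machine is wedged like this, the same blocked-task dumps can also be produced on demand with SysRq 'w' rather than waiting for the 120 second hung-task watchdog. A sketch, assuming SysRq is enabled on the box:

    # Dump all tasks stuck in uninterruptible (D) sleep to the kernel log
    echo 1 > /proc/sys/kernel/sysrq
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 200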
3.10.16 general protection fault kmem_cache_alloc+0x67/0x170
I had a machine crash this weekend running a 3.10.16 kernel that additionally has a few backported networking patches for performance improvements. At this point I can't rule out that the bug is from those patches, and I haven't yet tried to see if I can reproduce the crash. I did happen to have kdump configured, so I've got a crash dump that I've been poking at, but I'm not an expert here, so hopefully someone can provide some guidance on what I'm looking at and/or where the bug might be. Below is the more detailed info with some of my comments interspersed. If anyone has any questions or suggestions I'd appreciate it.

[1448642.601229] general protection fault: [#1] SMP
[1448642.602448] Modules linked in: mpt2sas scsi_transport_sas raid_class mptctl mptbase dell_rbu ipmi_devintf ipmi_si ipmi_msghandler lockd 8021q mrp garp stp llc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr iw_cxgb3 mlx4_ib ib_sa ib_mad ib_core mlx4_en ext4 jbd2 mbcache joydev fuse ses bnx2 coretemp mlx4_core cxgb3 hwmon mdio enclosure iTCO_wdt iTCO_vendor_support freq_table mperf wmi ehci_pci ehci_hcd dcdbas serio_raw microcode lpc_ich mfd_core sunrpc ipv6 autofs4 crc32c_intel megaraid_sas uhci_hcd dm_mirror dm_region_hash dm_log dm_mod
[1448642.616810] CPU: 11 PID: 27807 Comm: primary_nic_is_ Not tainted 3.10.16-1.rgm.fc16.x86_64 #1
[1448642.618941] Hardware name: Dell Inc. PowerEdge R610/0XDN97, BIOS 6.3.0 07/24/2012
[1448642.620639] task: 8806628c3880 ti: 88060437 task.ti: 88060437
[1448642.622335] RIP: 0010:[] [] kmem_cache_alloc+0x67/0x170
[1448642.624286] RSP: 0018:880604371d70 EFLAGS: 00010282
[1448642.625500] RAX: RBX: 8806628c3880 RCX: 7ecb996a
[1448642.627415] RDX: 7ecb9969 RSI: 00d0 RDI: 00015900
[1448642.629077] RBP: 880604371dc0 R08: 880667d55900 R09:
[1448642.630697] R10: R11: 00015ea8 R12: 880c67003800
[1448642.632316] R13: d17b94d6641aebfb R14: 81064d68 R15: 00d0
[1448642.633936] FS: 7f8018827700() GS:880667d4() knlGS:
[1448642.635768] CS: 0010 DS: ES: CR0: 8005003b
[1448642.637497] CR2: 006eded4 CR3: 00066368b000 CR4: 07e0
[1448642.639230] DR0: DR1: DR2:
[1448642.640849] DR3: DR6: 0ff0 DR7: 0400
[1448642.642468] Stack:
[1448642.642950] ff9c 8806628c3880 8806628c3880 0002
[1448642.644780] 8806628c3880 8806628c3880 01200011
[1448642.646848] 7f80188279d0 880604371de0 81064d68
[1448642.648987] Call Trace:
[1448642.649568] [] prepare_creds+0x28/0x160
[1448642.650822] [] copy_creds+0x36/0x160
[1448642.652019] [] copy_process+0x310/0x14b0
[1448642.653295] [] ? __alloc_fd+0x45/0x110
[1448642.654529] [] do_fork+0x9c/0x280
[1448642.655668] [] ? get_unused_fd_flags+0x30/0x40
[1448642.657473] [] ? __do_pipe_flags+0x7f/0xc0
[1448642.658808] [] ? __fd_install+0x2b/0x60
[1448642.660062] [] SyS_clone+0x16/0x20
[1448642.661222] [] stub_clone+0x69/0x90
[1448642.662399] [] ? system_call_fastpath+0x16/0x1b
[1448642.663807] Code: 00 49 8b 50 08 4d 8b 28 49 8b 40 10 4d 85 ed 0f 84 f7 00 00 00 48 85 c0 0f 84 ee 00 00 00 49 63 44 24 20 48 8d 4a 01 49 8b 3c 24 <49> 8b 5c 05 00 4c 89 e8 65 48 0f c7 0f 0f 94 c0 84 c0 74 b5 49
[1448642.671081] RIP [] kmem_cache_alloc+0x67/0x170
[1448642.672508] RSP
[1448642.673330] ---[ end trace fe4b503d6f77c801 ]---
[1448642.674408] general protection fault: [#2] SMP
[1448642.675623] Modules linked in: mpt2sas scsi_transport_sas raid_class mptctl mptbase dell_rbu ipmi_devintf ipmi_si ipmi_msghandler lockd 8021q mrp garp stp llc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr iw_cxgb3 mlx4_ib ib_sa ib_mad ib_core mlx4_en ext4 jbd2 mbcache joydev fuse ses bnx2 coretemp mlx4_core cxgb3 hwmon mdio enclosure iTCO_wdt iTCO_vendor_support freq_table mperf wmi ehci_pci ehci_hcd dcdbas serio_raw microcode lpc_ich mfd_core sunrpc ipv6 autofs4 crc32c_intel megaraid_sas uhci_hcd dm_mirror dm_region_hash dm_log dm_mod
[1448642.690185] CPU: 11 PID: 27807 Comm: primary_nic_is_ Tainted: G D 3.10.16-1.rgm.fc16.x86_64 #1
[1448642.692328] Hardware name: Dell Inc. PowerEdge R610/0XDN97, BIOS 6.3.0 07/24/2012
[1448642.694027] task: 8806628c3880 ti: 88060437 task.ti: 88060437
[1448642.695726] RIP: 0010:[] [] kmem_cache_alloc+0x67/0x170
[1448642.698133] RSP: 0018:880667d43ad0 EFLAGS: 00010282
[1448642.792981] RAX: RBX: 81a9c580 RCX: 7ecb996a
[1448642.889673] RDX: 7ecb9969 RSI: 0020 RDI: 00015900
[1448642.994115] RBP: 880667d43b20 R08: 880667d55900 R09: 81a9e5a0
[1448643.090682] R10:
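Since kdump was configured, the natural tool for poking at the dump is the crash utility. A minimal session sketch (the vmlinux and vmcore paths below are illustrative, and it assumes a debuginfo vmlinux matching the crashed kernel):

    $ crash /usr/lib/debug/lib/modules/3.10.16-1.rgm.fc16.x86_64/vmlinux /var/crash/vmcore
    crash> log       # replay the kernel ring buffer from the dump
    crash> bt        # backtrace of the crashing task (PID 27807 here)
    crash> kmem -s   # slab cache state, relevant to a fault inside kmem_cache_alloc()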
[tip:sched/core] sched/rt: Remove redundant nr_cpus_allowed test
Commit-ID:  6bfa687c19b7ab8adee03f0d43c197c2945dd869
Gitweb:     http://git.kernel.org/tip/6bfa687c19b7ab8adee03f0d43c197c2945dd869
Author:     Shawn Bohrer
AuthorDate: Fri, 4 Oct 2013 14:24:53 -0500
Committer:  Ingo Molnar
CommitDate: Sun, 6 Oct 2013 11:28:40 +0200

sched/rt: Remove redundant nr_cpus_allowed test

In 76854c7e8f3f4172fef091e78d88b3b751463ac6 ("sched: Use
rt.nr_cpus_allowed to recover select_task_rq() cycles") an
optimization was added to select_task_rq_rt() that immediately returns
when p->nr_cpus_allowed == 1 at the beginning of the function.

This makes the latter p->nr_cpus_allowed > 1 check redundant, which
can now be removed.

Signed-off-by: Shawn Bohrer
Reviewed-by: Steven Rostedt
Cc: Mike Galbraith
Cc: t...@rgmadvisors.com
Cc: Peter Zijlstra
Link: http://lkml.kernel.org/r/1380914693-24634-1-git-send-email-shawn.boh...@gmail.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/rt.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 01970c8..ceebfba 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1213,8 +1213,7 @@ select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
 	 */
 	if (curr && unlikely(rt_task(curr)) &&
 	    (curr->nr_cpus_allowed < 2 ||
-	     curr->prio <= p->prio) &&
-	    (p->nr_cpus_allowed > 1)) {
+	     curr->prio <= p->prio)) {
 		int target = find_lowest_rq(p);

 		if (target != -1)
[PATCH] sched/rt: Remove redundant nr_cpus_allowed test
From: Shawn Bohrer

In 76854c7e8f3f4172fef091e78d88b3b751463ac6 "sched: Use
rt.nr_cpus_allowed to recover select_task_rq() cycles" an optimization
was added to select_task_rq_rt() that immediately returns when
p->nr_cpus_allowed == 1 at the beginning of the function.

This makes the latter p->nr_cpus_allowed > 1 check redundant and can
be removed.

Signed-off-by: Shawn Bohrer
---
 kernel/sched/rt.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 01970c8..ceebfba 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1213,8 +1213,7 @@ select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
 	 */
 	if (curr && unlikely(rt_task(curr)) &&
 	    (curr->nr_cpus_allowed < 2 ||
-	     curr->prio <= p->prio) &&
-	    (p->nr_cpus_allowed > 1)) {
+	     curr->prio <= p->prio)) {
 		int target = find_lowest_rq(p);

 		if (target != -1)
--
1.8.1.4
[PATCH] USB: Fix compilation error when CONFIG_PM disabled
Commit 9a11899c5e699a8d "USB: OHCI: add missing PCI PM callbacks to ohci-pci.c" introduced the following compilation errors when power management is disabled:

drivers/usb/host/ohci-pci.c: In function 'ohci_pci_init':
drivers/usb/host/ohci-pci.c:309:35: error: 'ohci_suspend' undeclared (first use in this function)
drivers/usb/host/ohci-pci.c:309:35: note: each undeclared identifier is reported only once for each function it appears in
drivers/usb/host/ohci-pci.c:310:34: error: 'ohci_resume' undeclared (first use in this function)

ohci_suspend and ohci_resume are only defined when CONFIG_PM is defined, so only use them under CONFIG_PM.

Signed-off-by: Shawn Bohrer
---
 drivers/usb/host/ohci-pci.c | 2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/usb/host/ohci-pci.c b/drivers/usb/host/ohci-pci.c
index 0f1d193..062b410 100644
--- a/drivers/usb/host/ohci-pci.c
+++ b/drivers/usb/host/ohci-pci.c
@@ -305,9 +305,11 @@ static int __init ohci_pci_init(void)
 	ohci_init_driver(&ohci_pci_hc_driver, &pci_overrides);

+#ifdef CONFIG_PM
 	/* Entries for the PCI suspend/resume callbacks are special */
 	ohci_pci_hc_driver.pci_suspend = ohci_suspend;
 	ohci_pci_hc_driver.pci_resume = ohci_resume;
+#endif

 	return pci_register_driver(&ohci_pci_driver);
 }
--
1.7.7.6
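For what it's worth, the failure is easy to reproduce without a full build by compiling just this object with PM turned off. A sketch using the kernel's scripts/config helper, assuming an already-configured tree with a kbuild recent enough to have olddefconfig:

    # Turn off power management and rebuild only the OHCI PCI glue
    scripts/config --disable PM
    make olddefconfig
    make drivers/usb/host/ohci-pci.o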
Re: [PATCH net] net: rename and move busy poll mib counter
On Tue, Aug 06, 2013 at 03:14:48AM -0700, Eric Dumazet wrote:
> On Tue, 2013-08-06 at 12:52 +0300, Eliezer Tamir wrote:
> > Move the low latency mib counter to the ip section.
> > Rename it from low latency to busy poll.
> >
> > Reported-by: Shawn Bohrer
> > Signed-off-by: Eliezer Tamir
> > ---
>
> Well, it should not be part of IP mib, but a socket one (not existing so far)
>
> Linux MIB already contains few non TCP counters :
>
> LINUX_MIB_ARPFILTER
> LINUX_MIB_IPRPFILTER

Doesn't mean they are in the correct place either, but perhaps it's too late for them.

> Its mostly populated by TCP counters, sure.

See, on the kernel side these are called "LINUX_MIB*" which seems perfectly sane and I wouldn't even think the statistic is out of place. On the user-mode side these are all reported in /proc/net/netstat as TcpExt statistics. I can tell you that I don't look at TCP statistics when I'm debugging/testing UDP issues (apparently I should).

-- Shawn
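For anyone else digging through these counters: /proc/net/netstat stores each group as a header line of counter names followed by a second line of values with the same prefix, so pairing them up takes a little shell. A sketch that prints the TcpExt counters one per line (counter names vary by kernel version, so look for the busy poll counter by eye):

    awk '/^TcpExt:/ {
        if (n == 0) { n = split($0, names) }    # first line: counter names
        else {
            split($0, vals)                     # second line: values
            for (i = 2; i <= n; i++) print names[i], vals[i]
            n = 0
        }
    }' /proc/net/netstat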
Re: 3.10-rc4 stalls during mmap writes
On Tue, Jun 11, 2013 at 06:53:15AM +1000, Dave Chinner wrote:
> On Mon, Jun 10, 2013 at 01:45:59PM -0500, Shawn Bohrer wrote:
> > On Sun, Jun 09, 2013 at 01:37:44PM +1000, Dave Chinner wrote:
> > So to summarize it appears that most of the time was spent with
> > kworker/4:0 blocking in xfs_buf_lock() and kworker/2:1H, which is woken
> > up by a softirq, is the one that calls xfs_buf_unlock(). Assuming I'm
> > not missing some important intermediate steps does this provide any
> > more information about what resource I'm actually waiting for? Does
> > this point to any changes that happened after 3.4? Are there any tips
> > that could help minimize these contentions?
>
> The only difference between this and 3.4 is the allocation workqueue
> thread. That, however, won't be introducing second long delays. What
> you are seeing here is simply the latency of waiting for
> background metadata IO to complete during an allocation which has
> the ilock held

Again thank you for your analysis Dave. I've taken a step back to look at the big picture, and that allowed me to identify what _has_ changed between 3.4 and 3.10. What changed is the behavior of vm.dirty_expire_centisecs. Honestly, the previous behavior never made any sense to me, and I'm not entirely sure the current behavior does either.

In the workload I've been debugging we append data to many small files using mmap. The writes are small and the total data rate is very low, thus for most files it may take several minutes to fill a page. Having low-latency writes is important, but as you know stalls are always possible. One way to reduce the probability of a stall is to reduce the frequency of writeback, and adjusting vm.dirty_expire_centisecs and/or vm.dirty_writeback_centisecs should allow us to do that.

On kernels 3.4 and older we chose to increase vm.dirty_expire_centisecs to 30000 since we can comfortably lose 5 minutes of data in the event of a system failure, and we believed this would cause a fairly consistent low data rate as every vm.dirty_writeback_centisecs (5s) it would write back all dirty pages that were vm.dirty_expire_centisecs (5min) old. On old kernels that isn't exactly what happened. Instead every 5 minutes there would be a burst of writeback and a slow trickle at all other times. This also reduced the total amount of data written back since the same dirty page wasn't written back every 30 seconds. This also virtually eliminated the stalls we saw, so it was left alone.

On 3.10 vm.dirty_expire_centisecs=30000 no longer does the same thing. Honestly I'm not sure what it does, but the result is a fairly consistent high data rate being written back to disk. The fact that it is consistent might lead me to believe that it writes back all pages that are vm.dirty_expire_centisecs old every vm.dirty_writeback_centisecs, but the data rate is far too high for that to be true. It appears that I can effectively get the same old behavior by setting vm.dirty_writeback_centisecs=30000.

-- Shawn
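In sysctl terms, the configuration described above is the following (30000 centiseconds = 5 minutes):

    # Let dirty pages age for ~5 minutes before they are expired,
    # and only wake the flusher threads every ~5 minutes as well
    sysctl -w vm.dirty_expire_centisecs=30000
    sysctl -w vm.dirty_writeback_centisecs=30000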
Re: 3.10-rc4 stalls during mmap writes
On Sun, Jun 09, 2013 at 01:37:44PM +1000, Dave Chinner wrote:
> On Fri, Jun 07, 2013 at 02:37:12PM -0500, Shawn Bohrer wrote:
> > So I guess my question is does anyone know why I'm now seeing these
> > stalls with 3.10?
>
> Because we made all metadata updates in XFS fully transactional in 3.4:
>
> commit 8a9c9980f24f6d86e0ec0150ed35fba45d0c9f88
> Author: Christoph Hellwig
> Date: Wed Feb 29 09:53:52 2012 +0000
>
>     xfs: log timestamp updates
>
>     Timestamps on regular files are the last metadata that XFS does not update
>     transactionally. Now that we use the delaylog mode exclusively and made
>     the log scode scale extremly well there is no need to bypass that code for
>     timestamp updates. Logging all updates allows to drop a lot of code, and
>     will allow for further performance improvements later on.
>
>     Note that this patch drops optimized handling of fdatasync - it will be
>     added back in a separate commit.
>
>     Reviewed-by: Dave Chinner
>     Signed-off-by: Christoph Hellwig
>     Reviewed-by: Mark Tinguely
>     Signed-off-by: Ben Myers
>
> $ git describe --contains 8a9c998
> v3.4-rc1~55^2~23
>
> IOWs, you're just lucky you haven't noticed it on 3.4
>
> > Are there any suggestions for how to eliminate them?
>
> Nope. You're stuck with it - there's far more places in the page
> fault path where you can get stuck on the same lock for the same
> reason - e.g. during block mapping for the newly added pagecache
> page...
>
> Hint: mmap() does not provide -deterministic- low latency access to
> mapped pages - it is only "mostly low latency".

Hi Dave,

I appreciate your time and analysis. I am sadly aware that doing file I/O and expecting low latency is a bit of a stretch. This is also not the first time I've battled these types of stalls, and I realize that the best I can do is search for opportunities to reduce the probability of a stall, or find ways to reduce the duration of a stall.

In this case, since updating timestamps has been transactional and in place since 3.4, it is obvious to me that this is not the cause of both the increased rate and duration of the stalls on 3.10. I assure you on 3.4 we have very few stalls that are greater than 10 milliseconds in our normal workload, and with 3.10 I'm seeing them regularly. Since we know I can, and likely do, hit the same code path on 3.4, that tells me that the xfs_ilock() is likely being held for a longer duration on the new kernel.

Let's see if we can determine why the lock is held for so long. Here is an attempt at that, as events unfold in time order. I'm in no way a filesystem developer, so any input or analysis of what we're waiting on is appreciated. There are also multiple kworker threads involved, which certainly complicates things.

It starts with kworker/u49:0, which acquires xfs_ilock() inside xfs_iomap_write_allocate():

kworker/u49:0-15748 [004] 256032.180361: funcgraph_entry: | xfs_iomap_write_allocate() {
kworker/u49:0-15748 [004] 256032.180363: funcgraph_entry: 0.074 us | xfs_ilock();

In the next two chunks it appears that kworker/u49:0 calls xfs_bmapi_allocate, which offloads that work to kworker/4:0 calling __xfs_bmapi_allocate(). kworker/4:0 ends up blocking on xfs_buf_lock().

kworker/4:0-27520 [004] 256032.180389: sched_switch: prev_comm=kworker/4:0 prev_pid=27520 prev_prio=120 prev_state=D ==> next_comm=kworker/u49:0 next_pid=15748 next_prio=120
kworker/4:0-27520 [004] 256032.180393: kernel_stack:
=> schedule (814ca379)
=> schedule_timeout (814c810d)
=> __down_common (814c8e5e)
=> __down (814c8f26)
=> down (810658e1)
=> xfs_buf_lock (811c12a4)
=> _xfs_buf_find (811c1469)
=> xfs_buf_get_map (811c16e4)
=> xfs_buf_read_map (811c2691)
=> xfs_trans_read_buf_map (81225fa9)
=> xfs_btree_read_buf_block.constprop.6 (811f2242)
=> xfs_btree_lookup_get_block (811f22fb)
=> xfs_btree_lookup (811f6707)
=> xfs_alloc_ag_vextent_near (811d9d52)
=> xfs_alloc_ag_vextent (811da8b5)
=> xfs_alloc_vextent (811db545)
=> xfs_bmap_btalloc (811e6951)
=> xfs_bmap_alloc (811e6dee)
=> __xfs_bmapi_allocate (811ec024)
=> xfs_bmapi_allocate_worker (811ec283)
=> process_one_work (81059104)
=> worker_thread (8105a1bc)
=> kthread (810605f0)
=> ret_from_fork (814d395c)

kworker/u49:0-15748 [004] 256032.180403: sched_switch: prev_comm=kworker/u49:0 prev_pid=15748 prev_prio=120 prev_state=D ==> next_comm=kworker/4:1H next_pid=3921 next_prio=100
kworker/u49:0-15748 [004] 256032.180408: kernel_stack: =
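For anyone wanting to reproduce this kind of trace, roughly the following trace-cmd invocation matches the setup above. This is a sketch; it assumes a trace-cmd new enough to have -T (record a stack trace with each event), with -p selecting the tracer plugin and -l the function filter:

    $ sudo trace-cmd record -p function_graph \
          -l 'xfs_i*lock' -l xfs_iomap_write_allocate \
          -e sched:sched_switch -e sched:sched_wakeup -T
    $ trace-cmd report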
3.10-rc4 stalls during mmap writes
I've started testing the 3.10 kernel, previously I was on 3.4, and I'm encountering some fairly large stalls in my memory mapped writes in the range of 0.01s to 1s. I've managed to capture two of these stalls so far and both looked like the following:

1) Writing process writes to a new page and blocks on xfs_ilock:

<...>-21567 [009] 9435.453069: sched_switch: prev_comm=tick_receiver_m prev_pid=21567 prev_prio=79 prev_state=D ==> next_comm=swapper/9 next_pid=0 next_prio=120
<...>-21567 [009] 9435.453072: kernel_stack:
=> schedule (814ca379)
=> rwsem_down_write_failed (814cb095)
=> call_rwsem_down_write_failed (81275053)
=> xfs_ilock (8120b25c)
=> xfs_vn_update_time (811cf3d3)
=> update_time (81158dd3)
=> file_update_time (81158f0c)
=> block_page_mkwrite (81171d23)
=> xfs_vm_page_mkwrite (811c5375)
=> do_wp_page (8110c27f)
=> handle_pte_fault (8110dd24)
=> handle_mm_fault (8110f430)
=> __do_page_fault (814cef72)
=> do_page_fault (814cf2e7)
=> page_fault (814cbab2)

2) kworker calls xfs_iunlock and wakes up my process:

kworker/u50:1-403 [013] 9436.027354: sched_wakeup: comm=tick_receiver_m pid=21567 prio=79 success=1 target_cpu=009
kworker/u50:1-403 [013] 9436.027359: kernel_stack:
=> ttwu_do_activate.constprop.34 (8106c556)
=> try_to_wake_up (8106e996)
=> wake_up_process (8106ea87)
=> __rwsem_do_wake (8126e531)
=> rwsem_wake (8126e62a)
=> call_rwsem_wake (81275077)
=> xfs_iunlock (8120b55c)
=> xfs_iomap_write_allocate (811ce4e7)
=> xfs_map_blocks (811bf145)
=> xfs_vm_writepage (811bfbc2)
=> __writepage (810f14e7)
=> write_cache_pages (810f189e)
=> generic_writepages (810f1b3a)
=> xfs_vm_writepages (811bef8d)
=> do_writepages (810f3380)
=> __writeback_single_inode (81166ae5)
=> writeback_sb_inodes (81167d4d)
=> __writeback_inodes_wb (8116800e)
=> wb_writeback (811682bb)
=> wb_check_old_data_flush (811683ff)
=> wb_do_writeback (81169bd1)
=> bdi_writeback_workfn (81169cca)
=> process_one_work (81059104)
=> worker_thread (8105a1bc)
=> kthread (810605f0)
=> ret_from_fork (814d395c)

In this case my process stalled for roughly half a second:

<...>-21567 [009] 9436.027388: print: tracing_mark_write: stall of 0.574282

So I guess my question is does anyone know why I'm now seeing these stalls with 3.10? Are there any suggestions for how to eliminate them?

# xfs_info /home/
meta-data=/dev/sda5              isize=256    agcount=4, agsize=67774016 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=271096064, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=132371, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# grep sda5 /proc/mounts
/dev/sda5 /home xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,noquota 0 0

Thanks,
Shawn
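The "stall of 0.574282" line above is the application timing its own writes and logging outliers into the trace via trace_marker, which shows up as tracing_mark_write: events interleaved with the kernel events. A minimal sketch of that technique from the shell:

    # Anything written to trace_marker lands in the trace buffer
    # on the same timeline as the kernel events
    echo "stall of 0.574282" > /sys/kernel/debug/tracing/trace_marker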
Re: old version of trace-cmd broken on 3.10 kernel
On Fri, May 31, 2013 at 01:05:35PM -0400, Steven Rostedt wrote:
> On Fri, 2013-05-31 at 11:50 -0500, Shawn Bohrer wrote:
> > Not sure if this is a big deal or not. I've got an old version of
> > trace-cmd. It was built from git on 2012-09-12 but sadly I didn't
> > stash away the exact commit hash. Anyway this version works fine on a
> > 3.4 kernel but on a 3.10-rc3 kernel it no longer works. I just pulled
> > the latest trace-cmd from git and it works fine on the 3.10 kernel so
> > maybe this isn't an issue, but I don't typically expect applications
> > to break with a kernel upgrade.
> >
> > When I run the old version on 3.10.0-rc3 I get the following output:
>
> Yep, in 3.10 I fixed a long standing bug in the splice code, that when
> fixed, the old trace-cmd would fail.
>
> I made a fix to all the stable releases of trace-cmd and posted it to
> LKML back on March 1st.
>
> https://lkml.org/lkml/2013/3/1/596

Thanks Steve, I'll just update my trace-cmd version. It is about time for an update anyway.

Thanks,
Shawn
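For reference, updating is just a rebuild from git; the repository URL below is from memory, so double-check it:

    $ git clone git://git.kernel.org/pub/scm/utils/trace-cmd/trace-cmd.git
    $ cd trace-cmd
    $ make
    $ sudo make install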
old version of trace-cmd broken on 3.10 kernel
Not sure if this is a big deal or not. I've got an old version of trace-cmd. It was built from git on 2012-09-12, but sadly I didn't stash away the exact commit hash. Anyway, this version works fine on a 3.4 kernel but on a 3.10-rc3 kernel it no longer works. I just pulled the latest trace-cmd from git and it works fine on the 3.10 kernel, so maybe this isn't an issue, but I don't typically expect applications to break with a kernel upgrade.

When I run the old version on 3.10.0-rc3 I get the following output:

$ sudo trace-cmd record -e sched:sched_switch -e sched:sched_wakeup sleep 1
/sys/kernel/debug/tracing/events/sched/sched_wakeup/filter
/sys/kernel/debug/tracing/events/sched/sched_switch/filter
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input
trace-cmd: Interrupted system call recorder error in splice input

Kernel buffer statistics:
Note: "entries" are the entries left in the kernel ring buffer and are not recorded in the trace data. They should all be zero.

CPU: 0  entries: 93  overrun: 0 commit overrun: 0 bytes: 5112  oldest event ts: 7013.613622 now ts: 7013.656459 dropped events: 0 read events: 607
CPU: 1  entries: 88  overrun: 0 commit overrun: 0 bytes: 4396  oldest event ts: 7013.638352 now ts: 7013.656515 dropped events: 0 read events: 960
CPU: 2  entries: 154 overrun: 0 commit overrun: 0 bytes: 7592  oldest event ts: 7013.583332 now ts: 7013.656572 dropped events: 0 read events: 1211
CPU: 3  entries: 150 overrun: 0 commit overrun: 0 bytes: 7324  oldest event ts: 7013.589348 now ts: 7013.656625 dropped events: 0 read events: 1303
CPU: 4  entries: 84  overrun: 0 commit overrun: 0 bytes: 5096  oldest event ts: 7013.621239 now ts: 7013.656677 dropped events: 0 read events: 1175
CPU: 5  entries: 146 overrun: 0 commit overrun: 0 bytes: 9016  oldest event ts: 7013.601549 now ts: 7013.656729 dropped events: 0 read events: 2204
CPU: 6  entries: 77  overrun: 0 commit overrun: 0 bytes: 4824  oldest event ts: 7013.651234 now ts: 7013.656781 dropped events: 0 read events: 2148
CPU: 7  entries: 109 overrun: 0 commit overrun: 0 bytes: 6600  oldest event ts: 7013.621326 now ts: 7013.656837 dropped events: 0 read events: 1672
CPU: 8  entries: 110 overrun: 0 commit overrun: 0 bytes: 6804  oldest event ts: 7013.603146 now ts: 7013.656891 dropped events: 0 read events: 2272
CPU: 9  entries: 142 overrun: 0 commit overrun: 0 bytes: 8496  oldest event ts: 7013.584943 now ts: 7013.656942 dropped events: 0 read events: 1521
CPU: 10 entries: 98  overrun: 0 commit overrun: 0 bytes: 6408  oldest event ts: 7013.617605 now ts: 7013.656995 dropped events: 0 read events: 1706
CPU: 11 entries: 293 overrun: 0 commit overrun: 0 bytes: 19208 oldest event ts: 7013.607094 now ts: 7013.657047 dropped events: 0 read events: 9236
CPU: 12 entries: 152 overrun: 0 commit overrun: 0 bytes: 7136  oldest event ts: 7013.582819 now ts: 7013.657099 dropped events: 0 read events: 1112
CPU: 13 entries: 86  overrun: 0 commit overrun: 0 bytes: 3928  oldest event ts: 7013.560591 now ts: 7013.657150 dropped events: 0 read events: 769
CPU: 14 entries: 85  overrun: 0 commit overrun: 0 bytes: 4076  oldest event ts: 7013.426020 now ts: 7013.657202 dropped events: 0 read events: 586
CPU: 15 entries: 211 overrun: 0 commit overrun: 0 bytes: 9568  oldest event ts: 7013.578705 now ts: 7013.657253 dropped events: 0 read events: 1578
CPU: 16 entries: 114 overrun: 0 commit overrun: 0 bytes: 7104  oldest event ts: 7013.626635 now ts:
old version of trace-cmd broken on 3.10 kernel
Not sure if this is a big deal or not. I've got an old version of trace-cmd. It was built from git on 2012-09-12 but sadly I didn't stash away the exact commit hash. Anyway this version works fine on a 3.4 kernel but on a 3.10-rc3 kernel it no longer works. I just pulled the latest trace-cmd from git and it works fine on the 3.10 kernel so maybe this isn't an issue, but I don't typically expect applications to break with a kernel upgrade. When I run the old version on 3.10.0-rc3 I get the following output: $ sudo trace-cmd record -e sched:sched_switch -e sched:sched_wakeup sleep 1 /sys/kernel/debug/tracing/events/sched/sched_wakeup/filter /sys/kernel/debug/tracing/events/sched/sched_switch/filter trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input trace-cmd: Interrupted system call recorder error in splice input Kernel buffer statistics: Note: entries are the entries left in the kernel ring buffer and are not recorded in the trace data. They should all be zero. 
CPU: 0   entries: 93   overrun: 0  commit overrun: 0  bytes: 5112   oldest event ts: 7013.613622  now ts: 7013.656459  dropped events: 0  read events: 607
CPU: 1   entries: 88   overrun: 0  commit overrun: 0  bytes: 4396   oldest event ts: 7013.638352  now ts: 7013.656515  dropped events: 0  read events: 960
CPU: 2   entries: 154  overrun: 0  commit overrun: 0  bytes: 7592   oldest event ts: 7013.583332  now ts: 7013.656572  dropped events: 0  read events: 1211
CPU: 3   entries: 150  overrun: 0  commit overrun: 0  bytes: 7324   oldest event ts: 7013.589348  now ts: 7013.656625  dropped events: 0  read events: 1303
CPU: 4   entries: 84   overrun: 0  commit overrun: 0  bytes: 5096   oldest event ts: 7013.621239  now ts: 7013.656677  dropped events: 0  read events: 1175
CPU: 5   entries: 146  overrun: 0  commit overrun: 0  bytes: 9016   oldest event ts: 7013.601549  now ts: 7013.656729  dropped events: 0  read events: 2204
CPU: 6   entries: 77   overrun: 0  commit overrun: 0  bytes: 4824   oldest event ts: 7013.651234  now ts: 7013.656781  dropped events: 0  read events: 2148
CPU: 7   entries: 109  overrun: 0  commit overrun: 0  bytes: 6600   oldest event ts: 7013.621326  now ts: 7013.656837  dropped events: 0  read events: 1672
CPU: 8   entries: 110  overrun: 0  commit overrun: 0  bytes: 6804   oldest event ts: 7013.603146  now ts: 7013.656891  dropped events: 0  read events: 2272
CPU: 9   entries: 142  overrun: 0  commit overrun: 0  bytes: 8496   oldest event ts: 7013.584943  now ts: 7013.656942  dropped events: 0  read events: 1521
CPU: 10  entries: 98   overrun: 0  commit overrun: 0  bytes: 6408   oldest event ts: 7013.617605  now ts: 7013.656995  dropped events: 0  read events: 1706
CPU: 11  entries: 293  overrun: 0  commit overrun: 0  bytes: 19208  oldest event ts: 7013.607094  now ts: 7013.657047  dropped events: 0  read events: 9236
CPU: 12  entries: 152  overrun: 0  commit overrun: 0  bytes: 7136   oldest event ts: 7013.582819  now ts: 7013.657099  dropped events: 0  read events: 1112
CPU: 13  entries: 86   overrun: 0  commit overrun: 0  bytes: 3928   oldest event ts: 7013.560591  now ts: 7013.657150  dropped events: 0  read events: 769
CPU: 14  entries: 85   overrun: 0  commit overrun: 0  bytes: 4076   oldest event ts: 7013.426020  now ts: 7013.657202  dropped events: 0  read events: 586
CPU: 15  entries: 211  overrun: 0  commit overrun: 0  bytes: 9568   oldest event ts: 7013.578705  now ts: 7013.657253  dropped events: 0  read events: 1578
CPU: 16  entries: 114  overrun: 0  commit overrun: 0  bytes: 7104   oldest event ts: 7013.626635  now ts: 7013.657304
Re: old version of trace-cmd broken on 3.10 kernel
On Fri, May 31, 2013 at 01:05:35PM -0400, Steven Rostedt wrote:
> On Fri, 2013-05-31 at 11:50 -0500, Shawn Bohrer wrote:
> > Not sure if this is a big deal or not. I've got an old version of trace-cmd. It was built from git on 2012-09-12 but sadly I didn't stash away the exact commit hash. Anyway this version works fine on a 3.4 kernel but on a 3.10-rc3 kernel it no longer works. I just pulled the latest trace-cmd from git and it works fine on the 3.10 kernel so maybe this isn't an issue, but I don't typically expect applications to break with a kernel upgrade. When I run the old version on 3.10.0-rc3 I get the following output:
>
> Yep, in 3.10 I fixed a long-standing bug in the splice code which, once fixed, made the old trace-cmd fail. I made a fix to all the stable releases of trace-cmd and posted it to LKML back on March 1st.
>
> https://lkml.org/lkml/2013/3/1/596

Thanks Steve, I'll just update my trace-cmd version. It is about time for an update anyway.

Thanks,
Shawn
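For context, the failure mode above matches a recorder that treats an EINTR return from splice() as a fatal error instead of retrying. The sketch below is illustrative only - it is not the actual trace-cmd source, and the real fix is the LKML patch linked above - but it shows the shape of the retry loop:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Move up to len bytes from a per-cpu trace pipe fd into a pipe,
 * retrying when splice() is interrupted by a signal.  A recorder
 * that reports an error on EINTR instead of retrying produces the
 * "recorder error in splice input" spew seen above.
 */
static ssize_t splice_retry(int trace_fd, int pipe_fd, size_t len)
{
	ssize_t ret;

	do {
		ret = splice(trace_fd, NULL, pipe_fd, NULL, len,
			     SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
	} while (ret < 0 && errno == EINTR);

	return ret;
}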
Re: deadlock on vmap_area_lock
On Thu, May 02, 2013 at 08:03:04AM +1000, Dave Chinner wrote:
> On Wed, May 01, 2013 at 08:57:38AM -0700, David Rientjes wrote:
> > On Wed, 1 May 2013, Shawn Bohrer wrote:
> > >
> > > I've got two compute clusters with around 350 machines each which are running kernels based off of 3.1.9 (Yes I realize this is ancient by todays standards).
>
> xfs_info output of one of those filesystems? What platform are you running (32 or 64 bit)?

# cat /proc/mounts | grep data-cache
/dev/sdb1 /data-cache xfs rw,nodiratime,relatime,attr2,delaylog,noquota 0 0
# xfs_info /data-cache
meta-data=/dev/sdb1              isize=256    agcount=4, agsize=66705344 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=266821376, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=130283, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

These are 64-bit systems. The ones that hit the issue more frequently have 96 GB of RAM.

> > > All of the machines run a 'find' command once an hour on one of the mounted XFS filesystems. Occasionally these find commands get stuck requiring a reboot of the system. I took a peek today and see this with perf:
> > >
> > > 72.22%  find  [kernel.kallsyms]  [k] _raw_spin_lock
> > >         |
> > >         --- _raw_spin_lock
> > >            |
> > >            |--98.84%-- vm_map_ram
> > >            |          _xfs_buf_map_pages
> > >            |          xfs_buf_get
> > >            |          xfs_buf_read
> > >            |          xfs_trans_read_buf
> > >            |          xfs_da_do_buf
> > >            |          xfs_da_read_buf
> > >            |          xfs_dir2_block_getdents
> > >            |          xfs_readdir
> > >            |          xfs_file_readdir
> > >            |          vfs_readdir
> > >            |          sys_getdents
> > >            |          system_call_fastpath
> > >            |          __getdents64
> > >            |
> > >            |--1.12%-- _xfs_buf_map_pages
> > >            |          xfs_buf_get
> > >            |          xfs_buf_read
> > >            |          xfs_trans_read_buf
> > >            |          xfs_da_do_buf
> > >            |          xfs_da_read_buf
> > >            |          xfs_dir2_block_getdents
> > >            |          xfs_readdir
> > >            |          xfs_file_readdir
> > >            |          vfs_readdir
> > >            |          sys_getdents
> > >            |          system_call_fastpath
> > >            |          __getdents64
> > >             --0.04%-- [...]
> > >
> > > Looking at the code my best guess is that we are spinning on vmap_area_lock, but I could be wrong. This is the only process spinning on the machine so I'm assuming either another process has blocked while holding the lock, or perhaps this find process has tried to acquire the vmap_area_lock twice?
> >
> > Significant spinlock contention doesn't necessarily mean that there's a deadlock, but it also doesn't mean the opposite. Depending on your definition of "occasionally", would it be possible to run with CONFIG_PROVE_LOCKING and CONFIG_LOCKDEP to see if it uncovers any real deadlock potential?
>
> It sure will. We've been reporting that vm_map_ram is doing GFP_KERNEL allocations from GFP_NOFS context for years, and have reported plenty of lockdep dumps as a result of it.
>
> But that's not the problem that is occurring above - lockstat is probably a good thing to look at here to determine exactly what locks are being contended on.

I've built a kernel with lock_stat, CONFIG_PROVE_LOCKING, and CONFIG_LOCKDEP, and have one machine running with that kernel. We'll probably put machines on this debug kernel when we reboot them and hopefully one will trigger the issue.

Thanks,
Shawn
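For reference, once that CONFIG_LOCK_STAT kernel is booted, the contention data can be pulled straight out of /proc; the wait-time and holder columns should show exactly who is sitting on vmap_area_lock. The commands below are the standard documented lock_stat interface, shown here for illustration:

# echo 1 > /proc/sys/kernel/lock_stat    # turn collection on
# echo 0 > /proc/lock_stat               # clear any stale counters
  ... wait for a find to get stuck ...
# grep -A 8 vmap_area_lock /proc/lock_stat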
Re: deadlock on vmap_area_lock
On Wed, May 01, 2013 at 08:57:38AM -0700, David Rientjes wrote:
> On Wed, 1 May 2013, Shawn Bohrer wrote:
> >
> > I've got two compute clusters with around 350 machines each which are running kernels based off of 3.1.9 (Yes I realize this is ancient by todays standards). All of the machines run a 'find' command once an hour on one of the mounted XFS filesystems. Occasionally these find commands get stuck requiring a reboot of the system. I took a peek today and see this with perf:
> >
> > 72.22%  find  [kernel.kallsyms]  [k] _raw_spin_lock
> >         |
> >         --- _raw_spin_lock
> >            |
> >            |--98.84%-- vm_map_ram
> >            |          _xfs_buf_map_pages
> >            |          xfs_buf_get
> >            |          xfs_buf_read
> >            |          xfs_trans_read_buf
> >            |          xfs_da_do_buf
> >            |          xfs_da_read_buf
> >            |          xfs_dir2_block_getdents
> >            |          xfs_readdir
> >            |          xfs_file_readdir
> >            |          vfs_readdir
> >            |          sys_getdents
> >            |          system_call_fastpath
> >            |          __getdents64
> >            |
> >            |--1.12%-- _xfs_buf_map_pages
> >            |          xfs_buf_get
> >            |          xfs_buf_read
> >            |          xfs_trans_read_buf
> >            |          xfs_da_do_buf
> >            |          xfs_da_read_buf
> >            |          xfs_dir2_block_getdents
> >            |          xfs_readdir
> >            |          xfs_file_readdir
> >            |          vfs_readdir
> >            |          sys_getdents
> >            |          system_call_fastpath
> >            |          __getdents64
> >             --0.04%-- [...]
> >
> > Looking at the code my best guess is that we are spinning on vmap_area_lock, but I could be wrong. This is the only process spinning on the machine so I'm assuming either another process has blocked while holding the lock, or perhaps this find process has tried to acquire the vmap_area_lock twice?
>
> Significant spinlock contention doesn't necessarily mean that there's a deadlock, but it also doesn't mean the opposite.

Correct, it doesn't, and I can't prove the find command is not making progress; however, these finds normally complete in under 15 min and we've let the stuck ones run for days. Additionally, if this was just contention I'd expect to see multiple threads/CPUs contending, and I only have a single CPU pegged running find at 99%. I should clarify that the perf snippet above was for the entire system. Profiling just the find command shows:

82.56%  find  [kernel.kallsyms]  [k] _raw_spin_lock
16.63%  find  [kernel.kallsyms]  [k] vm_map_ram
 0.13%  find  [kernel.kallsyms]  [k] hrtimer_interrupt
 0.04%  find  [kernel.kallsyms]  [k] update_curr
 0.03%  find  [igb]              [k] igb_poll
 0.03%  find  [kernel.kallsyms]  [k] irqtime_account_process_tick
 0.03%  find  [kernel.kallsyms]  [k] account_system_vtime
 0.03%  find  [kernel.kallsyms]  [k] task_tick_fair
 0.03%  find  [kernel.kallsyms]  [k] perf_event_task_tick
 0.03%  find  [kernel.kallsyms]  [k] scheduler_tick
 0.03%  find  [kernel.kallsyms]  [k] rb_erase
 0.02%  find  [kernel.kallsyms]  [k] native_write_msr_safe
 0.02%  find  [kernel.kallsyms]  [k] native_sched_clock
 0.02%  find  [kernel.kallsyms]  [k] dma_issue_pending_all
 0.02%  find  [kernel.kallsyms]  [k] handle_irq_event_percpu
 0.02%  find  [kernel.kallsyms]  [k] timerqueue_del
 0.02%  find  [kernel.kallsyms]  [k] run_timer_softirq
 0.02%  find  [kernel.kallsyms]  [k] get_mm_counter
 0.02%  find  [kernel.kallsyms]  [k] __rcu_pending
 0.02%  find  [kernel.kallsyms]  [k] tick_program_event
 0.01%  find  [kernel.kallsyms]  [k] __netif_receive_skb
 0.01%  find  [kernel.kallsyms]  [k] ip_route_input_common
 0.01%  find  [kernel.kallsyms]  [k] __insert_vmap_area
 0.01%  find  [igb]              [k] igb_alloc_rx_buffers_adv
 0.01%  find  [kernel.kallsyms]  [k] irq_exit
 0.01%  find  [kernel.kallsyms]  [k] acct_update_integrals
 0.01%  find  [kernel.kallsyms]  [k] apic_timer_interrupt
 0.01%  find  [kernel.kallsyms]  [k] tick_sched_timer
 0.01%  find  [kernel.kallsyms]  [k] __remove_hrtimer
 0.01%  find  [kernel.kallsyms]  [k] do_IRQ
 0.01%  find  [kernel.kallsyms]  [k] dev_gro_receive
 0.01%  find  [kernel.kallsyms]  [k] net_rx_action
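For anyone who wants to capture the same per-process view, a profile of just the spinning task can be taken with a perf invocation like the following (illustrative; substitute the PID of whatever process is stuck):

$ sudo perf record -g -p "$(pgrep -x find)" -- sleep 30
$ sudo perf report --stdio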
deadlock on vmap_area_lock
I've got two compute clusters with around 350 machines each which are running kernels based off of 3.1.9 (Yes, I realize this is ancient by today's standards). All of the machines run a 'find' command once an hour on one of the mounted XFS filesystems. Occasionally these find commands get stuck, requiring a reboot of the system. I took a peek today and see this with perf:

72.22%  find  [kernel.kallsyms]  [k] _raw_spin_lock
        |
        --- _raw_spin_lock
           |
           |--98.84%-- vm_map_ram
           |          _xfs_buf_map_pages
           |          xfs_buf_get
           |          xfs_buf_read
           |          xfs_trans_read_buf
           |          xfs_da_do_buf
           |          xfs_da_read_buf
           |          xfs_dir2_block_getdents
           |          xfs_readdir
           |          xfs_file_readdir
           |          vfs_readdir
           |          sys_getdents
           |          system_call_fastpath
           |          __getdents64
           |
           |--1.12%-- _xfs_buf_map_pages
           |          xfs_buf_get
           |          xfs_buf_read
           |          xfs_trans_read_buf
           |          xfs_da_do_buf
           |          xfs_da_read_buf
           |          xfs_dir2_block_getdents
           |          xfs_readdir
           |          xfs_file_readdir
           |          vfs_readdir
           |          sys_getdents
           |          system_call_fastpath
           |          __getdents64
            --0.04%-- [...]

Looking at the code my best guess is that we are spinning on vmap_area_lock, but I could be wrong. This is the only process spinning on the machine, so I'm assuming either another process has blocked while holding the lock, or perhaps this find process has tried to acquire the vmap_area_lock twice?

I've skimmed through the change logs between 3.1 and 3.9 but nothing stood out as a fix for this bug. Does this ring a bell with anyone? If I have a machine that is currently in one of these stuck states, does anyone have any tips for identifying the process currently holding the lock?

Additionally, as I mentioned before, I have two clusters of roughly equal size, though one cluster hits this issue more frequently. On that cluster with approximately 350 machines we get about 10 stuck machines a month. The other cluster has about 450 machines, but we only get about 1 or 2 stuck machines a month. Both clusters run the same find command every hour, but the workloads on the machines are different. The cluster that hits the issue more frequently tends to run more memory-intensive jobs.

I'm open to building some debug kernels to help track this down, though I can't upgrade all of the machines in one shot so it may take a while to reproduce. I'm happy to provide any other information if people have questions.

Thanks,
Shawn
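On the "who holds the lock" question: on a wedged box the magic-sysrq interface can dump task state without rebooting, which usually shows whoever is stuck inside the vmap/vmalloc paths. These are standard kernel facilities (they require CONFIG_MAGIC_SYSRQ), shown here for illustration:

# echo 1 > /proc/sys/kernel/sysrq     # make sure sysrq is enabled
# echo w > /proc/sysrq-trigger        # stack traces of blocked (D state) tasks
# echo l > /proc/sysrq-trigger        # backtraces of all active CPUs
# dmesg | tail -n 200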
Re: 3.7 HDMI channel map regression
On Sun, Feb 17, 2013 at 09:34:53AM +0100, Takashi Iwai wrote:
> At Sat, 16 Feb 2013 18:22:25 -0600, Shawn Bohrer wrote:
> > On Mon, Jan 28, 2013 at 08:52:05PM -0600, Shawn Bohrer wrote:
> > > On Mon, Jan 28, 2013 at 09:56:33AM +0100, Takashi Iwai wrote:
> > > > At Sun, 27 Jan 2013 19:18:27 -0600, Shawn Bohrer wrote:
> > > > >
> > > > > Hi Takashi,
> > > > >
> > > > > I recently updated my HTPC from 3.6.11 to 3.7.2 and this caused my RL and FC channels to swap, and my RR and LFE channels to swap for PCM audio. Doing a git bisect identified d45e6889ee69456a4d5b1bbb32252f460cd48fa9 "ALSA: hda - Provide the proper channel mapping for generic HDMI driver" as the commit that caused my channels to swap. The commit doesn't revert cleanly on 3.7.4, and I haven't really looked to see what the correct fix might be.
> > > > >
> > > > > Some info that may be relevant, the sound card is a:
> > > > >
> > > > > 00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04)
> > > > >
> > > > > The machine is running Fedora 18 and audio goes over HDMI to a 5.1 receiver. I'm not really sure what other info you might need, but let me know if you need something else or have any patches you would like me to test.
> > > >
> > > > OK, it's the first time to get a bug report about this. Could you tell me how did you test it (i.e. which application, which sound backend)? Can you confirm that it's reproduced via speaker-test program in alsa-utils package?
> > >
> > > I originally noticed the problem when all of the dialog started coming out of my rear left speaker in MythTV after the kernel update. Then I started using the Gnome 3 sound configuration gui in the system settings, which has a speaker test and I assume is using pulseaudio. Running 'speaker-test -c6 -l1 -twav' also reproduces the problem.
> > >
> > > For reference here are the versions of the various packages that I'm running:
> > >
> > > alsa-utils-1.0.26-1.fc18.x86_64
> > > alsa-firmware-1.0.25-2.fc18.noarch
> > > alsa-plugins-pulseaudio-1.0.26-2.fc18.x86_64
> > > alsa-lib-devel-1.0.26-2.fc18.x86_64
> > > alsa-lib-1.0.26-2.fc18.x86_64
> > > alsa-tools-firmware-1.0.26.1-1.fc18.x86_64
> > > pulseaudio-gdm-hooks-2.1-5.fc18.x86_64
> > > pulseaudio-libs-2.1-5.fc18.x86_64
> > > pulseaudio-libs-glib2-2.1-5.fc18.x86_64
> > > pulseaudio-module-x11-2.1-5.fc18.x86_64
> > > pulseaudio-module-bluetooth-2.1-5.fc18.x86_64
> > > pulseaudio-2.1-5.fc18.x86_64
> > > pulseaudio-utils-2.1-5.fc18.x86_64
> >
> > Hi Takashi,
> >
> > Any updates on this issue? I'd really like to see this issue fixed and am happy to help in any way I can. Until this gets fixed I'm stuck on a 3.6.* kernel.
>
> There is one fix in sound git tree regarding the HDMI channel map, but it's queued for 3.9 kernel (then backported to stable tree). Try sound.git tree or wait for a while until the upstream merge process above is done.

Thanks Takashi, I just tested the sound.git master branch and this does indeed fix my issue. I'll look forward to this going into 3.9.

--
Shawn
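For anyone else bitten before the fix reaches 3.9 and stable, building a kernel from Takashi's sound tree looks roughly like this (a sketch only; the tree URL is the one published on kernel.org for sound.git, and the build steps are abbreviated and assume a distro config is available in /boot):

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git
$ cd sound
$ cp /boot/config-"$(uname -r)" .config
$ make olddefconfig
$ make -j"$(nproc)" && sudo make modules_install install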
Re: 3.7 HDMI channel map regression
On Mon, Jan 28, 2013 at 08:52:05PM -0600, Shawn Bohrer wrote:
> On Mon, Jan 28, 2013 at 09:56:33AM +0100, Takashi Iwai wrote:
> > At Sun, 27 Jan 2013 19:18:27 -0600, Shawn Bohrer wrote:
> > >
> > > Hi Takashi,
> > >
> > > I recently updated my HTPC from 3.6.11 to 3.7.2 and this caused my RL and FC channels to swap, and my RR and LFE channels to swap for PCM audio. Doing a git bisect identified d45e6889ee69456a4d5b1bbb32252f460cd48fa9 "ALSA: hda - Provide the proper channel mapping for generic HDMI driver" as the commit that caused my channels to swap. The commit doesn't revert cleanly on 3.7.4, and I haven't really looked to see what the correct fix might be.
> > >
> > > Some info that may be relevant, the sound card is a:
> > >
> > > 00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04)
> > >
> > > The machine is running Fedora 18 and audio goes over HDMI to a 5.1 receiver. I'm not really sure what other info you might need, but let me know if you need something else or have any patches you would like me to test.
> >
> > OK, it's the first time to get a bug report about this. Could you tell me how did you test it (i.e. which application, which sound backend)? Can you confirm that it's reproduced via speaker-test program in alsa-utils package?
>
> I originally noticed the problem when all of the dialog started coming out of my rear left speaker in MythTV after the kernel update. Then I started using the Gnome 3 sound configuration gui in the system settings, which has a speaker test and I assume is using pulseaudio. Running 'speaker-test -c6 -l1 -twav' also reproduces the problem.
>
> For reference here are the versions of the various packages that I'm running:
>
> alsa-utils-1.0.26-1.fc18.x86_64
> alsa-firmware-1.0.25-2.fc18.noarch
> alsa-plugins-pulseaudio-1.0.26-2.fc18.x86_64
> alsa-lib-devel-1.0.26-2.fc18.x86_64
> alsa-lib-1.0.26-2.fc18.x86_64
> alsa-tools-firmware-1.0.26.1-1.fc18.x86_64
> pulseaudio-gdm-hooks-2.1-5.fc18.x86_64
> pulseaudio-libs-2.1-5.fc18.x86_64
> pulseaudio-libs-glib2-2.1-5.fc18.x86_64
> pulseaudio-module-x11-2.1-5.fc18.x86_64
> pulseaudio-module-bluetooth-2.1-5.fc18.x86_64
> pulseaudio-2.1-5.fc18.x86_64
> pulseaudio-utils-2.1-5.fc18.x86_64

Hi Takashi,

Any updates on this issue? I'd really like to see this issue fixed and am happy to help in any way I can. Until this gets fixed I'm stuck on a 3.6.* kernel.

Thanks,
Shawn
Re: 3.7 HDMI channel map regression
On Mon, Jan 28, 2013 at 09:56:33AM +0100, Takashi Iwai wrote:
> At Sun, 27 Jan 2013 19:18:27 -0600, Shawn Bohrer wrote:
> >
> > Hi Takashi,
> >
> > I recently updated my HTPC from 3.6.11 to 3.7.2 and this caused my RL and FC channels to swap, and my RR and LFE channels to swap for PCM audio. Doing a git bisect identified d45e6889ee69456a4d5b1bbb32252f460cd48fa9 "ALSA: hda - Provide the proper channel mapping for generic HDMI driver" as the commit that caused my channels to swap. The commit doesn't revert cleanly on 3.7.4, and I haven't really looked to see what the correct fix might be.
> >
> > Some info that may be relevant, the sound card is a:
> >
> > 00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04)
> >
> > The machine is running Fedora 18 and audio goes over HDMI to a 5.1 receiver. I'm not really sure what other info you might need, but let me know if you need something else or have any patches you would like me to test.
>
> OK, it's the first time to get a bug report about this. Could you tell me how did you test it (i.e. which application, which sound backend)? Can you confirm that it's reproduced via speaker-test program in alsa-utils package?

I originally noticed the problem when all of the dialog started coming out of my rear left speaker in MythTV after the kernel update. Then I started using the Gnome 3 sound configuration gui in the system settings, which has a speaker test and I assume is using pulseaudio. Running 'speaker-test -c6 -l1 -twav' also reproduces the problem.

For reference here are the versions of the various packages that I'm running:

alsa-utils-1.0.26-1.fc18.x86_64
alsa-firmware-1.0.25-2.fc18.noarch
alsa-plugins-pulseaudio-1.0.26-2.fc18.x86_64
alsa-lib-devel-1.0.26-2.fc18.x86_64
alsa-lib-1.0.26-2.fc18.x86_64
alsa-tools-firmware-1.0.26.1-1.fc18.x86_64
pulseaudio-gdm-hooks-2.1-5.fc18.x86_64
pulseaudio-libs-2.1-5.fc18.x86_64
pulseaudio-libs-glib2-2.1-5.fc18.x86_64
pulseaudio-module-x11-2.1-5.fc18.x86_64
pulseaudio-module-bluetooth-2.1-5.fc18.x86_64
pulseaudio-2.1-5.fc18.x86_64
pulseaudio-utils-2.1-5.fc18.x86_64

> For further debugging, please give the following:
> - alsa-info.sh output while playing 5.1 sound

upload=true&script=true&cardinfo=
!!
!!ALSA Information Script v 0.4.60
!!
!!Script ran on: Tue Jan 29 02:39:13 UTC 2013

!!Linux Distribution
!!------------------
Fedora release 18 (Spherical Cow)
Fedora release 18 (Spherical Cow)
NAME=Fedora
ID=fedora
PRETTY_NAME="Fedora 18 (Spherical Cow)"
CPE_NAME="cpe:/o:fedoraproject:fedora:18"
Fedora release 18 (Spherical Cow)
Fedora release 18 (Spherical Cow)

!!DMI Information
!!---------------
Manufacturer:      To Be Filled By O.E.M.
Product Name:      To Be Filled By O.E.M.
Product Version:   To Be Filled By O.E.M.

!!Kernel Information
!!------------------
Kernel release:    3.7.2-204.fc18.x86_64
Operating System:  GNU/Linux
Architecture:      x86_64
Processor:         x86_64
SMP Enabled:       Yes

!!ALSA Version
!!------------
Driver version:     k3.7.2-204.fc18.x86_64
Library version:    1.0.26
Utilities version:  1.0.26

!!Loaded ALSA modules
!!-------------------
snd_hda_intel

!!Sound Servers on this system
!!----------------------------
Pulseaudio:
      Installed - Yes (/usr/bin/pulseaudio)
      Running - Yes
Jack:
      Installed - Yes (/usr/bin/jackd)
      Running - No

!!Soundcards recognised by ALSA
!!-----------------------------
 0 [PCH]: HDA-Intel - HDA Intel PCH
          HDA Intel PCH at 0xf7d1 irq 46

!!PCI Soundcards installed in the system
!!--------------------------------------
00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04)

!!Advanced information - PCI Vendor/Device/Subsystem ID's
!!-------------------------------------------------------
00:1b.0 0403: 8086:1e20 (rev 04)
        Subsystem: 1849:1898

!!Loaded sound module options
!!---------------------------
!!Module: snd_hda_intel
        align_buffer_size : -1
        bdl_pos_adj : 1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
        beep_mode : N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N
        enable : Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y
        enable_msi : -1
        id : (null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null)
        index : -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
        model : (null),(null),(null),(null),(null
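To capture that report while sound is playing, an invocation like the following works (illustrative; the --no-upload switch keeps the report in a local file instead of posting it to the alsa-project server):

$ speaker-test -c6 -l1 -twav &
$ sudo ./alsa-info.sh --no-upload    # prints the path of the saved report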
3.7 HDMI channel map regression
Hi Takashi,

I recently updated my HTPC from 3.6.11 to 3.7.2 and this caused my RL and FC channels to swap, and my RR and LFE channels to swap for PCM audio. Doing a git bisect identified d45e6889ee69456a4d5b1bbb32252f460cd48fa9 "ALSA: hda - Provide the proper channel mapping for generic HDMI driver" as the commit that caused my channels to swap. The commit doesn't revert cleanly on 3.7.4, and I haven't really looked to see what the correct fix might be.

Some info that may be relevant, the sound card is a:

00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04)

The machine is running Fedora 18 and audio goes over HDMI to a 5.1 receiver. I'm not really sure what other info you might need, but let me know if you need something else or have any patches you would like me to test.

Thanks,
Shawn
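For readers who haven't driven a bisect like the one above, the workflow looks roughly like this (illustrative; each round means building and booting the candidate kernel, then re-running a channel test such as speaker-test):

$ git bisect start
$ git bisect bad v3.7.2       # channels swapped on this kernel
$ git bisect good v3.6.11     # channels correct on this kernel
  ... build, boot, and test the revision git checks out ...
$ git bisect good             # or 'git bisect bad', depending on the test
  ... repeat until git names the first bad commit ...
$ git bisect reset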
[PATCH] sched_rt: Use root_domain of rt_rq not current processor
When the system has multiple domains do_sched_rt_period_timer() can run on any CPU and may iterate over all rt_rq in cpu_online_mask. This means when balance_runtime() is run for a given rt_rq, that rt_rq may be in a different rd than the current processor. Thus if we use smp_processor_id() to get rd in do_balance_runtime() we may borrow runtime from a rt_rq that is not part of our rd. This changes do_balance_runtime to get the rd from the passed-in rt_rq, ensuring that we borrow runtime only from the correct rd for the given rt_rq.

This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime when we try to reclaim runtime lent to other rt_rq but the runtime has been lent to a rt_rq in another rd.

Signed-off-by: Shawn Bohrer <sboh...@rgmadvisors.com>
---
 kernel/sched/rt.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 418feb0..4f02b28 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -566,7 +566,7 @@ static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
 static int do_balance_runtime(struct rt_rq *rt_rq)
 {
 	struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
-	struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
+	struct root_domain *rd = rq_of_rt_rq(rt_rq)->rd;
 	int i, weight, more = 0;
 	u64 rt_period;
-- 
1.7.7.6
Re: kernel BUG at kernel/sched_rt.c:493!
On Thu, Jan 10, 2013 at 05:13:11AM +0100, Mike Galbraith wrote:
> On Tue, 2013-01-08 at 09:01 -0600, Shawn Bohrer wrote:
> > On Tue, Jan 08, 2013 at 09:36:05AM -0500, Steven Rostedt wrote:
> > > > I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug is still present in the latest kernel.
> > >
> > > Shawn,
> > >
> > > Can you send me your .config file.
> >
> > I've attached the 3.8.0-rc2 config that I used to reproduce this in an 8 core kvm image. Let me know if you need anything else.
>
> I tried beating on my little Q6600 with no success. I even tried setting the entire box rt, GUI and all, nada.
>
> Hm, maybe re-installing systemd..

I don't know if Steve has had any success. I can reproduce this easily now, so I'm happy to do some debugging if anyone has things they want me to try. Here is some info on my setup at the moment. I'm using an 8-core KVM image now with an xfs file system. We do use systemd, if that is relevant. My cpuset controller is mounted on /cgroup/cpuset and we use libcgroup-tools to move everything on the system that can be moved into /cgroup/cpuset/sysdefault/. I've also boosted all kworker threads to run as SCHED_FIFO with a priority of 51.

From there I just drop the three attached shell scripts (burn.sh, sched_domain_bug.sh and sched_domain_burn.sh) in /root/ and run /root/sched_domain_bug.sh as root. Usually the bug triggers in less than a minute. You may need to tweak my shell scripts if your setup is different, but they are very rudimentary.

In order to try digging up some more info I applied the following patch and triggered the bug a few times. The results are always essentially the same:

---
 kernel/sched/rt.c | 9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 418feb0..fba7f01 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -650,6 +650,8 @@ static void __disable_runtime(struct rq *rq)
 		 * we lend and now have to reclaim.
 		 */
 		want = rt_b->rt_runtime - rt_rq->rt_runtime;
+		printk(KERN_INFO "Initial want: %lld rt_b->rt_runtime: %llu rt_rq->rt_runtime: %llu\n",
+		       want, rt_b->rt_runtime, rt_rq->rt_runtime);
 
 		/*
 		 * Greedy reclaim, take back as much as we can.
@@ -684,7 +686,12 @@ static void __disable_runtime(struct rq *rq)
 		 * We cannot be left wanting - that would mean some runtime
 		 * leaked out of the system.
 		 */
-		BUG_ON(want);
+		if (want) {
+			printk(KERN_ERR "BUG triggered, want: %lld\n", want);
+			for_each_cpu(i, rd->span) {
+				print_rt_stats(NULL, i);
+			}
+		}
 balanced:
 		/*
 		 * Disable all the borrow logic by pretending we have inf
---

Here is the output:

[   81.278842] SysRq : Changing Loglevel
[   81.279027] Loglevel set to 9
[   83.285456] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
[   85.286452] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
[   85.289625] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
[   87.287435] Initial want: 1 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
[   87.290718] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
[   89.288469] Initial want: -5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 10
[   89.291550] Initial want: 15000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 8
[   89.292940] Initial want: 1 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
[   89.294082] Initial want: 1 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
[   89.295194] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
[   89.296274] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
[   90.959004] [sched_delayed] sched: RT throttling activated
[   91.289470] Initial want: 2 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
[   91.292767] Initial want: 2 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
[   91.294037] Initial want: 2 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
[   91.295364] Initial want: 2 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
[   91.296355] BUG triggered, want: 2
[   91.296355]
[   91.296355] rt_rq[7]:
[   91.296355]   .rt_nr_running : 0
[   91.296355]   .rt_throttled  : 0
[   91.296355]   .rt_time       : 0.00
[   91.296355]   .rt_runtime    : 750.00
[   91.307332] Initial want: -5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 10
[   91.308440] Initial want: -1 rt_b->rt_runtime: 95000 rt_rq
Re: kernel BUG at kernel/sched_rt.c:493!
On Tue, Jan 08, 2013 at 09:36:05AM -0500, Steven Rostedt wrote:
> > I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
> > is still present in the latest kernel.
>
> Shawn,
>
> Can you send me your .config file.

I've attached the 3.8.0-rc2 config that I used to reproduce this in an
8 core kvm image.  Let me know if you need anything else.

Thanks,
Shawn

[Attachment: 3.8.0-rc2.config.gz (GNU Zip compressed data)]
Re: kernel BUG at kernel/sched_rt.c:493!
On Mon, Jan 07, 2013 at 11:58:18AM -0600, Shawn Bohrer wrote:
> On Sat, Jan 05, 2013 at 11:46:32AM -0600, Shawn Bohrer wrote:
> > I've tried reproducing the issue, but so far I've been unsuccessful,
> > but I believe that is because my RT tasks aren't using enough CPU to
> > cause borrowing from the other runqueues.  Normally our RT tasks use
> > very little CPU so I'm not entirely sure what conditions caused them
> > to run into throttling on the day that this happened.
>
> I've managed to reproduce this a couple times now on 3.1.9.  I'll give
> this a try later with a more recent kernel.  Here is what I've done to
> reproduce the issue.
>
> # Setup in shell 1
> root@berbox39:/cgroup/cpuset# mkdir package0
> root@berbox39:/cgroup/cpuset# echo 0 > package0/cpuset.mems
> root@berbox39:/cgroup/cpuset# echo 0,2,4,6 > package0/cpuset.cpus
> root@berbox39:/cgroup/cpuset# cat cpuset.sched_load_balance
> 1
> root@berbox39:/cgroup/cpuset# cat package0/cpuset.sched_load_balance
> 1
> root@berbox39:/cgroup/cpuset# cat sysdefault/cpuset.sched_load_balance
> 1
> root@berbox39:/cgroup/cpuset# echo 1,3,5,7 > sysdefault/cpuset.cpus
> root@berbox39:/cgroup/cpuset# echo 0 > sysdefault/cpuset.mems
> root@berbox39:/cgroup/cpuset# echo $$ > package0/tasks
>
> # Setup in shell 2
> root@berbox39:~# cd /cgroup/cpuset/
> root@berbox39:/cgroup/cpuset# chrt -f -p 60 $$
> root@berbox39:/cgroup/cpuset# echo $$ > sysdefault/tasks
>
> # In shell 1
> root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &
> root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &
>
> # In shell 2
> root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
> root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance
> root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
> root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance
>
> I haven't found the exact magic combination but I've been going back
> and forth adding/killing burn.sh processes and toggling
> cpuset.sched_load_balance and in a couple of minutes I can usually get
> the machine to trigger the bug.

I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
is still present in the latest kernel.

Also, re-reading my instructions above, /root/burn.sh is just a simple:

while true; do : ; done

I've also had to make the kworker threads SCHED_FIFO with a higher
priority than burn.sh or, as expected, I can lock up the system due to
some xfs threads getting starved.

Let me know if anyone needs any more information, or needs me to try
anything, since it appears I can trigger this fairly easily now.

--
Shawn
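For anyone who wants to try this without the attachments, the steps
above can be rolled into a single script.  This is a sketch
reconstructed from the commands in this thread, not the actual
sched_domain_bug.sh attachment; the /cgroup/cpuset mount point, the
pre-existing "sysdefault" cpuset, and the sleep intervals are
assumptions:

#!/bin/bash
# Sketch of the repro from this thread.  Assumes the cpuset controller
# is already mounted at /cgroup/cpuset with a populated sysdefault set.
cd /cgroup/cpuset

# Keep this shell on the sysdefault CPUs, runnable above the burners.
echo $$ > sysdefault/tasks
chrt -f -p 60 $$

# Carve out a second cpuset for the SCHED_FIFO CPU burners.
mkdir -p package0
echo 0       > package0/cpuset.mems
echo 0,2,4,6 > package0/cpuset.cpus
echo 0       > sysdefault/cpuset.mems
echo 1,3,5,7 > sysdefault/cpuset.cpus

# burn.sh is just:  while true; do : ; done
chrt -f 1 /root/burn.sh &
echo $! > package0/tasks
chrt -f 1 /root/burn.sh &
echo $! > package0/tasks

# Toggle load balancing so the scheduler repeatedly tears down and
# rebuilds its domains while the burners are hitting RT throttling;
# this is what eventually trips BUG_ON(want) in __disable_runtime().
while true; do
	echo 0 > cpuset.sched_load_balance
	sleep 1
	echo 1 > cpuset.sched_load_balance
	sleep 1
done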
Re: kernel BUG at kernel/sched_rt.c:493!
On Sat, Jan 05, 2013 at 11:46:32AM -0600, Shawn Bohrer wrote:
> I've tried reproducing the issue, but so far I've been unsuccessful,
> but I believe that is because my RT tasks aren't using enough CPU to
> cause borrowing from the other runqueues.  Normally our RT tasks use
> very little CPU so I'm not entirely sure what conditions caused them to
> run into throttling on the day that this happened.

I've managed to reproduce this a couple times now on 3.1.9.  I'll give
this a try later with a more recent kernel.  Here is what I've done to
reproduce the issue.

# Setup in shell 1
root@berbox39:/cgroup/cpuset# mkdir package0
root@berbox39:/cgroup/cpuset# echo 0 > package0/cpuset.mems
root@berbox39:/cgroup/cpuset# echo 0,2,4,6 > package0/cpuset.cpus
root@berbox39:/cgroup/cpuset# cat cpuset.sched_load_balance
1
root@berbox39:/cgroup/cpuset# cat package0/cpuset.sched_load_balance
1
root@berbox39:/cgroup/cpuset# cat sysdefault/cpuset.sched_load_balance
1
root@berbox39:/cgroup/cpuset# echo 1,3,5,7 > sysdefault/cpuset.cpus
root@berbox39:/cgroup/cpuset# echo 0 > sysdefault/cpuset.mems
root@berbox39:/cgroup/cpuset# echo $$ > package0/tasks

# Setup in shell 2
root@berbox39:~# cd /cgroup/cpuset/
root@berbox39:/cgroup/cpuset# chrt -f -p 60 $$
root@berbox39:/cgroup/cpuset# echo $$ > sysdefault/tasks

# In shell 1
root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &
root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &

# In shell 2
root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance
root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance

I haven't found the exact magic combination but I've been going back
and forth adding/killing burn.sh processes and toggling
cpuset.sched_load_balance and in a couple of minutes I can usually get
the machine to trigger the bug.

--
Shawn
kernel BUG at kernel/sched_rt.c:493!
We recently managed to crash 10 of our test machines at the same time.
Half of the machines were running a 3.1.9 kernel and half were running
3.4.9.  I realize that these are both fairly old kernels, but I've
skimmed the list of fixes in the 3.4.* stable series and didn't see
anything that appeared to be relevant to this issue.

All we managed to get was some screenshots of the stacks from the
consoles.  On one of the 3.1.9 machines you can see we hit the
BUG_ON(want) statement in __disable_runtime() at kernel/sched_rt.c:493,
and all of the machines had essentially the same stack showing:

  rq_offline_rt
  rq_attach_root
  cpu_attach_domain
  partition_sched_domains
  do_rebuild_sched_domains

Here is one of the screenshots of the 3.1.9 machines:
https://dl.dropbox.com/u/84066079/berbox38.png

And here is one from a 3.4.9 machine:
https://dl.dropbox.com/u/84066079/berbox18.png

Three of the five 3.4.9 machines also managed to print "[sched_delayed]
sched: RT throttling activated" ~7 minutes before the machines locked
up.

I've tried reproducing the issue, but so far I've been unsuccessful,
but I believe that is because my RT tasks aren't using enough CPU to
cause borrowing from the other runqueues.  Normally our RT tasks use
very little CPU so I'm not entirely sure what conditions caused them to
run into throttling on the day that this happened.  The details that I
do know about the workload that caused this are as follows.

1) These are all dual socket 4 core X5460 systems with no
   hyperthreading.  Thus there are 8 cores total in the system.
2) We use the cpuset cgroup to apply CPU affinity to various types of
   processes.  Initially everything starts out in a single cpuset and
   the top level cpuset has cpuset.sched_load_balance=1, thus there is
   only a single scheduling domain.
3) In this case tasks were then placed into four non-overlapping
   cpusets: 1 containing a single core and a single SCHED_FIFO task, 2
   containing two cores and multiple SCHED_FIFO tasks, and 1 containing
   3 cores and everything else on the system running as SCHED_OTHER.
4) In the case of cpusets that contain SCHED_FIFO tasks, the tasks
   start out as SCHED_OTHER, are placed into the cpuset, then change
   their policy to SCHED_FIFO.
5) Once all tasks are placed into non-overlapping cpusets, the top
   level cpuset.sched_load_balance is set to 0 to split the system into
   four scheduling domains.
6) The system ran like this for some unknown amount of time.
7) All the processes are then sent a signal to exit, and at the same
   time the top level cpuset.sched_load_balance is set back to 1.  This
   is when the systems locked up.

Hopefully that is enough information to give someone more familiar with
the scheduler code an idea of where the bug is here.  I will point out
that in step #5 above there is a small window where the RT tasks could
encounter runtime limits but are still in a single big scheduling
domain.  I don't know if that is what happened or if it is simply
sufficient to hit the runtime limits while the system is split into
four domains.

For the curious, we are using the default RT runtime limits:

# grep . /proc/sys/kernel/sched_rt_*
/proc/sys/kernel/sched_rt_period_us:1000000
/proc/sys/kernel/sched_rt_runtime_us:950000

Let me know if anyone needs any more information about this issue.

Thanks,
Shawn
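To make the sequence above concrete, here is a rough shell sketch of
steps 2-7.  The cpuset names, core assignments, and the $APP_PIDS
variable are all illustrative stand-ins; the real systems drove this
through their own tooling:

cd /cgroup/cpuset

# Step 3: four non-overlapping cpusets covering the 8 cores.
for spec in rt_a:4 rt_b:5,6 rt_c:0,1 other:2,3,7; do
	name=${spec%%:*}
	mkdir -p $name
	echo 0          > $name/cpuset.mems
	echo ${spec#*:} > $name/cpuset.cpus
done

# Step 4: tasks enter their cpuset as SCHED_OTHER, then switch policy,
# e.g.:  echo <pid> > rt_a/tasks; chrt -f -p 50 <pid>

# Step 5: split the box into four scheduling domains.
echo 0 > cpuset.sched_load_balance

# ... step 6: run for a while ...

# Step 7: tear everything down while merging the domains back into one.
# This is the point at which the machines hit BUG_ON(want) in
# __disable_runtime().
kill -TERM $APP_PIDS
echo 1 > cpuset.sched_load_balance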
Re: mlx4_en_alloc_frag allocation failures
On Fri, Sep 28, 2012 at 05:50:08PM +0200, Eric Dumazet wrote:
> On Fri, 2012-09-28 at 10:14 -0500, Shawn Bohrer wrote:
> > We've got a new application that is receiving UDP multicast data using
> > AF_PACKET and writing out the packets in a custom format to disk.  The
> > packet rates are bursty, but it seems to be roughly 100 Mbps on
> > average for 1 minute periods.  With this application running all day
> > we get a lot of these messages:
> >
> > [1298269.103034] kswapd1: page allocation failure: order:2, mode:0x4020
> > [1298269.103038] Pid: 80, comm: kswapd1 Not tainted 3.4.9-2.rgm.fc16.x86_64 #1
> > [1298269.103040] Call Trace:
> > [1298269.103041]  <IRQ>  [] warn_alloc_failed+0xf6/0x160
> > [1298269.103053]  [] ? skb_copy_bits+0x16d/0x2c0
> > [1298269.103058]  [] ? wakeup_kswapd+0x69/0x160
> > [1298269.103060]  [] __alloc_pages_nodemask+0x6e8/0x930
> > [1298269.103064]  [] alloc_pages_current+0xb6/0x120
> > [1298269.103070]  [] mlx4_en_alloc_frag+0x16b/0x1e0 [mlx4_en]
> > [1298269.103073]  [] mlx4_en_complete_rx_desc+0x120/0x1d0 [mlx4_en]
> > [1298269.103076]  [] mlx4_en_process_rx_cq+0x584/0x700 [mlx4_en]
> > [1298269.103079]  [] mlx4_en_poll_rx_cq+0x3f/0x80 [mlx4_en]
> > [1298269.103083]  [] net_rx_action+0x119/0x210
> > [1298269.103086]  [] __do_softirq+0xb0/0x220
> > [1298269.103090]  [] ? handle_irq_event+0x4d/0x70
> > [1298269.103095]  [] call_softirq+0x1c/0x30
> > [1298269.103100]  [] do_softirq+0x55/0x90
> > [1298269.103101]  [] irq_exit+0x75/0x80
> > [1298269.103103]  [] do_IRQ+0x63/0xe0
> > [1298269.103107]  [] common_interrupt+0x67/0x67
> > [1298269.103108]  <EOI>  [] ? _raw_spin_unlock_irqrestore+0xf/0x20
> > [1298269.103113]  [] compaction_alloc+0x361/0x3f0
> > [1298269.103115]  [] ? pagevec_lru_move_fn+0xd7/0xf0
> > [1298269.103118]  [] migrate_pages+0xa9/0x470
> > [1298269.103120]  [] ? perf_trace_mm_compaction_migratepages+0xd0/0xd0
> > [1298269.103122]  [] compact_zone+0x4cb/0x910
> > [1298269.103124]  [] __compact_pgdat+0x14b/0x190
> > [1298269.103125]  [] compact_pgdat+0x2d/0x30
> > [1298269.103129]  [] ? fragmentation_index+0x19/0x70
> > [1298269.103131]  [] balance_pgdat+0x6ef/0x710
> > [1298269.103133]  [] kswapd+0x14a/0x390
> > [1298269.103136]  [] ? add_wait_queue+0x60/0x60
> > [1298269.103138]  [] ? balance_pgdat+0x710/0x710
> > [1298269.103140]  [] kthread+0x93/0xa0
> > [1298269.103142]  [] kernel_thread_helper+0x4/0x10
> > [1298269.103144]  [] ? kthread_worker_fn+0x140/0x140
> > [1298269.103146]  [] ? gs_change+0xb/0xb
> >
> > The kernel is based on a Fedora 16 kernel and actually has the 3.4.10
> > patches applied.  I can easily test patches or different kernels.
> >
> > I'm mostly wondering if there is anything that can be done about these
> > failures?  It appears that these failures have to do with handling
> > fragmented IP frames, but the majority of the packets on this machine
> > should not be fragmented (there are probably some that are).
> >
> > From a memory management point of view the system has 48GB of RAM, and
> > typically 44GB of that is page cache.  The dirty pages seem to hover
> > around 5-6MB and the filesystem/disks don't seem to have any problems
> > keeping up with writing out the data.
>
> What is the value of /proc/sys/vm/min_free_kbytes ?

$ cat /proc/sys/vm/min_free_kbytes
90112

--
Shawn
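Eric's question points at the usual first knob for atomic high-order
allocation failures: min_free_kbytes controls how much memory kswapd
tries to keep free, and a larger reserve leaves more contiguous pages
available for atomic order-2 allocations like the mlx4 RX refill above.
A sketch of checking and raising it; the 256 MB value is illustrative,
not something suggested in this thread:

# Check the current reserve (90112 kB on the machine above), then raise it.
cat /proc/sys/vm/min_free_kbytes
echo 262144 > /proc/sys/vm/min_free_kbytes
# Equivalently:  sysctl -w vm.min_free_kbytes=262144
# (add vm.min_free_kbytes to /etc/sysctl.conf to persist across reboots)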
mlx4_en_alloc_frag allocation failures
We've got a new application that is receiving UDP multicast data using
AF_PACKET and writing out the packets in a custom format to disk.  The
packet rates are bursty, but it seems to be roughly 100 Mbps on average
for 1 minute periods.  With this application running all day we get a
lot of these messages:

[1298269.103034] kswapd1: page allocation failure: order:2, mode:0x4020
[1298269.103038] Pid: 80, comm: kswapd1 Not tainted 3.4.9-2.rgm.fc16.x86_64 #1
[1298269.103040] Call Trace:
[1298269.103041]  <IRQ>  [] warn_alloc_failed+0xf6/0x160
[1298269.103053]  [] ? skb_copy_bits+0x16d/0x2c0
[1298269.103058]  [] ? wakeup_kswapd+0x69/0x160
[1298269.103060]  [] __alloc_pages_nodemask+0x6e8/0x930
[1298269.103064]  [] alloc_pages_current+0xb6/0x120
[1298269.103070]  [] mlx4_en_alloc_frag+0x16b/0x1e0 [mlx4_en]
[1298269.103073]  [] mlx4_en_complete_rx_desc+0x120/0x1d0 [mlx4_en]
[1298269.103076]  [] mlx4_en_process_rx_cq+0x584/0x700 [mlx4_en]
[1298269.103079]  [] mlx4_en_poll_rx_cq+0x3f/0x80 [mlx4_en]
[1298269.103083]  [] net_rx_action+0x119/0x210
[1298269.103086]  [] __do_softirq+0xb0/0x220
[1298269.103090]  [] ? handle_irq_event+0x4d/0x70
[1298269.103095]  [] call_softirq+0x1c/0x30
[1298269.103100]  [] do_softirq+0x55/0x90
[1298269.103101]  [] irq_exit+0x75/0x80
[1298269.103103]  [] do_IRQ+0x63/0xe0
[1298269.103107]  [] common_interrupt+0x67/0x67
[1298269.103108]  <EOI>  [] ? _raw_spin_unlock_irqrestore+0xf/0x20
[1298269.103113]  [] compaction_alloc+0x361/0x3f0
[1298269.103115]  [] ? pagevec_lru_move_fn+0xd7/0xf0
[1298269.103118]  [] migrate_pages+0xa9/0x470
[1298269.103120]  [] ? perf_trace_mm_compaction_migratepages+0xd0/0xd0
[1298269.103122]  [] compact_zone+0x4cb/0x910
[1298269.103124]  [] __compact_pgdat+0x14b/0x190
[1298269.103125]  [] compact_pgdat+0x2d/0x30
[1298269.103129]  [] ? fragmentation_index+0x19/0x70
[1298269.103131]  [] balance_pgdat+0x6ef/0x710
[1298269.103133]  [] kswapd+0x14a/0x390
[1298269.103136]  [] ? add_wait_queue+0x60/0x60
[1298269.103138]  [] ? balance_pgdat+0x710/0x710
[1298269.103140]  [] kthread+0x93/0xa0
[1298269.103142]  [] kernel_thread_helper+0x4/0x10
[1298269.103144]  [] ? kthread_worker_fn+0x140/0x140
[1298269.103146]  [] ? gs_change+0xb/0xb

The kernel is based on a Fedora 16 kernel and actually has the 3.4.10
patches applied.  I can easily test patches or different kernels.

I'm mostly wondering if there is anything that can be done about these
failures?  It appears that these failures have to do with handling
fragmented IP frames, but the majority of the packets on this machine
should not be fragmented (there are probably some that are).

From a memory management point of view the system has 48GB of RAM, and
typically 44GB of that is page cache.  The dirty pages seem to hover
around 5-6MB and the filesystem/disks don't seem to have any problems
keeping up with writing out the data.

--
Shawn
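One way to tell whether failures like this are fragmentation rather
than an outright shortage of memory is /proc/buddyinfo, which lists the
free blocks per allocation order in each zone.  A sketch; the sample
output line is illustrative, not taken from this machine:

# The first numeric column is order 0 (4KB pages).  The failing request
# above is order:2, i.e. a contiguous 16KB block, so look at the third
# numeric column.
cat /proc/buddyinfo
# Node 0, zone   Normal  11362   2190      0      0      0  ...
# Zeros from order 2 upward under load mean free memory exists but is
# too fragmented for the mlx4 RX frag allocator's atomic request.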
Re: [PATCH 1/3] CodingStyle updates
On Fri, Sep 28, 2007 at 05:32:00PM -0400, Erez Zadok wrote:
> 1. Updates chapter 13 (printing kernel messages) to expand on the use of
>    pr_debug()/pr_info(), what to avoid, and how to hook your debug code with
>    kernel.h.
>
> 2. New chapter 19, branch prediction optimizations, discusses the whole
>    un/likely issue.
>
> Cc: "Kok, Auke" <[EMAIL PROTECTED]>
> Cc: Kyle Moffett <[EMAIL PROTECTED]>
> Cc: Jan Engelhardt <[EMAIL PROTECTED]>
> Cc: Adrian Bunk <[EMAIL PROTECTED]>
> Cc: roel <[EMAIL PROTECTED]>
>
> Signed-off-by: Erez Zadok <[EMAIL PROTECTED]>
> ---
>  Documentation/CodingStyle |   88 +++++++++++++++++++++++++++++++++++++++++++-
>  1 files changed, 86 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/CodingStyle b/Documentation/CodingStyle
> index 7f1730f..00b29e4 100644
> --- a/Documentation/CodingStyle
> +++ b/Documentation/CodingStyle
> @@ -643,8 +643,26 @@ Printing numbers in parentheses (%d) adds no value and should be avoided.
>  There are a number of driver model diagnostic macros in <linux/device.h>
>  which you should use to make sure messages are matched to the right device
>  and driver, and are tagged with the right level: dev_err(), dev_warn(),
> -dev_info(), and so forth.  For messages that aren't associated with a
> -particular device, <linux/kernel.h> defines pr_debug() and pr_info().
> +dev_info(), and so forth.
> +
> +A number of people often like to define their own debugging printf's,
> +wrapping printk's in #ifdef's that get turned on only when subsystem
> +debugging is compiled in (e.g., dprintk, Dprintk, DPRINTK, etc.).  Please
> +don't reinvent the wheel but use existing mechanisms.  For messages that
> +aren't associated with a particular device, <linux/kernel.h> defines
> +pr_debug() and pr_info(); the latter two translate to printk(KERN_DEBUG) and

The latter two?  Since there are only two presented I think there is no
reason to say "latter".

> +printk(KERN_INFO), respectively.  However, to get pr_debug() to actually
> +emit the message, you'll need to turn on DEBUG in your code, which can be
> +done as follows in your subsystem Makefile:
> +
> +ifeq ($(CONFIG_WHATEVER_DEBUG),y)
> +EXTRA_CFLAGS += -DDEBUG
> +endif
> +
> +In this way, you can create a Kconfig parameter to turn on debugging at
> +compile time, which will also turn on DEBUG, to enable pr_debug() to emit
> +actual messages; conversely, when CONFIG_WHATEVER_DEBUG is off, DEBUG is
> +off, and pr_debug() will display nothing.
>
>  Coming up with good debugging messages can be quite a challenge; and once
>  you have them, they can be a huge help for remote troubleshooting.  Such
> @@ -779,6 +797,69 @@ includes markers for indentation and mode configuration.  People may use their
>  own custom mode, or may have some other magic method for making indentation
>  work correctly.
>
> +		Chapter 19: branch prediction optimizations
> +
> +The kernel includes macros called likely() and unlikely(), which can be used
> +as hints to the compiler to optimize branch prediction.  They operate by
> +asking gcc to shuffle the code around so that the more favorable outcome
> +executes linearly, avoiding a JMP instruction; this can improve cache
> +pipeline efficiency.  For technical details how these macros work, see the
> +References section at the end of this document.
> +
> +An example use of this as as follows:
                             ^^
> +
> +	ptr = kmalloc(size, GFP_KERNEL);
> +	if (unlikely(!ptr))
> +		...
> +
> +or
> +	err = some_function(...);
> +	if (likely(!err))
> +		...

--
Shawn
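As an aside for anyone experimenting with the above: the Kconfig knob
isn't strictly required to see pr_debug() output.  A sketch, assuming
an out-of-tree module build and a kbuild that picks up EXTRA_CFLAGS
from the make command line (both assumptions on my part, not part of
Erez's patch; the module name is hypothetical):

# Build with DEBUG defined so pr_debug() actually emits; without
# -DDEBUG the pr_debug() calls compile away to nothing.
make -C /lib/modules/$(uname -r)/build M=$PWD EXTRA_CFLAGS=-DDEBUG modules
insmod ./mymodule.ko   # hypothetical module name
dmesg | tail           # pr_debug() output lands here at KERN_DEBUG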
Re: [PATCH][RFC] 4K stacks default, not a debug thing any more...?
On Tue, Jul 17, 2007 at 02:57:45AM +0200, Rene Herman wrote:
> True enough.  I'm rather wondering though why RHEL is shipping with it
> if it's a _real_ problem.  Scribbling junk all over kernel memory would
> be the kind of thing I'd imagine you'd mightily piss off enterprise
> customers with.  But well, sure, that rather quickly becomes a
> self-referential argument I guess.

I can't speak for Fedora, but RHEL disables XFS in their kernel likely
because it is known to cause problems with 4K stacks.

> Well, no.  "oldconfig" works fine, and other than that, all failure
> modes I've heard about also in this thread are MD/LVM/XFS.  This is
> extremely widely tested stuff in at least Fedora and RHEL.

Again, don't assume that because Fedora and RHEL have 4K stacks,
MD/LVM/XFS is widely tested with them.  Additionally, I think I should
point out that the problems pointed out so far are not the only problem
areas with 4K stacks.  There are out of tree drivers to consider as
well, and use cases like ndiswrapper.