3.14 stable regression don't remove from shrink list in select_collect()

2016-01-25 Thread Shawn Bohrer
I recently updated some machines to 3.14.58 and they reliably get soft
lockups.  Sometimes the soft lockup recovers and sometimes it does
not.  I've bisected this on the 3.14 stable branch and arrived at:

c214cb82cdc744225d85899fc138251527f75fff don't remove from shrink
list in select_collect()

Reverting this commit on top of 3.14.58, plus adding back
dentry_lru_del() (which a later commit removed), resolves the issue for
me.  I've included a patch at the bottom with the revert.
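
For what it's worth, the shape of the lockup (a CPU retrying
_raw_spin_trylock under dentry_kill() without ever scheduling, per the
traces below) can be mimicked in userspace with a plain trylock retry
loop.  This is only an analogue to show the pattern, not the fs/dcache.c
code, and the file name in the build command is arbitrary:

/* trylock-retry livelock sketch; build with: cc -pthread spin.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_spinlock_t lock;

static void *holder(void *arg)
{
        (void)arg;
        /* stands in for the other CPU that owns the lock and never lets go */
        pthread_spin_lock(&lock);
        pause();
        return NULL;
}

int main(void)
{
        pthread_t t;
        unsigned long spins = 0;

        pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
        pthread_create(&t, NULL, holder, NULL);
        sleep(1);               /* let the holder win the lock first */

        /* like the retry in the trace: trylock, fail, try again, never sleep */
        while (pthread_spin_trylock(&lock) != 0) {
                if (++spins % 100000000UL == 0)
                        printf("still spinning, %lu tries\n", spins);
        }
        return 0;
}

Run it and the process just sits at 100% CPU forever, which is
essentially what the watchdog is reporting below.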

So far this issue has been easy to reproduce for me so I'm happy to
try other patches for further debugging or testing.  I have not yet
tried the latest upstream kernel to see if it also has the issue.
Below are the soft lockup messages:


[   76.423941] BUG: soft lockup - CPU#10 stuck for 23s! [systemd-udevd:3613]
[   76.538222] Modules linked in: vfat fat usb_storage 8021q mrp garp stp llc 
dell_rbu mpt2sas raid_class scsi_transport_sas mptctl mptbase sfc_aoe(O) ext4 
jbd2 mbcache coretemp crc32c_intel aesni_intel aes_x86_64 ablk_helper cryptd 
glue_helper sfc(O) ptp pps_core mdio lrw joydev hwmon i2c_algo_bit bnx2 
gf128mul ipmi_devintf serio_raw iTCO_wdt ses i2c_core ipmi_si ipmi_msghandler 
wmi enclosure iTCO_vendor_support microcode pcspkr dcdbas ioatdma ib_ipoib 
rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa dca ib_mad ib_core 
lpc_ich mfd_core ib_addr nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc 
ata_generic pata_acpi ata_piix libata megaraid_sas ehci_pci ehci_hcd uhci_hcd 
ipv6 autofs4
[   76.538260] CPU: 10 PID: 3613 Comm: systemd-udevd Tainted: G   O 
3.14.50-00038-gc214cb8 #22
[   76.538261] Hardware name: Dell Inc. PowerEdge R610/0F0XJ6, BIOS 6.3.0 
07/24/2012
[   76.538262] task: 881806e45280 ti: 8817823d6000 task.ti: 
8817823d6000
[   76.538263] RIP: 0010:[]  [] 
_raw_spin_trylock+0x2c/0x40
[   76.538270] RSP: 0018:8817823d7b40  EFLAGS: 0246
[   76.538271] RAX: dd75dd75 RBX: 8817823d7ae8 RCX: dd76dd75
[   76.538272] RDX: dd75dd75 RSI:  RDI: 88178509e638
[   76.538273] RBP: 8817823d7b40 R08: 88178354a1e0 R09: 000180220001
[   76.538274] R10: 811c9b96 R11: ea005e0d5280 R12: 880b7f097198
[   76.538275] R13: 8817774d2c30 R14: 88178354a1e0 R15: 88178354a1e0
[   76.538276] FS:  7f20b4868880() GS:88180f2a() 
knlGS:
[   76.538277] CS:  0010 DS:  ES:  CR0: 80050033
[   76.538278] CR2: 7f20b61e6f58 CR3: 00180652f000 CR4: 07e0
[   76.538279] Stack:
[   76.538280]  8817823d7b70 8116eff6 880b7ac5ef80 
8817823d7bc0
[   76.538283]  880b7ac5ef00 880b7ac5ef00 8817823d7ba8 
8116f459
[   76.538285]  880b7ac5ef80 8817823d7bc0 880b7ac5ecc0 
000c
[   76.538287] Call Trace:
[   76.538292]  [] dentry_kill+0x36/0x290
[   76.538294]  [] shrink_dentry_list+0x79/0xd0
[   76.538296]  [] check_submounts_and_drop+0x74/0xa0
[   76.538301]  [] kernfs_dop_revalidate+0x5c/0xd0
[   76.538306]  [] lookup_fast+0x26d/0x2c0
[   76.538307]  [] link_path_walk+0x1d9/0x890
[   76.538311]  [] ? kmem_cache_alloc+0x31/0x140
[   76.538313]  [] ? kernfs_name_hash+0x17/0xd0
[   76.538315]  [] ? __mutex_unlock_slowpath+0x16/0x40
[   76.538317]  [] path_lookupat+0x5b/0x770
[   76.538318]  [] ? __d_free+0x35/0x40
[   76.538320]  [] ? dentry_kill+0x215/0x290
[   76.538321]  [] ? kmem_cache_alloc+0x31/0x140
[   76.538323]  [] ? getname_flags+0x2c/0x120
[   76.538325]  [] filename_lookup.isra.50+0x26/0x60
[   76.538327]  [] user_path_at_empty+0x54/0x90
[   76.538329]  [] ? final_putname+0x22/0x50
[   76.538330]  [] ? user_path_at_empty+0x5f/0x90
[   76.538332]  [] user_path_at+0x11/0x20
[   76.538334]  [] vfs_fstatat+0x50/0xa0
[   76.538336]  [] SYSC_newlstat+0x22/0x40
[   76.538338]  [] ? SyS_readlink+0x4c/0x110
[   76.538339]  [] SyS_newlstat+0xe/0x10
[   76.538343]  [] system_call_fastpath+0x16/0x1b
[   76.538344] Code: 66 66 66 90 55 48 89 e5 8b 17 89 d0 c1 e8 10 66 39 c2 74 
0b 31 c0 5d c3 0f 1f 80 00 00 00 00 8d 8a 00 00 01 00 89 d0 f0 0f b1 0f <39> d0 
75 e5 b8 01 00 00 00 5d c3 66 0f 1f 84 00 00 00 00 00 66 
[  104.426665] BUG: soft lockup - CPU#10 stuck for 23s! [systemd-udevd:3613]
[  104.537859] Modules linked in: vfat fat usb_storage 8021q mrp garp stp llc 
dell_rbu mpt2sas raid_class scsi_transport_sas mptctl mptbase sfc_aoe(O) ext4 
jbd2 mbcache coretemp crc32c_intel aesni_intel aes_x86_64 ablk_helper cryptd 
glue_helper sfc(O) ptp pps_core mdio lrw joydev hwmon i2c_algo_bit bnx2 
gf128mul ipmi_devintf serio_raw iTCO_wdt ses i2c_core ipmi_si ipmi_msghandler 
wmi enclosure iTCO_vendor_support microcode pcspkr dcdbas ioatdma ib_ipoib 
rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa dca ib_mad ib_core 
lpc_ich mfd_core ib_addr nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc 
ata_generic pata_acpi ata_piix libata megaraid_sas ehci_pci ehci_hcd uhci_hcd 
ipv6 autofs4
[  104.537895] CPU: 10 PID: 


Re: NFS Freezer and stuck tasks

2015-05-01 Thread Shawn Bohrer
On Fri, May 01, 2015 at 05:10:34PM -0400, Benjamin Coddington wrote:
> On Fri, 1 May 2015, Benjamin Coddington wrote:
> 
> > On Wed, 4 Mar 2015, Shawn Bohrer wrote:
> >
> > > Hello,
> > >
> > > We're using the Linux cgroup Freezer on some machines that use NFS and
> > > have run into what appears to be a bug where frozen tasks are blocking
> > > running tasks and preventing them from completing.  On one of our
> > > machines which happens to be running an older 3.10.46 kernel we have
> > > frozen some of the tasks on the system using the cgroup Freezer.  We
> > > also have a separate set of tasks which are NOT frozen which are stuck
> > > trying to open some files on NFS.
> > >
> > > Looking at the frozen tasks there are several that have the following
> > > stack:
> > >
> > > [] rpc_wait_bit_killable+0x35/0x80
> > > [] __rpc_wait_for_completion_task+0x2d/0x30
> > > [] nfs4_run_open_task+0x11d/0x170
> > > [] _nfs4_open_and_get_state+0x53/0x260
> > > [] nfs4_do_open+0x121/0x400
> > > [] nfs4_atomic_open+0x31/0x50
> > > [] nfs4_file_open+0xac/0x180
> > > [] do_dentry_open.isra.19+0x1ee/0x280
> > > [] finish_open+0x1e/0x30
> > > [] do_last.isra.64+0x2c2/0xc40
> > > [] path_openat.isra.65+0x2c9/0x490
> > > [] do_filp_open+0x38/0x80
> > > [] do_sys_open+0xe4/0x1c0
> > > [] SyS_open+0x1e/0x20
> > > [] system_call_fastpath+0x16/0x1b
> > > [] 0x
> > >
> > > Here it looks like we are waiting in a wait queue inside
> > > rpc_wait_bit_killable() for RPC_TASK_ACTIVE.
> > >
> > > And there is a single task with a stack that looks like the following:
> > >
> > > [] __refrigerator+0x55/0x150
> > > [] rpc_wait_bit_killable+0x66/0x80
> > > [] __rpc_wait_for_completion_task+0x2d/0x30
> > > [] nfs4_run_open_task+0x11d/0x170
> > > [] _nfs4_open_and_get_state+0x53/0x260
> > > [] nfs4_do_open+0x121/0x400
> > > [] nfs4_atomic_open+0x31/0x50
> > > [] nfs4_file_open+0xac/0x180
> > > [] do_dentry_open.isra.19+0x1ee/0x280
> > > [] finish_open+0x1e/0x30
> > > [] do_last.isra.64+0x2c2/0xc40
> > > [] path_openat.isra.65+0x2c9/0x490
> > > [] do_filp_open+0x38/0x80
> > > [] do_sys_open+0xe4/0x1c0
> > > [] SyS_open+0x1e/0x20
> > > [] system_call_fastpath+0x16/0x1b
> > > [] 0x
> > >
> > > This looks similar but the different offset into
> > > rpc_wait_bit_killable() shows that we have returned from the
> > > schedule() call in freezable_schedule() and are now blocked in
> > > __refrigerator() inside freezer_count()
> > >
> > > Similarly if you look at the tasks that are NOT frozen but are stuck
> > > opening a NFS file, they also have the following stack showing they are
> > > waiting in the wait queue for RPC_TASK_ACTIVE.
> > >
> > > [] rpc_wait_bit_killable+0x35/0x80
> > > [] __rpc_wait_for_completion_task+0x2d/0x30
> > > [] nfs4_run_open_task+0x11d/0x170
> > > [] _nfs4_open_and_get_state+0x53/0x260
> > > [] nfs4_do_open+0x121/0x400
> > > [] nfs4_atomic_open+0x31/0x50
> > > [] nfs4_file_open+0xac/0x180
> > > [] do_dentry_open.isra.19+0x1ee/0x280
> > > [] finish_open+0x1e/0x30
> > > [] do_last.isra.64+0x2c2/0xc40
> > > [] path_openat.isra.65+0x2c9/0x490
> > > [] do_filp_open+0x38/0x80
> > > [] do_sys_open+0xe4/0x1c0
> > > [] SyS_open+0x1e/0x20
> > > [] system_call_fastpath+0x16/0x1b
> > > [] 0x
> > >
> > > We have hit this a couple of times now and know that if we THAW all of
> > > the frozen tasks that running tasks will unwedge and finish.
> > >
> > > Additionally we have also tried thawing the single task that is frozen
> > > in __refrigerator() inside rpc_wait_bit_killable().  This usually
> > > results in different frozen task entering the __refrigerator() state
> > > inside rpc_wait_bit_killable().  It looks like each one of those tasks
> > > must wake up another letting it progress.  Again if you thaw enough of
> > > the frozen tasks eventually everything unwedges and everything
> > > completes.
> > >
> > > I've looked through the 3.10 stable patches since 3.10.46 and don't
> > > see anything that looks like it addresses this.  Does anyone have any
> > > idea what might be going on here, and what the fix might be?
> > >
> > > Thanks,
> > > Shawn
> >
> > Hi Shawn, just started looking at this myself, and as Frank Sorensen points
> > out in https://bugzilla.redhat.com/show_bug.cgi?id=1209143 the problem is
> > that a task takes the xprt lock and then ends up in the refrigerator
> > effectively blocking other tasks from proceeding.
> >
> > Jeff, any suggestions on how to proceed here?
>
> Sorry for the noise, and self


Re: HugePages_Rsvd leak

2015-04-08 Thread Shawn Bohrer
On Wed, Apr 08, 2015 at 02:16:05PM -0700, Mike Kravetz wrote:
> On 04/08/2015 09:15 AM, Shawn Bohrer wrote:
> >I've noticed on a number of my systems that after shutting down my
> >application that uses huge pages that I'm left with some pages still
> >in HugePages_Rsvd.  It is possible that I still have something using
> >huge pages that I'm not aware of but so far my attempts to find
> >anything using huge pages have failed.  I've run some simple tests
> >using map_hugetlb.c from the kernel source and can see that pages that
> >have been reserved but not allocated still show up in
> >/proc/<pid>/smaps and /proc/<pid>/numa_maps.  Are there any cases
> >where this is not true?
> 
> Just a quick question.  Are you using hugetlb filesystem(s)?

I can't say for sure that nothing is using hugetlbfs. It is mounted
but as far as I can tell on the affected system(s) it is empty.

[root@dev106 ~]# grep hugetlbfs /proc/mounts
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
[root@dev106 ~]# ls -al /dev/hugepages/
total 0
drwxr-xr-x  2 root root0 Apr  8 16:22 .
drwxr-xr-x 16 root root 4360 Apr  8 03:53 ..
[root@dev106 ~]# lsof | grep hugepages

> If so, you might want to take a look at files residing in the
> filesystem(s).  As an experiment, I had a program do a simple
> mmap() of a file in a hugetlb filesystem.  The program just
> created the mapping, and did not actually fault/allocate any
> huge pages.  The result was the reservation (HugePages_Rsvd)
> of sufficient huge pages to cover the mapping.  When the program
> exited, the reservations remained.  If I remove (unlink) the
> file the reservations will be removed.

That makes sense but I don't think it is the issue here.
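
For completeness, a minimal sketch of the experiment Mike describes,
assuming hugetlbfs is mounted at /dev/hugepages as shown above; the file
name is made up:

/* map a file in hugetlbfs without touching it, then exit without unlink() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define LENGTH (2UL*1024*1024)

int main(void)
{
        int fd = open("/dev/hugepages/rsvd-test", O_CREAT | O_RDWR, 0600);
        if (fd < 0) {
                perror("open");
                exit(1);
        }

        /* the mapping alone reserves a huge page; nothing is faulted in */
        void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) {
                perror("mmap");
                exit(1);
        }

        /* exiting drops the mapping, but per Mike's description HugePages_Rsvd
         * stays elevated until the file is removed from /dev/hugepages */
        return 0;
}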

Thanks,
Shawn


Re: HugePages_Rsvd leak

2015-04-08 Thread Shawn Bohrer
On Wed, Apr 08, 2015 at 12:29:03PM -0700, Davidlohr Bueso wrote:
> On Wed, 2015-04-08 at 11:15 -0500, Shawn Bohrer wrote:
> > AnonHugePages:241664 kB
> > HugePages_Total: 512
> > HugePages_Free:  512
> > HugePages_Rsvd:  384
> > HugePages_Surp:0
> > Hugepagesize:   2048 kB
> > 
> > So here I have 384 pages reserved and I can't find anything that is
> > using them. 
> 
> The output clearly shows all available hugepages are free, Why are you
> assuming that reserved implies allocated/in use? This is not true,
> please read one of the millions of docs out there -- you can start with:
> https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
 
As that fine document states:

HugePages_Rsvd  is short for "reserved," and is the number of huge pages for
which a commitment to allocate from the pool has been made,
but no allocation has yet been made.  Reserved huge pages
guarantee that an application will be able to allocate a
huge page from the pool of huge pages at fault time.

Thus in my example above while I have 512 pages free 384 are reserved
and therefore if a new application comes along it can only reserve/use
the remaining 128 pages.
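
(As a quick sanity check of that arithmetic, here is a trivial program
that just re-reads the two counters; with the numbers above it prints
128:)

#include <stdio.h>

int main(void)
{
        char line[256];
        unsigned long hp_free = 0, hp_rsvd = 0;
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f)
                return 1;
        while (fgets(line, sizeof(line), f)) {
                sscanf(line, "HugePages_Free: %lu", &hp_free);
                sscanf(line, "HugePages_Rsvd: %lu", &hp_rsvd);
        }
        fclose(f);

        /* pages a brand new mapping can still get = free - reserved */
        printf("free=%lu rsvd=%lu usable by new mappings=%lu\n",
               hp_free, hp_rsvd, hp_free - hp_rsvd);
        return 0;
}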

For example:

[scratch]$ grep Huge /proc/meminfo 
AnonHugePages: 0 kB
HugePages_Total:   1
HugePages_Free:1
HugePages_Rsvd:0
HugePages_Surp:0
Hugepagesize:   2048 kB

[scratch]$ cat map_hugetlb.c
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

#define LENGTH (2UL*1024*1024)
#define PROTECTION (PROT_READ | PROT_WRITE)
#define ADDR (void *)(0x0UL)
#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB)

int main(void)
{
        void *addr;

        addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, 0, 0);
        if (addr == MAP_FAILED) {
                perror("mmap");
                exit(1);
        }

        getchar();

        munmap(addr, LENGTH);
        return 0;
}

[scratch]$ make map_hugetlb
cc map_hugetlb.c   -o map_hugetlb

[scratch]$ ./map_hugetlb &
[1] 7359
[1]+  Stopped ./map_hugetlb

[scratch]$ grep Huge /proc/meminfo 
AnonHugePages: 0 kB
HugePages_Total:   1
HugePages_Free:1
HugePages_Rsvd:1
HugePages_Surp:0
Hugepagesize:   2048 kB

[scratch]$ ./map_hugetlb
mmap: Cannot allocate memory


As you can see I still have 1 huge page free but that one huge page is
reserved by PID 7359.  If I then try to run a new map_hugetlb process
the mmap fails because even though I have 1 page free it is reserved.

Furthermore we can find that 7359 has that page in the following ways:

[scratch]$ sudo grep "KernelPageSize:.*2048" /proc/*/smaps
/proc/7359/smaps:KernelPageSize: 2048 kB
[scratch]$ sudo grep "VmFlags:.*ht" /proc/*/smaps
/proc/7359/smaps:VmFlags: rd wr mr mw me de ht sd
[scratch]$ sudo grep -w huge /proc/*/numa_maps
/proc/7359/numa_maps:7f323300 default file=/anon_hugepage\040(deleted) huge

Which leads back to my original question.  I have machines that have a
non-zero HugePages_Rsvd count but I cannot find any processes that
seem to have those pages reserved using the three methods shown above.
Is there some other way to identify which process has those pages
reserved?  Or is there possibly a leak which is failing to decrement
the reserve count?

Thanks,
Shawn


HugePages_Rsvd leak

2015-04-08 Thread Shawn Bohrer
I've noticed on a number of my systems that after shutting down my
application that uses huge pages that I'm left with some pages still
in HugePages_Rsvd.  It is possible that I still have something using
huge pages that I'm not aware of but so far my attempts to find
anything using huge pages have failed.  I've run some simple tests
using map_hugetlb.c from the kernel source and can see that pages that
have been reserved but not allocated still show up in
/proc/<pid>/smaps and /proc/<pid>/numa_maps.  Are there any cases
where this is not true?

[root@dev106 ~]# grep HugePages /proc/meminfo
AnonHugePages:241664 kB
HugePages_Total: 512
HugePages_Free:  512
HugePages_Rsvd:  384
HugePages_Surp:0
Hugepagesize:   2048 kB
[root@dev106 ~]# grep "KernelPageSize:.*2048" /proc/*/smaps
[root@dev106 ~]# grep "VmFlags:.*ht" /proc/*/smaps
[root@dev106 ~]# grep huge /proc/*/numa_maps
[root@dev106 ~]# grep Huge /proc/meminfo
AnonHugePages:241664 kB
HugePages_Total: 512
HugePages_Free:  512
HugePages_Rsvd:  384
HugePages_Surp:0
Hugepagesize:   2048 kB

So here I have 384 pages reserved and I can't find anything that is
using them.  This is on a machine running 3.14.33.  I can possibly try
running a newer kernel if there is a belief that this has been fixed.
I'm also happy to provide more information or try some debug patches
if there are ideas on how to track this down.  I'm not entirely sure
how hard this is to reproduce but nearly every machine I've looked at
is in this state so it must not be too hard.
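
In case it is useful, here is a rough sketch that automates the first of
the checks above by walking /proc/<pid>/smaps and flagging hugetlb-sized
mappings.  It is just a wrapper around the same grep, and on the machine
above it would presumably come back empty as well:

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>

int main(void)
{
        DIR *proc = opendir("/proc");
        struct dirent *de;
        char path[300], line[512];
        unsigned long ksize;

        if (!proc) {
                perror("opendir");
                return 1;
        }
        while ((de = readdir(proc)) != NULL) {
                if (!isdigit((unsigned char)de->d_name[0]))
                        continue;                /* only look at PID directories */
                snprintf(path, sizeof(path), "/proc/%s/smaps", de->d_name);
                FILE *f = fopen(path, "r");
                if (!f)
                        continue;                /* process may have exited */
                while (fgets(line, sizeof(line), f)) {
                        /* normal mappings report 4 kB here; hugetlb ones report 2048 kB or more */
                        if (sscanf(line, "KernelPageSize: %lu kB", &ksize) == 1 && ksize > 4) {
                                printf("pid %s has a hugetlb mapping (%lu kB pages)\n",
                                       de->d_name, ksize);
                                break;
                        }
                }
                fclose(f);
        }
        closedir(proc);
        return 0;
}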

Thanks,
Shawn


NFS Freezer and stuck tasks

2015-03-04 Thread Shawn Bohrer
Hello,

We're using the Linux cgroup Freezer on some machines that use NFS and
have run into what appears to be a bug where frozen tasks are blocking
running tasks and preventing them from completing.  On one of our
machines which happens to be running an older 3.10.46 kernel we have
frozen some of the tasks on the system using the cgroup Freezer.  We
also have a separate set of tasks which are NOT frozen which are stuck
trying to open some files on NFS.

Looking at the frozen tasks there are several that have the following
stack:

[814fd055] rpc_wait_bit_killable+0x35/0x80
[814fd01d] __rpc_wait_for_completion_task+0x2d/0x30
[811dce5d] nfs4_run_open_task+0x11d/0x170
[811de7a3] _nfs4_open_and_get_state+0x53/0x260
[811e12d1] nfs4_do_open+0x121/0x400
[811e15e1] nfs4_atomic_open+0x31/0x50
[811f02dc] nfs4_file_open+0xac/0x180
[811479be] do_dentry_open.isra.19+0x1ee/0x280
[81147b3e] finish_open+0x1e/0x30
[811578d2] do_last.isra.64+0x2c2/0xc40
[81158519] path_openat.isra.65+0x2c9/0x490
[81158c38] do_filp_open+0x38/0x80
[81148cd4] do_sys_open+0xe4/0x1c0
[81148dce] SyS_open+0x1e/0x20
[8153e719] system_call_fastpath+0x16/0x1b
[] 0x

Here it looks like we are waiting in a wait queue inside
rpc_wait_bit_killable() for RPC_TASK_ACTIVE.

And there is a single task with a stack that looks like the following:

[8107dc05] __refrigerator+0x55/0x150
[814fd086] rpc_wait_bit_killable+0x66/0x80
[814fd01d] __rpc_wait_for_completion_task+0x2d/0x30
[811dce5d] nfs4_run_open_task+0x11d/0x170
[811de7a3] _nfs4_open_and_get_state+0x53/0x260
[811e12d1] nfs4_do_open+0x121/0x400
[811e15e1] nfs4_atomic_open+0x31/0x50
[811f02dc] nfs4_file_open+0xac/0x180
[811479be] do_dentry_open.isra.19+0x1ee/0x280
[81147b3e] finish_open+0x1e/0x30
[811578d2] do_last.isra.64+0x2c2/0xc40
[81158519] path_openat.isra.65+0x2c9/0x490
[81158c38] do_filp_open+0x38/0x80
[81148cd4] do_sys_open+0xe4/0x1c0
[81148dce] SyS_open+0x1e/0x20
[8153e719] system_call_fastpath+0x16/0x1b
[] 0x

This looks similar, but the different offset into
rpc_wait_bit_killable() shows that we have returned from the
schedule() call in freezable_schedule() and are now blocked in
__refrigerator() inside freezer_count().
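
To make that concrete, the 3.10-era helpers look roughly like the
following.  This is paraphrased from memory from include/linux/freezer.h,
so check the real source; it is only meant to show where the two stacks
diverge:

static inline void freezable_schedule(void)
{
        freezer_do_not_count();   /* set PF_FREEZER_SKIP so the freezer won't wait for us */
        schedule();               /* first stack above: sleeping here until RPC_TASK_ACTIVE clears */
        freezer_count();
}

static inline void freezer_count(void)
{
        current->flags &= ~PF_FREEZER_SKIP;
        try_to_freeze();          /* second stack above: parked in __refrigerator() from here,
                                   * still in the middle of the NFS open path */
}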

Similarly if you look at the tasks that are NOT frozen but are stuck
opening an NFS file, they also have the following stack showing they are
waiting in the wait queue for RPC_TASK_ACTIVE.

[814fd055] rpc_wait_bit_killable+0x35/0x80
[814fd01d] __rpc_wait_for_completion_task+0x2d/0x30
[811dce5d] nfs4_run_open_task+0x11d/0x170
[811de7a3] _nfs4_open_and_get_state+0x53/0x260
[811e12d1] nfs4_do_open+0x121/0x400
[811e15e1] nfs4_atomic_open+0x31/0x50
[811f02dc] nfs4_file_open+0xac/0x180
[811479be] do_dentry_open.isra.19+0x1ee/0x280
[81147b3e] finish_open+0x1e/0x30
[811578d2] do_last.isra.64+0x2c2/0xc40
[81158519] path_openat.isra.65+0x2c9/0x490
[81158c38] do_filp_open+0x38/0x80
[81148cd4] do_sys_open+0xe4/0x1c0
[81148dce] SyS_open+0x1e/0x20
[8153e719] system_call_fastpath+0x16/0x1b
[] 0x

We have hit this a couple of times now and know that if we THAW all of
the frozen tasks that running tasks will unwedge and finish.

Additionally we have also tried thawing the single task that is frozen
in __refrigerator() inside rpc_wait_bit_killable().  This usually
results in a different frozen task entering the __refrigerator() state
inside rpc_wait_bit_killable().  It looks like each one of those tasks
must wake up another letting it progress.  Again if you thaw enough of
the frozen tasks eventually everything unwedges and everything
completes.

I've looked through the 3.10 stable patches since 3.10.46 and don't
see anything that looks like it addresses this.  Does anyone have any
idea what might be going on here, and what the fix might be?

Thanks,
Shawn


Re: [PATCH v3] ib_umem_release should decrement mm->pinned_vm from ib_umem_get

2014-09-15 Thread Shawn Bohrer
On Wed, Sep 03, 2014 at 12:13:57PM -0500, Shawn Bohrer wrote:
> From: Shawn Bohrer 
> 
> In debugging an application that receives -ENOMEM from ib_reg_mr() I
> found that ib_umem_get() can fail because the pinned_vm count has
> wrapped causing it to always be larger than the lock limit even with
> RLIMIT_MEMLOCK set to RLIM_INFINITY.
> 
> The wrapping of pinned_vm occurs because the process that calls
> ib_reg_mr() will have its mm->pinned_vm count incremented.  Later a
> different process with a different mm_struct than the one that allocated
> the ib_umem struct ends up releasing it which results in decrementing
> the new processes mm->pinned_vm count past zero and wrapping.
> 
> I'm not entirely sure what circumstances cause a different process to
> release the ib_umem than the one that allocated it but the kernel stack
> trace of the freeing process from my situation looks like the following:
> 
> Call Trace:
>  [] dump_stack+0x19/0x1b
>  [] ib_umem_release+0x1f5/0x200 [ib_core]
>  [] mlx4_ib_destroy_qp+0x241/0x440 [mlx4_ib]
>  [] ib_destroy_qp+0x12c/0x170 [ib_core]
>  [] ib_uverbs_close+0x259/0x4e0 [ib_uverbs]
>  [] __fput+0xba/0x240
>  [] fput+0xe/0x10
>  [] task_work_run+0xc4/0xe0
>  [] do_notify_resume+0x95/0xa0
>  [] int_signal+0x12/0x17
> 
> The following patch fixes the issue by storing the pid struct of the
> process that calls ib_umem_get() so that ib_umem_release and/or
> ib_umem_account() can properly decrement the pinned_vm count of the
> correct mm_struct.
> 
> Signed-off-by: Shawn Bohrer 
> ---
> v3 changes:
> * Fix resource leak with put_task_struct()
> v2 changes:
> * Updated to use get_task_pid to avoid keeping a reference to the mm
> 
>  drivers/infiniband/core/umem.c |   19 +--
>  include/rdma/ib_umem.h |1 +
>  2 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index a3a2e9c..df0c4f6 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -105,6 +105,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
> unsigned long addr,
>   umem->length= size;
>   umem->offset= addr & ~PAGE_MASK;
>   umem->page_size = PAGE_SIZE;
> + umem->pid   = get_task_pid(current, PIDTYPE_PID);
>   /*
>* We ask for writable memory if any access flags other than
>* "remote read" are set.  "Local write" and "remote write"
> @@ -198,6 +199,7 @@ out:
>   if (ret < 0) {
>   if (need_release)
>   __ib_umem_release(context->device, umem, 0);
> + put_pid(umem->pid);
>   kfree(umem);
>   } else
>   current->mm->pinned_vm = locked;
> @@ -230,15 +232,19 @@ void ib_umem_release(struct ib_umem *umem)
>  {
>   struct ib_ucontext *context = umem->context;
>   struct mm_struct *mm;
> + struct task_struct *task;
>   unsigned long diff;
>  
>   __ib_umem_release(umem->context->device, umem, 1);
>  
> - mm = get_task_mm(current);
> - if (!mm) {
> - kfree(umem);
> - return;
> - }
> + task = get_pid_task(umem->pid, PIDTYPE_PID);
> + put_pid(umem->pid);
> + if (!task)
> + goto out;
> + mm = get_task_mm(task);
> + put_task_struct(task);
> + if (!mm)
> + goto out;
>  
>   diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT;
>  
> @@ -262,9 +268,10 @@ void ib_umem_release(struct ib_umem *umem)
>   } else
> down_write(&mm->mmap_sem);
>  
> - current->mm->pinned_vm -= diff;
> + mm->pinned_vm -= diff;
> up_write(&mm->mmap_sem);
>   mmput(mm);
> +out:
>   kfree(umem);
>  }
>  EXPORT_SYMBOL(ib_umem_release);
> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> index 1ea0b65..a2bf41e 100644
> --- a/include/rdma/ib_umem.h
> +++ b/include/rdma/ib_umem.h
> @@ -47,6 +47,7 @@ struct ib_umem {
>   int writable;
>   int hugetlb;
>   struct work_struct  work;
> + struct pid *pid;
>   struct mm_struct   *mm;
>   unsigned long   diff;
>   struct sg_table sg_head;
> -- 
> 1.7.7.6

Hi Roland,

I haven't seen any additional review feedback, and it doesn't appear
that this patch has made its way into any of your infiniband trees
yet.  Is there anything holding this up?

We've been running this patch on top of 3.10 since I originally sent
this and have not encountered any issues so far.

--
Shawn
 


[PATCH v3] ib_umem_release should decrement mm->pinned_vm from ib_umem_get

2014-09-03 Thread Shawn Bohrer
From: Shawn Bohrer 

In debugging an application that receives -ENOMEM from ib_reg_mr() I
found that ib_umem_get() can fail because the pinned_vm count has
wrapped causing it to always be larger than the lock limit even with
RLIMIT_MEMLOCK set to RLIM_INFINITY.

The wrapping of pinned_vm occurs because the process that calls
ib_reg_mr() will have its mm->pinned_vm count incremented.  Later a
different process with a different mm_struct than the one that allocated
the ib_umem struct ends up releasing it which results in decrementing
the new processes mm->pinned_vm count past zero and wrapping.
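
To make the failure mode concrete, the underflow amounts to the
following standalone illustration (not part of the patch; the limit
value is made up):

#include <stdio.h>

int main(void)
{
        unsigned long pinned_vm = 0;       /* the "wrong" mm, which never pinned anything */
        unsigned long lock_limit = 16384;  /* stand-in for RLIMIT_MEMLOCK in pages */

        pinned_vm -= 100;   /* release accounted against the wrong mm_struct */

        printf("pinned_vm = %lu\n", pinned_vm);                  /* wrapped to a huge value */
        printf("over limit = %d\n", pinned_vm + 1 > lock_limit); /* 1 from now on, so ib_reg_mr() fails */
        return 0;
}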

I'm not entirely sure what circumstances cause a different process to
release the ib_umem than the one that allocated it but the kernel stack
trace of the freeing process from my situation looks like the following:

Call Trace:
 [] dump_stack+0x19/0x1b
 [] ib_umem_release+0x1f5/0x200 [ib_core]
 [] mlx4_ib_destroy_qp+0x241/0x440 [mlx4_ib]
 [] ib_destroy_qp+0x12c/0x170 [ib_core]
 [] ib_uverbs_close+0x259/0x4e0 [ib_uverbs]
 [] __fput+0xba/0x240
 [] fput+0xe/0x10
 [] task_work_run+0xc4/0xe0
 [] do_notify_resume+0x95/0xa0
 [] int_signal+0x12/0x17

The following patch fixes the issue by storing the pid struct of the
process that calls ib_umem_get() so that ib_umem_release and/or
ib_umem_account() can properly decrement the pinned_vm count of the
correct mm_struct.

Signed-off-by: Shawn Bohrer 
---
v3 changes:
* Fix resource leak with put_task_struct()
v2 changes:
* Updated to use get_task_pid to avoid keeping a reference to the mm

 drivers/infiniband/core/umem.c |   19 +--
 include/rdma/ib_umem.h |1 +
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a3a2e9c..df0c4f6 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -105,6 +105,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
unsigned long addr,
umem->length= size;
umem->offset= addr & ~PAGE_MASK;
umem->page_size = PAGE_SIZE;
+   umem->pid   = get_task_pid(current, PIDTYPE_PID);
/*
 * We ask for writable memory if any access flags other than
 * "remote read" are set.  "Local write" and "remote write"
@@ -198,6 +199,7 @@ out:
if (ret < 0) {
if (need_release)
__ib_umem_release(context->device, umem, 0);
+   put_pid(umem->pid);
kfree(umem);
} else
current->mm->pinned_vm = locked;
@@ -230,15 +232,19 @@ void ib_umem_release(struct ib_umem *umem)
 {
struct ib_ucontext *context = umem->context;
struct mm_struct *mm;
+   struct task_struct *task;
unsigned long diff;
 
__ib_umem_release(umem->context->device, umem, 1);
 
-   mm = get_task_mm(current);
-   if (!mm) {
-   kfree(umem);
-   return;
-   }
+   task = get_pid_task(umem->pid, PIDTYPE_PID);
+   put_pid(umem->pid);
+   if (!task)
+   goto out;
+   mm = get_task_mm(task);
+   put_task_struct(task);
+   if (!mm)
+   goto out;
 
diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT;
 
@@ -262,9 +268,10 @@ void ib_umem_release(struct ib_umem *umem)
} else
down_write(&mm->mmap_sem);
 
-   current->mm->pinned_vm -= diff;
+   mm->pinned_vm -= diff;
up_write(&mm->mmap_sem);
mmput(mm);
+out:
kfree(umem);
 }
 EXPORT_SYMBOL(ib_umem_release);
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 1ea0b65..a2bf41e 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -47,6 +47,7 @@ struct ib_umem {
int writable;
int hugetlb;
struct work_struct  work;
+   struct pid *pid;
struct mm_struct   *mm;
unsigned long   diff;
struct sg_table sg_head;
-- 
1.7.7.6



Re: [PATCH] ib_umem_release should decrement mm->pinned_vm from ib_umem_get

2014-08-28 Thread Shawn Bohrer
On Thu, Aug 28, 2014 at 02:48:19PM +0300, Haggai Eran wrote:
> On 26/08/2014 00:07, Shawn Bohrer wrote:
> >>>> The following patch fixes the issue by storing the mm_struct of the
> >> > 
> >> > You are doing more than just storing the mm_struct - you are taking
> >> > a reference to the process' mm.  This can lead to a massive resource
> >> > leakage. The reason is bit complex: The destruction flow for IB
> >> > uverbs is based upon releasing the file handle for it. Once the file
> >> > handle is released, all MRs, QPs, CQs, PDs, etc. that the process
> >> > allocated are released.  For the kernel to release the file handle,
> >> > the kernel reference count to it needs to reach zero.  Most IB
> >> > implementations expose some hardware registers to the application by
> >> > allowing it to mmap the uverbs device file.  This mmap takes a
> >> > reference to uverbs device file handle that the application opened.
> >> > This reference is dropped when the process mm is released during the
> >> > process destruction.  Your code takes a reference to the mm that
> >> > will only be released when the parent MR/QP is released.
> >> > 
> >> > Now, we have a deadlock - the mm is waiting for the MR to be
> >> > destroyed, the MR is waiting for the file handle to be destroyed,
> >> > and the file handle is waiting for the mm to be destroyed.
> >> > 
> >> > The proper solution is to keep a reference to the task_pid (using
> >> > get_task_pid), and use this pid to get the task_struct and from it
> >> > the mm_struct during the destruction flow.
> >  
> > I'll put together a patch using get_task_pid() and see if I can
> > test/reproduce the issue.  This may take a couple of days since we
> > have to test this in production at the moment.
> >  
> 
> Hi,
> 
> I just wanted to point out that while working on the on demand paging patches
> we also needed to keep a reference to the task pid (to make sure we always 
> handle page faults on behalf of the correct mm struct). You can find the 
> relevant code in the patch titled "IB/core: Add support for on demand paging 
> regions" [1].
> 
> Haggai
> 
> [1] https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg20552.html

Haggai,

I looked over the on demand paging patch and I'm not sure if you are
suggesting that it already fixes my issue, or that I should use it as
a reference for my code.  In any case I just sent a v2 of a patch that
appears to fix my issue.

--
Shawn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] ib_umem_release should decrement mm->pinned_vm from ib_umem_get

2014-08-28 Thread Shawn Bohrer
From: Shawn Bohrer 

In debugging an application that receives -ENOMEM from ib_reg_mr() I
found that ib_umem_get() can fail because the pinned_vm count has
wrapped causing it to always be larger than the lock limit even with
RLIMIT_MEMLOCK set to RLIM_INFINITY.

The wrapping of pinned_vm occurs because the process that calls
ib_reg_mr() will have its mm->pinned_vm count incremented.  Later a
different process with a different mm_struct than the one that allocated
the ib_umem struct ends up releasing it which results in decrementing
the new process's mm->pinned_vm count past zero and wrapping.

I'm not entirely sure what circumstances cause a different process to
release the ib_umem than the one that allocated it but the kernel stack
trace of the freeing process from my situation looks like the following:

Call Trace:
 [] dump_stack+0x19/0x1b
 [] ib_umem_release+0x1f5/0x200 [ib_core]
 [] mlx4_ib_destroy_qp+0x241/0x440 [mlx4_ib]
 [] ib_destroy_qp+0x12c/0x170 [ib_core]
 [] ib_uverbs_close+0x259/0x4e0 [ib_uverbs]
 [] __fput+0xba/0x240
 [] fput+0xe/0x10
 [] task_work_run+0xc4/0xe0
 [] do_notify_resume+0x95/0xa0
 [] int_signal+0x12/0x17

The following patch fixes the issue by storing the pid struct of the
process that calls ib_umem_get() so that ib_umem_release and/or
ib_umem_account() can properly decrement the pinned_vm count of the
correct mm_struct.

Signed-off-by: Shawn Bohrer 
---

v2 changes:
* Updated to use get_task_pid to avoid keeping a reference to the mm

I've run this patch on our test pool for general testing for a few days
and today verified that it solves the reported issue above on our
production machines.


 drivers/infiniband/core/umem.c |   18 --
 include/rdma/ib_umem.h |1 +
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a3a2e9c..01750d6 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -105,6 +105,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
unsigned long addr,
umem->length= size;
umem->offset= addr & ~PAGE_MASK;
umem->page_size = PAGE_SIZE;
+   umem->pid   = get_task_pid(current, PIDTYPE_PID);
/*
 * We ask for writable memory if any access flags other than
 * "remote read" are set.  "Local write" and "remote write"
@@ -198,6 +199,7 @@ out:
if (ret < 0) {
if (need_release)
__ib_umem_release(context->device, umem, 0);
+   put_pid(umem->pid);
kfree(umem);
} else
current->mm->pinned_vm = locked;
@@ -230,15 +232,18 @@ void ib_umem_release(struct ib_umem *umem)
 {
struct ib_ucontext *context = umem->context;
struct mm_struct *mm;
+   struct task_struct *task;
unsigned long diff;
 
__ib_umem_release(umem->context->device, umem, 1);
 
-   mm = get_task_mm(current);
-   if (!mm) {
-   kfree(umem);
-   return;
-   }
+   task = get_pid_task(umem->pid, PIDTYPE_PID);
+   put_pid(umem->pid);
+   if (!task)
+   goto out;
+   mm = get_task_mm(task);
+   if (!mm)
+   goto out;
 
diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT;
 
@@ -262,9 +267,10 @@ void ib_umem_release(struct ib_umem *umem)
} else
down_write(&mm->mmap_sem);
 
-   current->mm->pinned_vm -= diff;
+   mm->pinned_vm -= diff;
up_write(&mm->mmap_sem);
mmput(mm);
+out:
kfree(umem);
 }
 EXPORT_SYMBOL(ib_umem_release);
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 1ea0b65..a2bf41e 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -47,6 +47,7 @@ struct ib_umem {
int writable;
int hugetlb;
struct work_struct  work;
+   struct pid *pid;
struct mm_struct   *mm;
unsigned long   diff;
struct sg_table sg_head;
-- 
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ib_umem_release should decrement mm->pinned_vm from ib_umem_get

2014-08-25 Thread Shawn Bohrer
On Thu, Aug 21, 2014 at 11:20:34AM +, Shachar Raindel wrote:
> Hi,
> 
> I'm afraid this patch, in its current form, will not work.
> See below for additional comments.

Thanks for the input Shachar.  I've tried to answer your questions
below.
 
> > > In debugging an application that receives -ENOMEM from ib_reg_mr() I
> > > found that ib_umem_get() can fail because the pinned_vm count has
> > > wrapped causing it to always be larger than the lock limit even with
> > > RLIMIT_MEMLOCK set to RLIM_INFINITY.
> > >
> > > The wrapping of pinned_vm occurs because the process that calls
> > > ib_reg_mr() will have its mm->pinned_vm count incremented.  Later a
> > > different process with a different mm_struct than the one that allocated
> > > the ib_umem struct ends up releasing it which results in decrementing
> > > the new processes mm->pinned_vm count past zero and wrapping.
> > >
> > > I'm not entirely sure what circumstances cause a different process to
> > > release the ib_umem than the one that allocated it but the kernel stack
> > > trace of the freeing process from my situation looks like the following:
> > >
> > > Call Trace:
> > >  [] dump_stack+0x19/0x1b
> > >  [] ib_umem_release+0x1f5/0x200 [ib_core]
> > >  [] mlx4_ib_destroy_qp+0x241/0x440 [mlx4_ib]
> > >  [] ib_destroy_qp+0x12c/0x170 [ib_core]
> > >  [] ib_uverbs_close+0x259/0x4e0 [ib_uverbs]
> > >  [] __fput+0xba/0x240
> > >  [] fput+0xe/0x10
> > >  [] task_work_run+0xc4/0xe0
> > >  [] do_notify_resume+0x95/0xa0
> > >  [] int_signal+0x12/0x17
> > >
> 
> Can you provide the details of this issue - kernel version,
> reproduction steps, etc.?  It seems like the kernel code flow which
> triggers this is delaying the FD release done at
> http://lxr.free-electrons.com/source/fs/file_table.c#L279 .  The
> code there seems to have changed (starting at kernel 3.6) to avoid
> releasing a file in interrupt context or from a kernel thread.  How
> are we ending up with releasing the uverbs device file from an
> interrupt context or a kernel thread?

We are seeing this on 3.10.* kernels.  Unfortunately I'm not quite
sure what the reproducing steps are, because we can't reliably
reproduce it.  Or rather we have been able to reliably reproduce the
issue in certain production situations, but can't seem to reproduce it
outside of production so it seems we are missing something.

What I do know is that the issue often occurs when we try to replace a
set of processes with a new set of processes.  Both process sets will
be using RC and UD QPs.  When I finally discovered what the issue was,
I clearly saw an ib_umem struct allocated in one of the processes that
was going away get released in the context of one of the newly started
processes.

> > > The following patch fixes the issue by storing the mm_struct of the
> 
> You are doing more than just storing the mm_struct - you are taking
> a reference to the process' mm.  This can lead to a massive resource
> leakage. The reason is bit complex: The destruction flow for IB
> uverbs is based upon releasing the file handle for it. Once the file
> handle is released, all MRs, QPs, CQs, PDs, etc. that the process
> allocated are released.  For the kernel to release the file handle,
> the kernel reference count to it needs to reach zero.  Most IB
> implementations expose some hardware registers to the application by
> allowing it to mmap the uverbs device file.  This mmap takes a
> reference to uverbs device file handle that the application opened.
> This reference is dropped when the process mm is released during the
> process destruction.  Your code takes a reference to the mm that
> will only be released when the parent MR/QP is released.
> 
> Now, we have a deadlock - the mm is waiting for the MR to be
> destroyed, the MR is waiting for the file handle to be destroyed,
> and the file handle is waiting for the mm to be destroyed.
> 
> The proper solution is to keep a reference to the task_pid (using
> get_task_pid), and use this pid to get the task_struct and from it
> the mm_struct during the destruction flow.
 
I'll put together a patch using get_task_pid() and see if I can
test/reproduce the issue.  This may take a couple of days since we
have to test this in production at the moment.
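
As a minimal sketch of the flow Shachar describes, using the pid APIs named
above (illustrative only; the real change is the v2/v3 patch elsewhere in
this thread):

/* At ib_umem_get() time, in the registering process: */
struct pid *pid = get_task_pid(current, PIDTYPE_PID);	/* takes a pid reference */

/* At ib_umem_release() time, possibly from a different process: */
struct task_struct *task = get_pid_task(pid, PIDTYPE_PID);	/* task reference, or NULL */
put_pid(pid);
if (task) {
	struct mm_struct *mm = get_task_mm(task);	/* mm reference, or NULL */
	put_task_struct(task);
	if (mm) {
		/* ... decrement mm->pinned_vm under mm->mmap_sem ... */
		mmput(mm);
	}
}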
 
> > > process that calls ib_umem_get() so that ib_umem_release and/or
> > > ib_umem_account() can properly decrement the pinned_vm count of the
> > > correct mm_struct.
> > >
> > > Signed-off-by: Shawn Bohrer 
> > > ---
> > >  drivers/infiniband/core/umem.c |   17 -
> > >  1 files changed, 8 insertions(+), 9 deletions(-)


Re: [PATCH] ib_umem_release should decrement mm->pinned_vm from ib_umem_get

2014-08-20 Thread Shawn Bohrer
On Tue, Aug 12, 2014 at 11:27:35AM -0500, Shawn Bohrer wrote:
> From: Shawn Bohrer 
> 
> In debugging an application that receives -ENOMEM from ib_reg_mr() I
> found that ib_umem_get() can fail because the pinned_vm count has
> wrapped causing it to always be larger than the lock limit even with
> RLIMIT_MEMLOCK set to RLIM_INFINITY.
> 
> The wrapping of pinned_vm occurs because the process that calls
> ib_reg_mr() will have its mm->pinned_vm count incremented.  Later a
> different process with a different mm_struct than the one that allocated
> the ib_umem struct ends up releasing it which results in decrementing
> the new processes mm->pinned_vm count past zero and wrapping.
> 
> I'm not entirely sure what circumstances cause a different process to
> release the ib_umem than the one that allocated it but the kernel stack
> trace of the freeing process from my situation looks like the following:
> 
> Call Trace:
>  [] dump_stack+0x19/0x1b
>  [] ib_umem_release+0x1f5/0x200 [ib_core]
>  [] mlx4_ib_destroy_qp+0x241/0x440 [mlx4_ib]
>  [] ib_destroy_qp+0x12c/0x170 [ib_core]
>  [] ib_uverbs_close+0x259/0x4e0 [ib_uverbs]
>  [] __fput+0xba/0x240
>  [] fput+0xe/0x10
>  [] task_work_run+0xc4/0xe0
>  [] do_notify_resume+0x95/0xa0
>  [] int_signal+0x12/0x17
> 
> The following patch fixes the issue by storing the mm_struct of the
> process that calls ib_umem_get() so that ib_umem_release and/or
> ib_umem_account() can properly decrement the pinned_vm count of the
> correct mm_struct.
> 
> Signed-off-by: Shawn Bohrer 
> ---
>  drivers/infiniband/core/umem.c |   17 -
>  1 files changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index a3a2e9c..32699024 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -105,6 +105,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
> unsigned long addr,
>   umem->length= size;
>   umem->offset= addr & ~PAGE_MASK;
>   umem->page_size = PAGE_SIZE;
> + umem->mm= get_task_mm(current);
>   /*
>* We ask for writable memory if any access flags other than
>* "remote read" are set.  "Local write" and "remote write"
> @@ -198,6 +199,7 @@ out:
>   if (ret < 0) {
>   if (need_release)
>   __ib_umem_release(context->device, umem, 0);
> + mmput(umem->mm);
>   kfree(umem);
>   } else
>   current->mm->pinned_vm = locked;
> @@ -229,13 +231,11 @@ static void ib_umem_account(struct work_struct *work)
>  void ib_umem_release(struct ib_umem *umem)
>  {
>   struct ib_ucontext *context = umem->context;
> - struct mm_struct *mm;
>   unsigned long diff;
>  
>   __ib_umem_release(umem->context->device, umem, 1);
>  
> - mm = get_task_mm(current);
> - if (!mm) {
> + if (!umem->mm) {
>   kfree(umem);
>   return;
>   }
> @@ -251,20 +251,19 @@ void ib_umem_release(struct ib_umem *umem)
>* we defer the vm_locked accounting to the system workqueue.
>*/
>   if (context->closing) {
> -		if (!down_write_trylock(&mm->mmap_sem)) {
> +		if (!down_write_trylock(&umem->mm->mmap_sem)) {
> 			INIT_WORK(&umem->work, ib_umem_account);
> -			umem->mm   = mm;
> 			umem->diff = diff;
> 
> 			queue_work(ib_wq, &umem->work);
> 			return;
> 		}
> 	} else
> -		down_write(&mm->mmap_sem);
> +		down_write(&umem->mm->mmap_sem);
> 
> -	current->mm->pinned_vm -= diff;
> -	up_write(&mm->mmap_sem);
> -	mmput(mm);
> +	umem->mm->pinned_vm -= diff;
> +	up_write(&umem->mm->mmap_sem);
> +	mmput(umem->mm);
>   kfree(umem);
>  }
>  EXPORT_SYMBOL(ib_umem_release);

It doesn't look like this has been applied yet.  Does anyone have any
feedback?

Thanks,
Shawn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] ib_umem_release should decrement mm->pinned_vm from ib_umem_get

2014-08-12 Thread Shawn Bohrer
From: Shawn Bohrer 

In debugging an application that receives -ENOMEM from ib_reg_mr() I
found that ib_umem_get() can fail because the pinned_vm count has
wrapped causing it to always be larger than the lock limit even with
RLIMIT_MEMLOCK set to RLIM_INFINITY.

The wrapping of pinned_vm occurs because the process that calls
ib_reg_mr() will have its mm->pinned_vm count incremented.  Later a
different process with a different mm_struct than the one that allocated
the ib_umem struct ends up releasing it which results in decrementing
the new process's mm->pinned_vm count past zero and wrapping.

I'm not entirely sure what circumstances cause a different process to
release the ib_umem than the one that allocated it but the kernel stack
trace of the freeing process from my situation looks like the following:

Call Trace:
 [] dump_stack+0x19/0x1b
 [] ib_umem_release+0x1f5/0x200 [ib_core]
 [] mlx4_ib_destroy_qp+0x241/0x440 [mlx4_ib]
 [] ib_destroy_qp+0x12c/0x170 [ib_core]
 [] ib_uverbs_close+0x259/0x4e0 [ib_uverbs]
 [] __fput+0xba/0x240
 [] fput+0xe/0x10
 [] task_work_run+0xc4/0xe0
 [] do_notify_resume+0x95/0xa0
 [] int_signal+0x12/0x17

The following patch fixes the issue by storing the mm_struct of the
process that calls ib_umem_get() so that ib_umem_release and/or
ib_umem_account() can properly decrement the pinned_vm count of the
correct mm_struct.

Signed-off-by: Shawn Bohrer 
---
 drivers/infiniband/core/umem.c |   17 -
 1 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a3a2e9c..32699024 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -105,6 +105,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
unsigned long addr,
umem->length= size;
umem->offset= addr & ~PAGE_MASK;
umem->page_size = PAGE_SIZE;
+   umem->mm= get_task_mm(current);
/*
 * We ask for writable memory if any access flags other than
 * "remote read" are set.  "Local write" and "remote write"
@@ -198,6 +199,7 @@ out:
if (ret < 0) {
if (need_release)
__ib_umem_release(context->device, umem, 0);
+   mmput(umem->mm);
kfree(umem);
} else
current->mm->pinned_vm = locked;
@@ -229,13 +231,11 @@ static void ib_umem_account(struct work_struct *work)
 void ib_umem_release(struct ib_umem *umem)
 {
struct ib_ucontext *context = umem->context;
-   struct mm_struct *mm;
unsigned long diff;
 
__ib_umem_release(umem->context->device, umem, 1);
 
-   mm = get_task_mm(current);
-   if (!mm) {
+   if (!umem->mm) {
kfree(umem);
return;
}
@@ -251,20 +251,19 @@ void ib_umem_release(struct ib_umem *umem)
 * we defer the vm_locked accounting to the system workqueue.
 */
if (context->closing) {
-   if (!down_write_trylock(&mm->mmap_sem)) {
+   if (!down_write_trylock(&umem->mm->mmap_sem)) {
INIT_WORK(&umem->work, ib_umem_account);
-   umem->mm   = mm;
umem->diff = diff;
 
queue_work(ib_wq, &umem->work);
return;
}
} else
-   down_write(&mm->mmap_sem);
+   down_write(&umem->mm->mmap_sem);
 
-   current->mm->pinned_vm -= diff;
-   up_write(&mm->mmap_sem);
-   mmput(mm);
+   umem->mm->pinned_vm -= diff;
+   up_write(&umem->mm->mmap_sem);
+   mmput(umem->mm);
kfree(umem);
 }
 EXPORT_SYMBOL(ib_umem_release);
-- 
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.10.16 cgroup_mutex deadlock

2013-11-20 Thread Shawn Bohrer
On Tue, Nov 19, 2013 at 10:55:18AM +0800, Li Zefan wrote:
> > Thanks Tejun and Hugh.  Sorry for my late entry in getting around to
> > testing this fix. On the surface it sounds correct however I'd like to
> > test this on top of 3.10.* since that is what we'll likely be running.
> > I've tried to apply Hugh's patch above on top of 3.10.19 but it
> > appears there are a number of conflicts.  Looking over the changes and
> > my understanding of the problem I believe on 3.10 only the
> > cgroup_free_fn needs to be run in a separate workqueue.  Below is the
> > patch I've applied on top of 3.10.19, which I'm about to start
> > testing.  If it looks like I botched the backport in any way please
> > let me know so I can test a propper fix on top of 3.10.19.
> > 
> 
> You didn't move css free_work to the dedicate wq as Tejun's patch does.
> css free_work won't acquire cgroup_mutex, but when destroying a lot of
> cgroups, we can have a lot of css free_work in the workqueue, so I'd
> suggest you also use cgroup_destroy_wq for it.

Well, I didn't move the css free_work, but I did test the patch I
posted on top of 3.10.19 and I am unable to reproduce the lockup so it
appears my patch was sufficient for 3.10.*.  Hopefully we can get this
fix applied and backported into stable.
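
For reference, the part of Hugh's 3.11.8 patch (quoted in the 2013-11-18
message later in this thread) that routes the css dput work through the
dedicated workqueue is the css_release() hunk; in context it reads roughly
as below.  The 3.10 css free path differs somewhat, so this is only
indicative of the shape of the change:

static void css_release(struct percpu_ref *ref)
{
	struct cgroup_subsys_state *css =
		container_of(ref, struct cgroup_subsys_state, refcnt);

	/* was: schedule_work(&css->dput_work); i.e. the system workqueue */
	queue_work(cgroup_destroy_wq, &css->dput_work);
}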

Thanks,
Shawn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.10.16 cgroup_mutex deadlock

2013-11-18 Thread Shawn Bohrer
On Sun, Nov 17, 2013 at 06:17:17PM -0800, Hugh Dickins wrote:
> Sorry for the delay: I was on the point of reporting success last
> night, when I tried a debug kernel: and that didn't work so well
> (got spinlock bad magic report in pwd_adjust_max_active(), and
> tests wouldn't run at all).
> 
> Even the non-early cgroup_init() is called well before the
> early_initcall init_workqueues(): though only the debug (lockdep
> and spinlock debug) kernel appeared to have a problem with that.
> 
> Here's the patch I ended up with successfully on a 3.11.7-based
> kernel (though below I've rediffed it against 3.11.8): the
> schedule_work->queue_work hunks are slightly different on 3.11
> than in your patch against current, and I did alloc_workqueue()
> from a separate core_initcall.
> 
> The interval between cgroup_init and that is a bit of a worry;
> but we don't seem to have suffered from the interval between
> cgroup_init and init_workqueues before (when system_wq is NULL)
> - though you may have more courage than I to reorder them!
> 
> Initially I backed out my system_highpri_wq workaround, and
> verified that it was still easy to reproduce the problem with
> one of our cgroup stresstests.  Yes it was, then your modified
> patch below convincingly fixed it.
> 
> I ran with Johannes's patch adding extra mem_cgroup_reparent_charges:
> as I'd expected, that didn't solve this issue (though it's worth
> our keeping it in to rule out another source of problems).  And I
> checked back on dumps of failures: they indeed show the tell-tale
> 256 kworkers doing cgroup_offline_fn, just as you predicted.
> 
> Thanks!
> Hugh
> 
> ---
>  kernel/cgroup.c | 30 +++---
>  1 file changed, 27 insertions(+), 3 deletions(-)
> 
> --- 3.11.8/kernel/cgroup.c 2013-11-17 17:40:54.200640692 -0800
> +++ linux/kernel/cgroup.c 2013-11-17 17:43:10.876643941 -0800
> @@ -89,6 +89,14 @@ static DEFINE_MUTEX(cgroup_mutex);
>  static DEFINE_MUTEX(cgroup_root_mutex);
>  
>  /*
> + * cgroup destruction makes heavy use of work items and there can be a lot
> + * of concurrent destructions.  Use a separate workqueue so that cgroup
> + * destruction work items don't end up filling up max_active of system_wq
> + * which may lead to deadlock.
> + */
> +static struct workqueue_struct *cgroup_destroy_wq;
> +
> +/*
>   * Generate an array of cgroup subsystem pointers. At boot time, this is
>   * populated with the built in subsystems, and modular subsystems are
>   * registered after that. The mutable section of this array is protected by
> @@ -890,7 +898,7 @@ static void cgroup_free_rcu(struct rcu_h
>   struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);
>  
>   INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);
> - schedule_work(&cgrp->destroy_work);
> + queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
>  }
>  
>  static void cgroup_diput(struct dentry *dentry, struct inode *inode)
> @@ -4205,7 +4213,7 @@ static void css_release(struct percpu_re
>   struct cgroup_subsys_state *css =
>   container_of(ref, struct cgroup_subsys_state, refcnt);
>  
> - schedule_work(&css->dput_work);
> + queue_work(cgroup_destroy_wq, &css->dput_work);
>  }
>  
>  static void init_cgroup_css(struct cgroup_subsys_state *css,
> @@ -4439,7 +4447,7 @@ static void cgroup_css_killed(struct cgr
>  
>   /* percpu ref's of all css's are killed, kick off the next step */
>   INIT_WORK(&cgrp->destroy_work, cgroup_offline_fn);
> - schedule_work(&cgrp->destroy_work);
> + queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
>  }
>  
>  static void css_ref_killed_fn(struct percpu_ref *ref)
> @@ -4967,6 +4975,22 @@ out:
>   return err;
>  }
>  
> +static int __init cgroup_destroy_wq_init(void)
> +{
> + /*
> +  * There isn't much point in executing destruction path in
> +  * parallel.  Good chunk is serialized with cgroup_mutex anyway.
> +  * Use 1 for @max_active.
> +  *
> +  * We would prefer to do this in cgroup_init() above, but that
> +  * is called before init_workqueues(): so leave this until after.
> +  */
> + cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
> + BUG_ON(!cgroup_destroy_wq);
> + return 0;
> +}
> +core_initcall(cgroup_destroy_wq_init);
> +
>  /*
>   * proc_cgroup_show()
>   *  - Print task's cgroup paths into seq_file, one line for each hierarchy

Thanks Tejun and Hugh.  Sorry for my late entry in getting around to
testing this fix. On the surface it sounds correct however I'd like to
test this on top of 3.10.* since that is what we'll likely be running.
I've tried to apply Hugh's patch above on top of 3.10.19 but it
appears there are a number of conflicts.  Looking over the changes and
my understanding of the problem I believe on 3.10 only the
cgroup_free_fn needs to be run in a separate workqueue.  Below is the
patch I've applied on top of 3.10.19, which I'm about to start
testing.  If it looks like I botched the backport in any way please
let me know so I can test a proper fix on top of 3.10.19.


Re: 3.10.16 cgroup_mutex deadlock

2013-11-14 Thread Shawn Bohrer
On Tue, Nov 12, 2013 at 05:55:04PM +0100, Michal Hocko wrote:
> On Tue 12-11-13 09:55:30, Shawn Bohrer wrote:
> > On Tue, Nov 12, 2013 at 03:31:47PM +0100, Michal Hocko wrote:
> > > On Tue 12-11-13 18:17:20, Li Zefan wrote:
> > > > Cc more people
> > > > 
> > > > On 2013/11/12 6:06, Shawn Bohrer wrote:
> > > > > Hello,
> > > > > 
> > > > > This morning I had a machine running 3.10.16 go unresponsive but
> > > > > before we killed it we were able to get the information below.  I'm
> > > > > not an expert here but it looks like most of the tasks below are
> > > > > blocking waiting on the cgroup_mutex.  You can see that the
> > > > > resource_alloca:16502 task is holding the cgroup_mutex and that task
> > > > > appears to be waiting on a lru_add_drain_all() to complete.
> > > 
> > > Do you have sysrq+l output as well by any chance? That would tell
> > > us what the current CPUs are doing. Dumping all kworker stacks
> > > might be helpful as well. We know that lru_add_drain_all waits for
> > > schedule_on_each_cpu to return so it is waiting for workers to finish.
> > > I would be really curious why some of lru_add_drain_cpu cannot finish
> > > properly. The only reason would be that some work item(s) do not get CPU
> > > or somebody is holding lru_lock.
> > 
> > In fact the sys-admin did manage to fire off a sysrq+l, I've put all
> > of the info from the syslog below.  I've looked it over and I'm not
> > sure it reveals anything.  First looking at the timestamps it appears
> > we ran the sysrq+l 19.2 hours after the cgroup_mutex lockup I
> > previously sent.
> 
> I would expect sysrq+w would still show those kworkers blocked on the
> same cgroup mutex?

Yes, I believe so.

> > I also have atop logs over that whole time period
> > that show hundreds of zombie processes which to me indicates that over
> > that 19.2 hours systemd remained wedged on the cgroup_mutex.  Looking
> > at the backtraces from the sysrq+l it appears most of the CPUs were
> > idle
> 
> Right so either we managed to sleep with the lru_lock held which sounds
> a bit improbable - but who knows - or there is some other problem. I
> would expect the later to be true.
> 
> lru_add_drain executes per-cpu and preemption disabled this means that
> its work item cannot be preempted so the only logical explanation seems
> to be that the work item has never got scheduled.

Meaning you think there would be no kworker thread for the
lru_add_drain at this point?  If so you might be correct.
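
For context: lru_add_drain_all() goes through schedule_on_each_cpu(), which
queues one work item per online CPU on the system workqueue and then flushes
each of them.  The shape of that helper in 3.10-era kernels is roughly the
following; this is a from-memory sketch, not the verbatim source:

int schedule_on_each_cpu(work_func_t func)
{
	int cpu;
	struct work_struct __percpu *works = alloc_percpu(struct work_struct);

	if (!works)
		return -ENOMEM;

	get_online_cpus();
	for_each_online_cpu(cpu) {
		struct work_struct *work = per_cpu_ptr(works, cpu);

		INIT_WORK(work, func);
		schedule_work_on(cpu, work);		/* queued on system_wq */
	}
	for_each_online_cpu(cpu)
		flush_work(per_cpu_ptr(works, cpu));	/* waits for each item to run */
	put_online_cpus();
	free_percpu(works);
	return 0;
}

So if the per-cpu worker pools are saturated with work items that are
themselves blocked on cgroup_mutex, the lru_add_drain work item never gets
to run and the flush_work() above never returns, which would match the stuck
cpuset_attach/lru_add_drain_all() described in this thread.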

> OK. In case the issue happens again. It would be very helpful to get the
> kworker and per-cpu stacks. Maybe Tejun can help with some waitqueue
> debugging tricks.

I set up one of my test pools with two scripts trying to reproduce the
problem.  One essentially puts tasks into several cpuset groups that
have cpuset.memory_migrate set, then takes them back out.  It also
occasionally switches cpuset.mems in those groups to try to keep the
memory of those tasks migrating between nodes.  The second script is:

$ cat /home/hbi/cgroup_mutex_cgroup_maker.sh 
#!/bin/bash

session_group=$(ps -o pid,cmd,cgroup -p $$ | grep -E 'c[0-9]+' -o)
cd /sys/fs/cgroup/systemd/user/hbi/${session_group}
pwd

while true; do
for x in $(seq 1 1000); do
mkdir $x
echo $$ > ${x}/tasks
echo $$ > tasks
rmdir $x
done
sleep .1
date
done

After running both concurrently on 40 machines for about 12 hours I've
managed to reproduce the issue at least once, possibly more.  One
machine looked identical to this reported issue.  It has a bunch of
stuck cgroup_free_fn() kworker threads and one thread in cpuset_attach
waiting on lru_add_drain_all().  A sysrq+l shows all CPUs are idle
except for the one triggering the sysrq+l.  The sysrq+w unfortunately
wrapped dmesg so we didn't get the stacks of all blocked tasks.  We
did however also cat /proc/<pid>/stack of all kworker threads on the
system.  There were 265 kworker threads that all have the following
stack:

[kworker/2:1]
[] cgroup_free_fn+0x2c/0x120
[] process_one_work+0x174/0x490
[] worker_thread+0x11c/0x370
[] kthread+0xc0/0xd0
[] ret_from_fork+0x7c/0xb0
[] 0x

And there were another 101 that had stacks like the following:

[kworker/0:0]
[] worker_thread+0x1bf/0x370
[] kthread+0xc0/0xd0
[] ret_from_fork+0x7c/0xb0
[] 0x

That's it.  Again I'm not sure if that is helpful at all but it seems
to imply that the lru_add_drain_work was not scheduled.

I also managed to kill another two machines running my test.  One of
them we didn't get anything out of, and the other looks like I
deadlocked on the css_set_lock lock.  I'

3.10.16 cgroup_mutex deadlock

2013-11-11 Thread Shawn Bohrer
Hello,

This morning I had a machine running 3.10.16 go unresponsive but
before we killed it we were able to get the information below.  I'm
not an expert here but it looks like most of the tasks below are
blocking waiting on the cgroup_mutex.  You can see that the
resource_alloca:16502 task is holding the cgroup_mutex and that task
appears to be waiting on a lru_add_drain_all() to complete.

Initially I thought the deadlock might simply be that the per cpu
workqueue work from lru_add_drain_all() is stuck waiting on the
cgroup_free_fn to complete.  However I've read
Documentation/workqueue.txt and it sounds like the current workqueue
has multiple kworker threads per cpu and thus this should not happen.
Both the cgroup_free_fn work and the lru_add_drain_all() work run on the
system_wq, which has max_active set to 0, so I believe multiple kworker
threads should run.  This also appears to be true since all of the
cgroup_free_fn work items are running on the kworker/12 threads and
multiple of them are blocked.
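
As a quick sanity check of the multiple-kworkers-per-pool assumption,
something like this shows how many kworker threads each pool currently
has (just an illustration; it says nothing about which work items they
are running):

# Count kworker threads per pool; names look like kworker/<cpu>:<id>,
# or kworker/u<n>:<id> for unbound pools.
ps -eo comm | grep '^kworker' | cut -d: -f1 | sort | uniq -c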

Perhaps someone with more experience in the cgroup and workqueue code
can look at the stacks below and identify the problem, or explain why
the lru_add_drain_all() work has not completed:


[694702.013850] INFO: task systemd:1 blocked for more than 120 seconds.
[694702.015794] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
[694702.018217] systemd D 81607820 0 1  0 0x
[694702.020505]  88041dcc1d78 0086 88041dc7f100 
8110ad54
[694702.023006]  0001 88041dc78000 88041dcc1fd8 
88041dcc1fd8
[694702.025508]  88041dcc1fd8 88041dc78000 88041a1e8698 
81a417c0
[694702.028011] Call Trace:
[694702.028788]  [] ? vma_merge+0x124/0x330
[694702.030468]  [] schedule+0x29/0x70
[694702.032011]  [] schedule_preempt_disabled+0xe/0x10
[694702.033982]  [] __mutex_lock_slowpath+0x112/0x1b0
[694702.035926]  [] ? kmem_cache_alloc_trace+0x12d/0x160
[694702.037948]  [] mutex_lock+0x2a/0x50
[694702.039546]  [] proc_cgroup_show+0x67/0x1d0
[694702.041330]  [] seq_read+0x16b/0x3e0
[694702.042927]  [] vfs_read+0xb0/0x180
[694702.044498]  [] SyS_read+0x52/0xa0
[694702.046042]  [] system_call_fastpath+0x16/0x1b
[694702.047917] INFO: task kworker/12:1:203 blocked for more than 120 seconds.
[694702.050044] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
[694702.052467] kworker/12:1D  0   203  2 0x
[694702.054756] Workqueue: events cgroup_free_fn
[694702.056139]  88041bc1fcf8 0046 88038e7b46a0 
00030001
[694702.058642]  88041bc1fd84 88041da6e9f0 88041bc1ffd8 
88041bc1ffd8
[694702.061144]  88041bc1ffd8 88041da6e9f0 0087 
81a417c0
[694702.063647] Call Trace:
[694702.064423]  [] schedule+0x29/0x70
[694702.065966]  [] schedule_preempt_disabled+0xe/0x10
[694702.067936]  [] __mutex_lock_slowpath+0x112/0x1b0
[694702.069879]  [] mutex_lock+0x2a/0x50
[694702.071476]  [] cgroup_free_fn+0x2c/0x120
[694702.073209]  [] process_one_work+0x174/0x490
[694702.075019]  [] worker_thread+0x11c/0x370
[694702.076748]  [] ? manage_workers+0x2c0/0x2c0
[694702.078560]  [] kthread+0xc0/0xd0
[694702.080078]  [] ? flush_kthread_worker+0xb0/0xb0
[694702.081995]  [] ret_from_fork+0x7c/0xb0
[694702.083671]  [] ? flush_kthread_worker+0xb0/0xb0
[694702.085595] INFO: task systemd-logind:2885 blocked for more than 120 
seconds.
[694702.087801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
[694702.090225] systemd-logind  D 81607820 0  2885  1 0x
[694702.092513]  88041ac6fd88 0082 88041dd8aa60 
88041d9bc1a8
[694702.095014]  88041ac6fda0 88041cac9530 88041ac6ffd8 
88041ac6ffd8
[694702.097517]  88041ac6ffd8 88041cac9530 0c36 
81a417c0
[694702.100019] Call Trace:
[694702.100793]  [] schedule+0x29/0x70
[694702.102338]  [] schedule_preempt_disabled+0xe/0x10
[694702.104309]  [] __mutex_lock_slowpath+0x112/0x1b0
[694702.198316]  [] mutex_lock+0x2a/0x50
[694702.292456]  [] cgroup_lock_live_group+0x1d/0x40
[694702.386833]  [] cgroup_mkdir+0xa8/0x4b0
[694702.480679]  [] vfs_mkdir+0x84/0xd0
[694702.574124]  [] SyS_mkdirat+0x5e/0xe0
[694702.666986]  [] SyS_mkdir+0x19/0x20
[694702.758969]  [] system_call_fastpath+0x16/0x1b
[694702.848295] INFO: task kworker/12:2:11512 blocked for more than 120 seconds.
[694702.935749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
[694703.023603] kworker/12:2D 816079c0 0 11512  2 0x
[694703.109993] Workqueue: events cgroup_free_fn
[694703.193213]  88041b9dfcf8 0046 88041da6e9f0 
ea00106fd240
[694703.278353]  88041f803c00 8803824254c0 88041b9dffd8 
88041b9dffd8
[694703.363757]  88041b9dffd8 8803824254c0 001f17887bb1 
81a417c0
[694703.448550] Call Trace:
[694703.531773]  [] 

3.10.16 general protection fault kmem_cache_alloc+0x67/0x170

2013-11-04 Thread Shawn Bohrer
I had a machine crash this weekend running a 3.10.16 kernel that
additionally has a few backported networking patches for performance
improvements.  At this point I can't rule out that the bug is from
those patches, and I haven't yet tried to see if I can reproduce the
crash.  I did happen to have kdump configured so I've got a crash dump
that I've been poking at but I'm not an expert here so hopefully
someone can provide some guidance on what I'm looking at and/or where
the bug might be.

Below is the more detailed info with some of my comments interspersed.
If anyone has any questions or suggestions I'd appreciate it.

[1448642.601229] general protection fault:  [#1] SMP 
[1448642.602448] Modules linked in: mpt2sas scsi_transport_sas raid_class 
mptctl mptbase dell_rbu ipmi_devintf ipmi_si ipmi_msghandler lockd 8021q mrp 
garp stp llc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm 
ib_addr iw_cxgb3 mlx4_ib ib_sa ib_mad ib_core mlx4_en ext4 jbd2 mbcache joydev 
fuse ses bnx2 coretemp mlx4_core cxgb3 hwmon mdio enclosure iTCO_wdt 
iTCO_vendor_support freq_table mperf wmi ehci_pci ehci_hcd dcdbas serio_raw 
microcode lpc_ich mfd_core sunrpc ipv6 autofs4 crc32c_intel megaraid_sas 
uhci_hcd dm_mirror dm_region_hash dm_log dm_mod
[1448642.616810] CPU: 11 PID: 27807 Comm: primary_nic_is_ Not tainted 
3.10.16-1.rgm.fc16.x86_64 #1
[1448642.618941] Hardware name: Dell Inc. PowerEdge R610/0XDN97, BIOS 6.3.0 
07/24/2012
[1448642.620639] task: 8806628c3880 ti: 88060437 task.ti: 
88060437
[1448642.622335] RIP: 0010:[]  [] 
kmem_cache_alloc+0x67/0x170
[1448642.624286] RSP: 0018:880604371d70  EFLAGS: 00010282
[1448642.625500] RAX:  RBX: 8806628c3880 RCX: 
7ecb996a
[1448642.627415] RDX: 7ecb9969 RSI: 00d0 RDI: 
00015900
[1448642.629077] RBP: 880604371dc0 R08: 880667d55900 R09: 

[1448642.630697] R10:  R11: 00015ea8 R12: 
880c67003800
[1448642.632316] R13: d17b94d6641aebfb R14: 81064d68 R15: 
00d0
[1448642.633936] FS:  7f8018827700() GS:880667d4() 
knlGS:
[1448642.635768] CS:  0010 DS:  ES:  CR0: 8005003b
[1448642.637497] CR2: 006eded4 CR3: 00066368b000 CR4: 
07e0
[1448642.639230] DR0:  DR1:  DR2: 

[1448642.640849] DR3:  DR6: 0ff0 DR7: 
0400
[1448642.642468] Stack:
[1448642.642950]  ff9c 8806628c3880 8806628c3880 
0002
[1448642.644780]  8806628c3880 8806628c3880  
01200011
[1448642.646848]   7f80188279d0 880604371de0 
81064d68
[1448642.648987] Call Trace:
[1448642.649568]  [] prepare_creds+0x28/0x160
[1448642.650822]  [] copy_creds+0x36/0x160
[1448642.652019]  [] copy_process+0x310/0x14b0
[1448642.653295]  [] ? __alloc_fd+0x45/0x110
[1448642.654529]  [] do_fork+0x9c/0x280
[1448642.655668]  [] ? get_unused_fd_flags+0x30/0x40
[1448642.657473]  [] ? __do_pipe_flags+0x7f/0xc0
[1448642.658808]  [] ? __fd_install+0x2b/0x60
[1448642.660062]  [] SyS_clone+0x16/0x20
[1448642.661222]  [] stub_clone+0x69/0x90
[1448642.662399]  [] ? system_call_fastpath+0x16/0x1b
[1448642.663807] Code: 00 49 8b 50 08 4d 8b 28 49 8b 40 10 4d 85 ed 0f 84 f7 00 
00 00 48 85 c0 0f 84 ee 00 00 00 49 63 44 24 20 48 8d 4a 01 49 8b 3c 24 <49> 8b 
5c 05 00 4c 89 e8 65 48 0f c7 0f 0f 94 c0 84 c0 74 b5 49 
[1448642.671081] RIP  [] kmem_cache_alloc+0x67/0x170
[1448642.672508]  RSP 
[1448642.673330] ---[ end trace fe4b503d6f77c801 ]---
[1448642.674408] general protection fault:  [#2] SMP 
[1448642.675623] Modules linked in: mpt2sas scsi_transport_sas raid_class 
mptctl mptbase dell_rbu ipmi_devintf ipmi_si ipmi_msghandler lockd 8021q mrp 
garp stp llc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm 
ib_addr iw_cxgb3 mlx4_ib ib_sa ib_mad ib_core mlx4_en ext4 jbd2 mbcache joydev 
fuse ses bnx2 coretemp mlx4_core cxgb3 hwmon mdio enclosure iTCO_wdt 
iTCO_vendor_support freq_table mperf wmi ehci_pci ehci_hcd dcdbas serio_raw 
microcode lpc_ich mfd_core sunrpc ipv6 autofs4 crc32c_intel megaraid_sas 
uhci_hcd dm_mirror dm_region_hash dm_log dm_mod
[1448642.690185] CPU: 11 PID: 27807 Comm: primary_nic_is_ Tainted: G  D 
 3.10.16-1.rgm.fc16.x86_64 #1
[1448642.692328] Hardware name: Dell Inc. PowerEdge R610/0XDN97, BIOS 6.3.0 
07/24/2012
[1448642.694027] task: 8806628c3880 ti: 88060437 task.ti: 
88060437
[1448642.695726] RIP: 0010:[]  [] 
kmem_cache_alloc+0x67/0x170
[1448642.698133] RSP: 0018:880667d43ad0  EFLAGS: 00010282
[1448642.792981] RAX:  RBX: 81a9c580 RCX: 
7ecb996a
[1448642.889673] RDX: 7ecb9969 RSI: 0020 RDI: 
00015900
[1448642.994115] RBP: 880667d43b20 R08: 880667d55900 R09: 
81a9e5a0
[1448643.090682] R10: 

[tip:sched/core] sched/rt: Remove redundant nr_cpus_allowed test

2013-10-06 Thread tip-bot for Shawn Bohrer
Commit-ID:  6bfa687c19b7ab8adee03f0d43c197c2945dd869
Gitweb: http://git.kernel.org/tip/6bfa687c19b7ab8adee03f0d43c197c2945dd869
Author: Shawn Bohrer 
AuthorDate: Fri, 4 Oct 2013 14:24:53 -0500
Committer:  Ingo Molnar 
CommitDate: Sun, 6 Oct 2013 11:28:40 +0200

sched/rt: Remove redundant nr_cpus_allowed test

In 76854c7e8f3f4172fef091e78d88b3b751463ac6 ("sched: Use
rt.nr_cpus_allowed to recover select_task_rq() cycles") an
optimization was added to select_task_rq_rt() that immediately
returns when p->nr_cpus_allowed == 1 at the beginning of the
function.

This makes the latter p->nr_cpus_allowed > 1 check redundant,
which can now be removed.

Signed-off-by: Shawn Bohrer 
Reviewed-by: Steven Rostedt 
Cc: Mike Galbraith 
Cc: t...@rgmadvisors.com
Cc: Peter Zijlstra 
Link: 
http://lkml.kernel.org/r/1380914693-24634-1-git-send-email-shawn.boh...@gmail.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/rt.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 01970c8..ceebfba 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1213,8 +1213,7 @@ select_task_rq_rt(struct task_struct *p, int sd_flag, int 
flags)
 */
if (curr && unlikely(rt_task(curr)) &&
(curr->nr_cpus_allowed < 2 ||
-curr->prio <= p->prio) &&
-   (p->nr_cpus_allowed > 1)) {
+curr->prio <= p->prio)) {
int target = find_lowest_rq(p);
 
if (target != -1)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched/rt: Remove redundant nr_cpus_allowed test

2013-10-04 Thread Shawn Bohrer
From: Shawn Bohrer 

In 76854c7e8f3f4172fef091e78d88b3b751463ac6 "sched: Use
rt.nr_cpus_allowed to recover select_task_rq() cycles" an optimization
was added to select_task_rq_rt() that immediately returns when
p->nr_cpus_allowed == 1 at the beginning of the function.  This makes
the latter p->nr_cpus_allowed > 1 check redundant and can be removed.

Signed-off-by: Shawn Bohrer 
---
 kernel/sched/rt.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 01970c8..ceebfba 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1213,8 +1213,7 @@ select_task_rq_rt(struct task_struct *p, int sd_flag, int 
flags)
 */
if (curr && unlikely(rt_task(curr)) &&
(curr->nr_cpus_allowed < 2 ||
-curr->prio <= p->prio) &&
-   (p->nr_cpus_allowed > 1)) {
+curr->prio <= p->prio)) {
int target = find_lowest_rq(p);
 
if (target != -1)
-- 
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] USB: Fix compilation error when CONFIG_PM disabled

2013-08-26 Thread Shawn Bohrer
Commit 9a11899c5e699a8d "USB: OHCI: add missing PCI PM callbacks to
ohci-pci.c" Introduced the following compilation errors when power
management is disabled:

drivers/usb/host/ohci-pci.c: In function 'ohci_pci_init':
drivers/usb/host/ohci-pci.c:309:35: error: 'ohci_suspend' undeclared (first use 
in this function)
drivers/usb/host/ohci-pci.c:309:35: note: each undeclared identifier is 
reported only once for each function it appears in
drivers/usb/host/ohci-pci.c:310:34: error: 'ohci_resume' undeclared (first use 
in this function)

ohci_suspend and ohci_resume are only defined when CONFIG_PM is defined
so only use them under CONFIG_PM.

Signed-off-by: Shawn Bohrer 
---
 drivers/usb/host/ohci-pci.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/usb/host/ohci-pci.c b/drivers/usb/host/ohci-pci.c
index 0f1d193..062b410 100644
--- a/drivers/usb/host/ohci-pci.c
+++ b/drivers/usb/host/ohci-pci.c
@@ -305,9 +305,11 @@ static int __init ohci_pci_init(void)
 
ohci_init_driver(&ohci_pci_hc_driver, &pci_overrides);
 
+#ifdef CONFIG_PM
/* Entries for the PCI suspend/resume callbacks are special */
ohci_pci_hc_driver.pci_suspend = ohci_suspend;
ohci_pci_hc_driver.pci_resume = ohci_resume;
+#endif
 
return pci_register_driver(&ohci_pci_driver);
 }
-- 
1.7.7.6


-- 

---
This email, along with any attachments, is confidential. If you 
believe you received this message in error, please contact the 
sender immediately and delete all copies of the message.  
Thank you.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net] net: rename and move busy poll mib counter

2013-08-06 Thread Shawn Bohrer
On Tue, Aug 06, 2013 at 03:14:48AM -0700, Eric Dumazet wrote:
> On Tue, 2013-08-06 at 12:52 +0300, Eliezer Tamir wrote:
> > Move the low latency mib counter to the ip section.
> > Rename it from low latency to busy poll.
> > 
> > Reported-by: Shawn Bohrer 
> > Signed-off-by: Eliezer Tamir 
> > ---
> 
> Well, it should not be part of IP mib, but a socket one (not existing so
> far)
> 
> Linux MIB already contains few non TCP counters :
> 
> LINUX_MIB_ARPFILTER
> LINUX_MIB_IPRPFILTER

Doesn't mean they are in the correct place either, but perhaps it's too
late for them.

> Its mostly populated by TCP counters, sure.

See, on the kernel side these are called "LINUX_MIB*" which seems
perfectly sane and I wouldn't even think the statistic is out of
place.  On the user-mode side these are all reported in
/proc/net/netstat as TcpExt statistics.  I can tell you that I don't
look at TCP statistics when I'm debugging/testing UDP issues
(apparently I should).
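
For reference, the counters being discussed are the name/value pairs on
the TcpExt lines of /proc/net/netstat.  Something like the following
pairs them up; the exact counter name to grep for (BusyPollRxPackets
here) is an assumption based on the rename in this patch:

# Pair up the TcpExt header line with its value line and pick out the
# busy poll counter (counter name assumed; adjust for your kernel).
awk '/^TcpExt:/ && !n { split($0, names); n = NF; next }
     /^TcpExt:/ { for (i = 2; i <= NF; i++) print names[i], $i }' \
    /proc/net/netstat | grep -i busypoll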

--
Shawn

-- 

---
This email, along with any attachments, is confidential. If you 
believe you received this message in error, please contact the 
sender immediately and delete all copies of the message.  
Thank you.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.10-rc4 stalls during mmap writes

2013-06-11 Thread Shawn Bohrer
On Tue, Jun 11, 2013 at 06:53:15AM +1000, Dave Chinner wrote:
> On Mon, Jun 10, 2013 at 01:45:59PM -0500, Shawn Bohrer wrote:
> > On Sun, Jun 09, 2013 at 01:37:44PM +1000, Dave Chinner wrote:
> > > On Fri, Jun 07, 2013 at 02:37:12PM -0500, Shawn Bohrer wrote:
> > So to summarize it appears that most of the time was spent with
> > kworker/4:0 blocking in xfs_buf_lock() and kworker/2:1H, which is woken
> > up by a softirq, is the one that calls xfs_buf_unlock().  Assuming I'm
> > not missing some important intermediate steps does this provide any
> > more information about what resource I'm actually waiting for?  Does
> > this point to any changes that happened after 3.4?  Are there any tips
> > that could help minimize these contentions?
> 
> The only difference between this and 3.4 is the allocation workqueue
> thread. That, however, won't be introducing second long delays. What
> you are seeing here is simply the latency of waiting for
> background metadata IO to complete during an allocation which has
> the ilock held

Again thank you for your analysis Dave.  I've taken a step back to
look at the big picture and that allowed me to identify what _has_
changed between 3.4 and 3.10.  What changed is the behavior of
vm.dirty_expire_centisecs.  Honestly, the previous behavior never made
any sense to me and I'm not entirely sure the current behavior does
either.

In the workload I've been debugging we append data to many small files
using mmap.  The writes are small and the total data rate is very low
thus for most files it may take several minutes to fill a page.
Having low-latency writes is important, but as you know stalls are
always possible.  One way to reduce the probability of a stall is to
reduce the frequency of writeback, and adjusting
vm.dirty_expire_centisecs and/or vm.dirty_writeback_centisecs should
allow us to do that.

On kernels 3.4 and older we chose to increase
vm.dirty_expire_centisecs to 30000 since we can comfortably lose 5
minutes of data in the event of a system failure and we believed this
would cause a fairly consistent low data rate as every
vm.dirty_writeback_centisecs (5s) it would write back all dirty pages
that were vm.dirty_expire_centisecs (5min) old.  On old kernels that
isn't exactly what happened.  Instead every 5 minutes there would be a
burst of writeback and a slow trickle at all other times.  This also
reduced the total amount of data written back since the same dirty
page wasn't written back every 30 seconds.  This also virtually
eliminated the stalls we saw so it was left alone.

On 3.10 vm.dirty_expire_centisecs=30000 no longer does the same thing.
Honestly I'm not sure what it does, but the result is a fairly
consistent high data rate being written back to disk.  The fact that
it is consistent might lead me to believe that it writes back all pages
that are vm.dirty_expire_centisecs old every
vm.dirty_writeback_centisecs, but the data rate is far too high for
that to be true.  It appears that I can effectively get the same old
behavior by setting vm.dirty_writeback_centisecs=30000.
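
Concretely, the two knobs discussed above (assuming the 30000
centisecond, i.e. 5 minute, value) look like this; the second setting is
the workaround that restores the old behavior on 3.10:

# What we ran on 3.4: only expire dirty pages after ~5 minutes.
sysctl -w vm.dirty_expire_centisecs=30000
# What appears to give the same behavior on 3.10: slow the flusher
# wakeups themselves down to ~5 minutes.
sysctl -w vm.dirty_writeback_centisecs=30000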

--
Shawn

-- 

---
This email, along with any attachments, is confidential. If you 
believe you received this message in error, please contact the 
sender immediately and delete all copies of the message.  
Thank you.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.10-rc4 stalls during mmap writes

2013-06-10 Thread Shawn Bohrer
On Sun, Jun 09, 2013 at 01:37:44PM +1000, Dave Chinner wrote:
> On Fri, Jun 07, 2013 at 02:37:12PM -0500, Shawn Bohrer wrote:
> > So I guess my question is does anyone know why I'm now seeing these
> > stalls with 3.10?
> 
> Because we made all metadata updates in XFS fully transactional in
> 3.4:
> 
> commit 8a9c9980f24f6d86e0ec0150ed35fba45d0c9f88
> Author: Christoph Hellwig 
> Date:   Wed Feb 29 09:53:52 2012 +
> 
> xfs: log timestamp updates
> 
> Timestamps on regular files are the last metadata that XFS does not update
> transactionally.  Now that we use the delaylog mode exclusively and made
> the log scode scale extremly well there is no need to bypass that code for
> timestamp updates.  Logging all updates allows to drop a lot of code, and
> will allow for further performance improvements later on.
> 
> Note that this patch drops optimized handling of fdatasync - it will be
> added back in a separate commit.
> 
> Reviewed-by: Dave Chinner 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Mark Tinguely 
> Signed-off-by: Ben Myers 
> $ git describe --contains 8a9c998
> v3.4-rc1~55^2~23
> 
> IOWs, you're just lucky you haven't noticed it on 3.4
> 
> > Are there any suggestions for how to eliminate them?
> 
> Nope. You're stuck with it - there's far more places in the page
> fault path where you can get stuck on the same lock for the same
> reason - e.g. during block mapping for the newly added pagecache
> page...
> 
> Hint: mmap() does not provide -deterministic- low latency access to
> mapped pages - it is only "mostly low latency".

Hi Dave, I appreciate your time and analysis.  I am sadly aware that
doing file I/O and expecting low latency is a bit of a stretch.  This
is also not the first time I've battled these types of stalls and
realize that the best I can do is search for opportunities to reduce
the probability of a stall, or find ways to reduce the duration of a
stall.

In this case since updating timestamps has been transactional and in
place since 3.4, it is obvious to me that this is not the cause of
both the increased rate and duration of the stalls on 3.10.  I assure
you on 3.4 we have very few stalls that are greater than 10
milliseconds in our normal workload and with 3.10 I'm seeing them
regularly.  Since we know I can, and likely do, hit the same code path
on 3.4 that tells me that the xfs_ilock() is likely being held
for a longer duration on the new kernel.  Let's see if we can
determine why the lock is held for so long.

Here is an attempt at that, as events unfold in time order.  I'm in no
way a filesystem developer so any input or analysis of what we're
waiting on is appreciated.  There is also multiple kworker threads
involved which certainly complicates things.


It starts with kworker/u49:0 which acquires xfs_ilock() inside
iomap_write_allocate().


   kworker/u49:0-15748 [004] 256032.180361: funcgraph_entry:   
|  xfs_iomap_write_allocate() {
   kworker/u49:0-15748 [004] 256032.180363: funcgraph_entry:0.074 us   
|xfs_ilock();


In the next two chunks it appears that kworker/u49:0 calls
xfs_bmapi_allocate which offloads that work to kworker/4:0 calling
__xfs_bmapi_allocate().  kworker/4:0 ends up blocking on xfs_buf_lock().


 kworker/4:0-27520 [004] 256032.180389: sched_switch: 
prev_comm=kworker/4:0 prev_pid=27520 prev_prio=120 prev_state=D ==> 
next_comm=kworker/u49:0 next_pid=15748 next_prio=120
 kworker/4:0-27520 [004] 256032.180393: kernel_stack: 
=> schedule (814ca379)
=> schedule_timeout (814c810d)
=> __down_common (814c8e5e)
=> __down (814c8f26)
=> down (810658e1)
=> xfs_buf_lock (811c12a4)
=> _xfs_buf_find (811c1469)
=> xfs_buf_get_map (811c16e4)
=> xfs_buf_read_map (811c2691)
=> xfs_trans_read_buf_map (81225fa9)
=> xfs_btree_read_buf_block.constprop.6 (811f2242)
=> xfs_btree_lookup_get_block (811f22fb)
=> xfs_btree_lookup (811f6707)
=> xfs_alloc_ag_vextent_near (811d9d52)
=> xfs_alloc_ag_vextent (811da8b5)
=> xfs_alloc_vextent (811db545)
=> xfs_bmap_btalloc (811e6951)
=> xfs_bmap_alloc (811e6dee)
=> __xfs_bmapi_allocate (811ec024)
=> xfs_bmapi_allocate_worker (811ec283)
=> process_one_work (81059104)
=> worker_thread (8105a1bc)
=> kthread (810605f0)
=> ret_from_fork (814d395c)

kworker/u49:0-15748 [004] 256032.180403: sched_switch: 
prev_comm=kworker/u49:0 prev_pid=15748 prev_prio=120 prev_state=D ==> 
next_comm=kworker/4:1H next_pid=3921 next_prio=100
   kworker/u49:0-15748 [004] 256032.180408: kernel_stack: 
=> schedule (814ca379)
=> schedule_timeout (814c810d)
=> wait_for_completion (814ca2e5)
=> xfs_bmapi_allocate (811eea57)
=> xfs_bmapi_write (811eef33

3.10-rc4 stalls during mmap writes

2013-06-07 Thread Shawn Bohrer
I've started testing the 3.10 kernel, previously I was on 3.4, and I'm
encountering some fairly large stalls in my memory mapped writes in the
range of 0.01 to 1s.  I've managed to capture two of these stalls so
far and both looked like the following:

1) Writing process writes to a new page and blocks on xfs_ilock:

<...>-21567 [009]  9435.453069: sched_switch: prev_comm=tick_receiver_m 
prev_pid=21567 prev_prio=79 prev_state=D ==> next_comm=swapper/9 next_pid=0 
next_prio=120
<...>-21567 [009]  9435.453072: kernel_stack: 
=> schedule (814ca379)
=> rwsem_down_write_failed (814cb095)
=> call_rwsem_down_write_failed (81275053)
=> xfs_ilock (8120b25c)
=> xfs_vn_update_time (811cf3d3)
=> update_time (81158dd3)
=> file_update_time (81158f0c)
=> block_page_mkwrite (81171d23)
=> xfs_vm_page_mkwrite (811c5375)
=> do_wp_page (8110c27f)
=> handle_pte_fault (8110dd24)
=> handle_mm_fault (8110f430)
=> __do_page_fault (814cef72)
=> do_page_fault (814cf2e7)
=> page_fault (814cbab2)

2) kworker calls xfs_iunlock and wakes up my process:

   kworker/u50:1-403   [013]  9436.027354: sched_wakeup: 
comm=tick_receiver_m pid=21567 prio=79 success=1 target_cpu=009
   kworker/u50:1-403   [013]  9436.027359: kernel_stack: 
=> ttwu_do_activate.constprop.34 (8106c556)
=> try_to_wake_up (8106e996)
=> wake_up_process (8106ea87)
=> __rwsem_do_wake (8126e531)
=> rwsem_wake (8126e62a)
=> call_rwsem_wake (81275077)
=> xfs_iunlock (8120b55c)
=> xfs_iomap_write_allocate (811ce4e7)
=> xfs_map_blocks (811bf145)
=> xfs_vm_writepage (811bfbc2)
=> __writepage (810f14e7)
=> write_cache_pages (810f189e)
=> generic_writepages (810f1b3a)
=> xfs_vm_writepages (811bef8d)
=> do_writepages (810f3380)
=> __writeback_single_inode (81166ae5)
=> writeback_sb_inodes (81167d4d)
=> __writeback_inodes_wb (8116800e)
=> wb_writeback (811682bb)
=> wb_check_old_data_flush (811683ff)
=> wb_do_writeback (81169bd1)
=> bdi_writeback_workfn (81169cca)
=> process_one_work (81059104)
=> worker_thread (8105a1bc)
=> kthread (810605f0)
=> ret_from_fork (814d395c)

In this case my process stalled for roughly half a second:

<...>-21567 [009]  9436.027388: print:tracing_mark_write: stall 
of 0.574282
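
The "stall of 0.574282" line above is an annotation the test program
writes into the trace buffer itself; any process can add such markers by
writing to trace_marker, which makes it easy to line application-level
stalls up with the kernel events around them, e.g.:

# Emit an application-level marker into the ftrace buffer
# (debugfs mount point assumed).
echo "stall of 0.574282" > /sys/kernel/debug/tracing/trace_marker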

So I guess my question is does anyone know why I'm now seeing these
stalls with 3.10?  Are there any suggestions for how to eliminate them?

# xfs_info /home/
meta-data=/dev/sda5  isize=256agcount=4, agsize=67774016 blks
 =   sectsz=512   attr=2
data =   bsize=4096   blocks=271096064, imaxpct=5
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0
log  =internal   bsize=4096   blocks=132371, version=2
 =   sectsz=512   sunit=0 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0

# grep sda5 /proc/mounts 
/dev/sda5 /home xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,noquota 0 0

Thanks,
Shawn

-- 

---
This email, along with any attachments, is confidential. If you 
believe you received this message in error, please contact the 
sender immediately and delete all copies of the message.  
Thank you.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: old version of trace-cmd broken on 3.10 kernel

2013-05-31 Thread Shawn Bohrer
On Fri, May 31, 2013 at 01:05:35PM -0400, Steven Rostedt wrote:
> On Fri, 2013-05-31 at 11:50 -0500, Shawn Bohrer wrote:
> > Not sure if this is a big deal or not.  I've got an old version of
> > trace-cmd. It was built from git on 2012-09-12 but sadly I didn't
> > stash away the exact commit hash.  Anyway this version works fine on a
> > 3.4 kernel but on a 3.10-rc3 kernel it no longer works.  I just pulled
> > the latest trace-cmd from git and it works fine on the 3.10 kernel so
> > maybe this isn't an issue, but I don't typically expect applications
> > to break with a kernel upgrade.
> > 
> > When I run the old version on 3.10.0-rc3 I get the following output:
> 
> Yep, in 3.10 I fixed a long standing bug in the splice code, that when
> fixed, the old trace-cmd would fail.
> 
> I made a fix to all the stable releases of trace-cmd and posted it to
> LKML back on March 1st.
> 
> https://lkml.org/lkml/2013/3/1/596

Thanks Steve,

I'll just update my trace-cmd version.  It is about time for an update
anyway.
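
Purely as an illustration of what the recorder trips over: the output
above shows splice() returning "Interrupted system call" and the
recorder giving up on that buffer.  A generic way to tolerate that
return is to retry the splice on EINTR, roughly as sketched below; this
is an assumption-laden example, not the actual trace-cmd fix in the
patch linked above:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Move up to one page from a per-CPU trace_pipe_raw fd into the output file. */
static ssize_t splice_trace_page(int trace_fd, int out_fd)
{
	ssize_t ret;

	do {
		ret = splice(trace_fd, NULL, out_fd, NULL,
			     4096, SPLICE_F_MOVE);
	} while (ret < 0 && errno == EINTR);	/* retry instead of bailing out */

	if (ret < 0)
		perror("recorder error in splice input");
	return ret;
}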

Thanks,
Shawn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


old version of trace-cmd broken on 3.10 kernel

2013-05-31 Thread Shawn Bohrer
Not sure if this is a big deal or not.  I've got an old version of
trace-cmd. It was built from git on 2012-09-12 but sadly I didn't
stash away the exact commit hash.  Anyway this version works fine on a
3.4 kernel but on a 3.10-rc3 kernel it no longer works.  I just pulled
the latest trace-cmd from git and it works fine on the 3.10 kernel so
maybe this isn't an issue, but I don't typically expect applications
to break with a kernel upgrade.

When I run the old version on 3.10.0-rc3 I get the following output:

$ sudo trace-cmd record -e sched:sched_switch -e sched:sched_wakeup sleep 1
/sys/kernel/debug/tracing/events/sched/sched_wakeup/filter
/sys/kernel/debug/tracing/events/sched/sched_switch/filter
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
trace-cmd: Interrupted system call
  recorder error in splice input
Kernel buffer statistics:
  Note: "entries" are the entries left in the kernel ring buffer and are not
recorded in the trace data. They should all be zero.

CPU: 0
entries: 93
overrun: 0
commit overrun: 0
bytes: 5112
oldest event ts:  7013.613622
now ts:  7013.656459
dropped events: 0
read events: 607

CPU: 1
entries: 88
overrun: 0
commit overrun: 0
bytes: 4396
oldest event ts:  7013.638352
now ts:  7013.656515
dropped events: 0
read events: 960

CPU: 2
entries: 154
overrun: 0
commit overrun: 0
bytes: 7592
oldest event ts:  7013.583332
now ts:  7013.656572
dropped events: 0
read events: 1211

CPU: 3
entries: 150
overrun: 0
commit overrun: 0
bytes: 7324
oldest event ts:  7013.589348
now ts:  7013.656625
dropped events: 0
read events: 1303

CPU: 4
entries: 84
overrun: 0
commit overrun: 0
bytes: 5096
oldest event ts:  7013.621239
now ts:  7013.656677
dropped events: 0
read events: 1175

CPU: 5
entries: 146
overrun: 0
commit overrun: 0
bytes: 9016
oldest event ts:  7013.601549
now ts:  7013.656729
dropped events: 0
read events: 2204

CPU: 6
entries: 77
overrun: 0
commit overrun: 0
bytes: 4824
oldest event ts:  7013.651234
now ts:  7013.656781
dropped events: 0
read events: 2148

CPU: 7
entries: 109
overrun: 0
commit overrun: 0
bytes: 6600
oldest event ts:  7013.621326
now ts:  7013.656837
dropped events: 0
read events: 1672

CPU: 8
entries: 110
overrun: 0
commit overrun: 0
bytes: 6804
oldest event ts:  7013.603146
now ts:  7013.656891
dropped events: 0
read events: 2272

CPU: 9
entries: 142
overrun: 0
commit overrun: 0
bytes: 8496
oldest event ts:  7013.584943
now ts:  7013.656942
dropped events: 0
read events: 1521

CPU: 10
entries: 98
overrun: 0
commit overrun: 0
bytes: 6408
oldest event ts:  7013.617605
now ts:  7013.656995
dropped events: 0
read events: 1706

CPU: 11
entries: 293
overrun: 0
commit overrun: 0
bytes: 19208
oldest event ts:  7013.607094
now ts:  7013.657047
dropped events: 0
read events: 9236

CPU: 12
entries: 152
overrun: 0
commit overrun: 0
bytes: 7136
oldest event ts:  7013.582819
now ts:  7013.657099
dropped events: 0
read events: 1112

CPU: 13
entries: 86
overrun: 0
commit overrun: 0
bytes: 3928
oldest event ts:  7013.560591
now ts:  7013.657150
dropped events: 0
read events: 769

CPU: 14
entries: 85
overrun: 0
commit overrun: 0
bytes: 4076
oldest event ts:  7013.426020
now ts:  7013.657202
dropped events: 0
read events: 586

CPU: 15
entries: 211
overrun: 0
commit overrun: 0
bytes: 9568
oldest event ts:  7013.578705
now ts:  7013.657253
dropped events: 0
read events: 1578

CPU: 16
entries: 114
overrun: 0
commit overrun: 0
bytes: 7104
oldest event ts:  7013.626635
now ts:  7013.657304


Re: deadlock on vmap_area_lock

2013-05-02 Thread Shawn Bohrer
On Thu, May 02, 2013 at 08:03:04AM +1000, Dave Chinner wrote:
> On Wed, May 01, 2013 at 08:57:38AM -0700, David Rientjes wrote:
> > On Wed, 1 May 2013, Shawn Bohrer wrote:
> > 
> > > I've got two compute clusters with around 350 machines each which are
> > > running kernels based off of 3.1.9 (Yes I realize this is ancient by
> > > todays standards).
> 
> xfs_info output of one of those filesystems? What platform are you
> running (32 or 64 bit)?

# cat /proc/mounts | grep data-cache
/dev/sdb1 /data-cache xfs rw,nodiratime,relatime,attr2,delaylog,noquota 0 0
# xfs_info /data-cache 
meta-data=/dev/sdb1  isize=256    agcount=4, agsize=66705344 blks
 =   sectsz=512   attr=2
data =   bsize=4096   blocks=266821376, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0
log  =internal   bsize=4096   blocks=130283, version=2
 =   sectsz=512   sunit=0 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0

These are 64-bit systems.  The ones that hit the issue more frequently
have 96 GB of RAM.

> > > All of the machines run a 'find' command once an
> > > hour on one of the mounted XFS filesystems.  Occasionally these find
> > > commands get stuck requiring a reboot of the system.  I took a peek
> > > today and see this with perf:
> > > 
> > > 72.22%  find  [kernel.kallsyms]  [k] _raw_spin_lock
> > > |
> > > --- _raw_spin_lock
> > >|  
> > >|--98.84%-- vm_map_ram
> > >|  _xfs_buf_map_pages
> > >|  xfs_buf_get
> > >|  xfs_buf_read
> > >|  xfs_trans_read_buf
> > >|  xfs_da_do_buf
> > >|  xfs_da_read_buf
> > >|  xfs_dir2_block_getdents
> > >|  xfs_readdir
> > >|  xfs_file_readdir
> > >|  vfs_readdir
> > >|  sys_getdents
> > >|  system_call_fastpath
> > >|  __getdents64
> > >|  
> > >|--1.12%-- _xfs_buf_map_pages
> > >|  xfs_buf_get
> > >|  xfs_buf_read
> > >|  xfs_trans_read_buf
> > >|  xfs_da_do_buf
> > >|  xfs_da_read_buf
> > >|  xfs_dir2_block_getdents
> > >|  xfs_readdir
> > >|  xfs_file_readdir
> > >|  vfs_readdir
> > >|  sys_getdents
> > >|  system_call_fastpath
> > >|  __getdents64
> > > --0.04%-- [...]
> > > 
> > > Looking at the code my best guess is that we are spinning on
> > > vmap_area_lock, but I could be wrong.  This is the only process
> > > spinning on the machine so I'm assuming either another process has
> > > blocked while holding the lock, or perhaps this find process has tried
> > > to acquire the vmap_area_lock twice?
> > > 
> > 
> > Significant spinlock contention doesn't necessarily mean that there's a 
> > deadlock, but it also doesn't mean the opposite.  Depending on your 
> > definition of "occassionally", would it be possible to run with 
> > CONFIG_PROVE_LOCKING and CONFIG_LOCKDEP to see if it uncovers any real 
> > deadlock potential?
> 
> It sure will. We've been reporting that vm_map_ram is doing
> GFP_KERNEL allocations from GFP_NOFS context for years, and have
> reported plenty of lockdep dumps as a result of it.
> 
> But that's not the problem that is occurring above - lockstat is
> probably a good thing to look at here to determine exactly what
> locks are being contended on

I've built a kernel with lock_stat, CONFIG_PROVE_LOCKING,
CONFIG_LOCKDEP and have one machine running with that kernel.  We'll
probably put machines on this debug kernel when we reboot them and
hopefully one will trigger the issue.
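
With that lock_stat kernel the contention data can be read back without
waiting for lockdep to trip.  Roughly (the grep pattern is just an
example, not output we have yet):

# reset and enable lock statistics
echo 0 > /proc/lock_stat
echo 1 > /proc/sys/kernel/lock_stat

# ... wait for a find to get stuck ...

# dump contention/hold statistics for the suspected lock
grep -A 5 "vmap_area_lock" /proc/lock_stat

The waittime and holdtime columns there should show whether
vmap_area_lock is merely contended or is being held for abnormally long
periods.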


Re: deadlock on vmap_area_lock

2013-05-01 Thread Shawn Bohrer
On Wed, May 01, 2013 at 08:57:38AM -0700, David Rientjes wrote:
> On Wed, 1 May 2013, Shawn Bohrer wrote:
> 
> > I've got two compute clusters with around 350 machines each which are
> > running kernels based off of 3.1.9 (Yes I realize this is ancient by
> > todays standards).  All of the machines run a 'find' command once an
> > hour on one of the mounted XFS filesystems.  Occasionally these find
> > commands get stuck requiring a reboot of the system.  I took a peek
> > today and see this with perf:
> > 
> > 72.22%  find  [kernel.kallsyms]  [k] _raw_spin_lock
> > |
> > --- _raw_spin_lock
> >|  
> >|--98.84%-- vm_map_ram
> >|  _xfs_buf_map_pages
> >|  xfs_buf_get
> >|  xfs_buf_read
> >|  xfs_trans_read_buf
> >|  xfs_da_do_buf
> >|  xfs_da_read_buf
> >|  xfs_dir2_block_getdents
> >|  xfs_readdir
> >|  xfs_file_readdir
> >|  vfs_readdir
> >|  sys_getdents
> >|  system_call_fastpath
> >|  __getdents64
> >|  
> >|--1.12%-- _xfs_buf_map_pages
> >|  xfs_buf_get
> >|  xfs_buf_read
> >|  xfs_trans_read_buf
> >|  xfs_da_do_buf
> >|  xfs_da_read_buf
> >|  xfs_dir2_block_getdents
> >|  xfs_readdir
> >|  xfs_file_readdir
> >|  vfs_readdir
> >|  sys_getdents
> >|  system_call_fastpath
> >|  __getdents64
> > --0.04%-- [...]
> > 
> > Looking at the code my best guess is that we are spinning on
> > vmap_area_lock, but I could be wrong.  This is the only process
> > spinning on the machine so I'm assuming either another process has
> > blocked while holding the lock, or perhaps this find process has tried
> > to acquire the vmap_area_lock twice?
> > 
> 
> Significant spinlock contention doesn't necessarily mean that there's a 
> deadlock, but it also doesn't mean the opposite.

Correct, it doesn't, and I can't prove the find command is not making
progress; however, these finds normally complete in under 15 min and
we've let the stuck ones run for days.  Additionally, if this were just
contention I'd expect to see multiple threads/CPUs contending, and I
only have a single CPU pegged running find at 99%.  I should clarify
that the perf snippet above was for the entire system.  Profiling just
the find command shows:

82.56% find  [kernel.kallsyms]  [k] _raw_spin_lock
16.63% find  [kernel.kallsyms]  [k] vm_map_ram
 0.13% find  [kernel.kallsyms]  [k] hrtimer_interrupt
 0.04% find  [kernel.kallsyms]  [k] update_curr
 0.03% find  [igb]  [k] igb_poll
 0.03% find  [kernel.kallsyms]  [k] irqtime_account_process_tick
 0.03% find  [kernel.kallsyms]  [k] account_system_vtime
 0.03% find  [kernel.kallsyms]  [k] task_tick_fair
 0.03% find  [kernel.kallsyms]  [k] perf_event_task_tick
 0.03% find  [kernel.kallsyms]  [k] scheduler_tick
 0.03% find  [kernel.kallsyms]  [k] rb_erase
 0.02% find  [kernel.kallsyms]  [k] native_write_msr_safe
 0.02% find  [kernel.kallsyms]  [k] native_sched_clock
 0.02% find  [kernel.kallsyms]  [k] dma_issue_pending_all
 0.02% find  [kernel.kallsyms]  [k] handle_irq_event_percpu
 0.02% find  [kernel.kallsyms]  [k] timerqueue_del
 0.02% find  [kernel.kallsyms]  [k] run_timer_softirq
 0.02% find  [kernel.kallsyms]  [k] get_mm_counter
 0.02% find  [kernel.kallsyms]  [k] __rcu_pending
 0.02% find  [kernel.kallsyms]  [k] tick_program_event
 0.01% find  [kernel.kallsyms]  [k] __netif_receive_skb
 0.01% find  [kernel.kallsyms]  [k] ip_route_input_common
 0.01% find  [kernel.kallsyms]  [k] __insert_vmap_area
 0.01% find  [igb]  [k] igb_alloc_rx_buffers_adv
 0.01% find  [kernel.kallsyms]  [k] irq_exit
 0.01% find  [kernel.kallsyms]  [k] acct_update_integrals
 0.01% find  [kernel.kallsyms]  [k] apic_timer_interrupt
 0.01% find  [kernel.kallsyms]  [k] tick_sched_timer
 0.01% find  [kernel.kallsyms]  [k] __remove_hrtimer
 0.01% find  [kernel.kallsyms]  [k] do_IRQ
 0.01% find  [kernel.kallsyms]  [k] dev_gro_receive
 0.01% find  [kernel.kallsyms]  [k] net_rx_action
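
For reference, a single-process profile like the one above can be
captured with something along these lines; the pgrep lookup and the
10 second window are illustrative choices, not the exact commands used:

# profile only the stuck find process for 10 seconds
perf record -g -p $(pgrep -o find) -- sleep 10
perf report --stdio | head -n 40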

deadlock on vmap_area_lock

2013-05-01 Thread Shawn Bohrer
I've got two compute clusters with around 350 machines each which are
running kernels based off of 3.1.9 (Yes I realize this is ancient by
today's standards).  All of the machines run a 'find' command once an
hour on one of the mounted XFS filesystems.  Occasionally these find
commands get stuck requiring a reboot of the system.  I took a peek
today and see this with perf:

72.22%  find  [kernel.kallsyms]  [k] _raw_spin_lock
|
--- _raw_spin_lock
   |  
   |--98.84%-- vm_map_ram
   |  _xfs_buf_map_pages
   |  xfs_buf_get
   |  xfs_buf_read
   |  xfs_trans_read_buf
   |  xfs_da_do_buf
   |  xfs_da_read_buf
   |  xfs_dir2_block_getdents
   |  xfs_readdir
   |  xfs_file_readdir
   |  vfs_readdir
   |  sys_getdents
   |  system_call_fastpath
   |  __getdents64
   |  
   |--1.12%-- _xfs_buf_map_pages
   |  xfs_buf_get
   |  xfs_buf_read
   |  xfs_trans_read_buf
   |  xfs_da_do_buf
   |  xfs_da_read_buf
   |  xfs_dir2_block_getdents
   |  xfs_readdir
   |  xfs_file_readdir
   |  vfs_readdir
   |  sys_getdents
   |  system_call_fastpath
   |  __getdents64
--0.04%-- [...]

Looking at the code my best guess is that we are spinning on
vmap_area_lock, but I could be wrong.  This is the only process
spinning on the machine so I'm assuming either another process has
blocked while holding the lock, or perhaps this find process has tried
to acquire the vmap_area_lock twice?

I've skimmed through the change logs between 3.1 and 3.9 but nothing
stood out as a fix for this bug.  Does this ring a bell with anyone?  If
I have a machine that is currently in one of these stuck states does
anyone have any tips to identifying the processes currently holding
the lock?

Additionally as I mentioned before I have two clusters of roughly
equal size though one cluster hits this issue more frequently.  On
that cluster with approximately 350 machines we get about 10 stuck
machines a month.  The other cluster has about 450 machines but we
only get about 1 or 2 stuck machines a month.  Both clusters run the
same find command every hour, but the workloads on the machines are
different.  The cluster that hits the issue more frequently tends to
run more memory intensive jobs.

I'm open to building some debug kernels to help track this down,
though I can't upgrade all of the machines in one shot so it may take
a while to reproduce.  I'm happy to provide any other information if
people have questions.

Thanks,
Shawn

-- 

---
This email, along with any attachments, is confidential. If you 
believe you received this message in error, please contact the 
sender immediately and delete all copies of the message.  
Thank you.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.7 HDMI channel map regression

2013-02-17 Thread Shawn Bohrer
On Sun, Feb 17, 2013 at 09:34:53AM +0100, Takashi Iwai wrote:
> At Sat, 16 Feb 2013 18:22:25 -0600,
> Shawn Bohrer wrote:
> > 
> > On Mon, Jan 28, 2013 at 08:52:05PM -0600, Shawn Bohrer wrote:
> > > On Mon, Jan 28, 2013 at 09:56:33AM +0100, Takashi Iwai wrote:
> > > > At Sun, 27 Jan 2013 19:18:27 -0600,
> > > > Shawn Bohrer wrote:
> > > > > 
> > > > > Hi Takashi,
> > > > > 
> > > > > I recently updated my HTPC from 3.6.11 to 3.7.2 and this caused my RL
> > > > > and FC channels to swap, and my RR and LFE channels to swap for PCM
> > > > > audio.  Doing a git bisect identified
> > > > > d45e6889ee69456a4d5b1bbb32252f460cd48fa9 "ALSA: hda - Provide the
> > > > > proper channel mapping for generic HDMI driver" as the commit that
> > > > > caused my channels to swap.  The commit doesn't revert cleanly on
> > > > > 3.7.4, and I haven't really looked to see what the correct fix might
> > > > > be.
> > > > > 
> > > > > Some info that may be relevant, the sound card is a:
> > > > > 
> > > > > 00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset
> > > > > Family High Definition Audio Controller (rev 04)
> > > > > 
> > > > > The machine is running Fedora 18 and audio goes over HDMI to a 5.1
> > > > > receiver.  I'm not really sure what other info you might need, but
> > > > > let me know if you need something else or have any patches you would
> > > > > like me to test.
> > > > 
> > > > OK, it's the first time to get a bug report about this.
> > > > Could you tell me how did you test it (i.e. which application, which
> > > > sound backend)?  Can you confirm that it's reproduced via speaker-test
> > > > program in alsa-utils package?
> > > 
> > > I originally noticed the problem when all of the dialog started coming
> > > out of my rear left speaker in MythTV after the kernel update.  Then I
> > > started using the Gnome 3 sound configuration gui in the system
> > > settings which has a speaker test and I assume is using pulseaudio.
> > > Running 'speaker-test -c6 -l1 -twav' also reproduces the problem.
> > > 
> > > For reference here are the versions of the various packages that I'm
> > > running:
> > > 
> > > alsa-utils-1.0.26-1.fc18.x86_64
> > > alsa-firmware-1.0.25-2.fc18.noarch
> > > alsa-plugins-pulseaudio-1.0.26-2.fc18.x86_64
> > > alsa-lib-devel-1.0.26-2.fc18.x86_64
> > > alsa-lib-1.0.26-2.fc18.x86_64
> > > alsa-tools-firmware-1.0.26.1-1.fc18.x86_64
> > > pulseaudio-gdm-hooks-2.1-5.fc18.x86_64
> > > pulseaudio-libs-2.1-5.fc18.x86_64
> > > pulseaudio-libs-glib2-2.1-5.fc18.x86_64
> > > pulseaudio-module-x11-2.1-5.fc18.x86_64
> > > pulseaudio-module-bluetooth-2.1-5.fc18.x86_64
> > > pulseaudio-2.1-5.fc18.x86_64
> > > pulseaudio-utils-2.1-5.fc18.x86_64
> > 
> > Hi Takashi,
> > 
> > Any updates on this issue?  I'd really like to see this issue fixed
> > and am happy to help in any way I can.  Until this gets fixed I'm
> > stuck on a 3.6.* kernel.
> 
> There is one fix in sound git tree regarding the HDMI channel map, but
> it's queued for 3.9 kernel (then backported to stable tree).
> Try sound.git tree or wait for a while until the upstream merge
> process above is done.

Thanks Takashi, I just tested the sound.git master branch and this
does indeed fix my issue.  I'll look forward to this going into 3.9.

--
Shawn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.7 HDMI channel map regression

2013-02-16 Thread Shawn Bohrer
On Mon, Jan 28, 2013 at 08:52:05PM -0600, Shawn Bohrer wrote:
> On Mon, Jan 28, 2013 at 09:56:33AM +0100, Takashi Iwai wrote:
> > At Sun, 27 Jan 2013 19:18:27 -0600,
> > Shawn Bohrer wrote:
> > > 
> > > Hi Takashi,
> > > 
> > > I recently updated my HTPC from 3.6.11 to 3.7.2 and this caused my RL
> > > and FC channels to swap, and my RR and LFE channels to swap for PCM
> > > audio.  Doing a git bisect identified
> > > d45e6889ee69456a4d5b1bbb32252f460cd48fa9 "ALSA: hda - Provide the
> > > proper channel mapping for generic HDMI driver" as the commit that
> > > caused my channels to swap.  The commit doesn't revert cleanly on
> > > 3.7.4, and I haven't really looked to see what the correct fix might
> > > be.
> > > 
> > > Some info that may be relevant, the sound card is a:
> > > 
> > > 00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset
> > > Family High Definition Audio Controller (rev 04)
> > > 
> > > The machine is running Fedora 18 and audio goes over HDMI to a 5.1
> > > receiver.  I'm not really sure what other info you might need, but
> > > let me know if you need something else or have any patches you would
> > > like me to test.
> > 
> > OK, it's the first time to get a bug report about this.
> > Could you tell me how did you test it (i.e. which application, which
> > sound backend)?  Can you confirm that it's reproduced via speaker-test
> > program in alsa-utils package?
> 
> I originally noticed the problem when all of the dialog started coming
> out of my rear left speaker in MythTV after the kernel update.  Then I
> started using the Gnome 3 sound configuration gui in the system
> settings which has a speaker test and I assume is using pulseaudio.
> Running 'speaker-test -c6 -l1 -twav' also reproduces the problem.
> 
> For reference here are the versions of the various packages that I'm
> running:
> 
> alsa-utils-1.0.26-1.fc18.x86_64
> alsa-firmware-1.0.25-2.fc18.noarch
> alsa-plugins-pulseaudio-1.0.26-2.fc18.x86_64
> alsa-lib-devel-1.0.26-2.fc18.x86_64
> alsa-lib-1.0.26-2.fc18.x86_64
> alsa-tools-firmware-1.0.26.1-1.fc18.x86_64
> pulseaudio-gdm-hooks-2.1-5.fc18.x86_64
> pulseaudio-libs-2.1-5.fc18.x86_64
> pulseaudio-libs-glib2-2.1-5.fc18.x86_64
> pulseaudio-module-x11-2.1-5.fc18.x86_64
> pulseaudio-module-bluetooth-2.1-5.fc18.x86_64
> pulseaudio-2.1-5.fc18.x86_64
> pulseaudio-utils-2.1-5.fc18.x86_64

Hi Takashi,

Any updates on this issue?  I'd really like to see this issue fixed
and am happy to help in any way I can.  Until this gets fixed I'm
stuck on a 3.6.* kernel.

Thanks,
Shawn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.7 HDMI channel map regression

2013-01-28 Thread Shawn Bohrer
On Mon, Jan 28, 2013 at 09:56:33AM +0100, Takashi Iwai wrote:
> At Sun, 27 Jan 2013 19:18:27 -0600,
> Shawn Bohrer wrote:
> > 
> > Hi Takashi,
> > 
> > I recently updated my HTPC from 3.6.11 to 3.7.2 and this caused my RL
> > and FC channels to swap, and my RR and LFE channels to swap for PCM
> > audio.  Doing a git bisect identified
> > d45e6889ee69456a4d5b1bbb32252f460cd48fa9 "ALSA: hda - Provide the
> > proper channel mapping for generic HDMI driver" as the commit that
> > caused my channels to swap.  The commit doesn't revert cleanly on
> > 3.7.4, and I haven't really looked to see what the correct fix might
> > be.
> > 
> > Some info that may be relevant, the sound card is a:
> > 
> > 00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset
> > Family High Definition Audio Controller (rev 04)
> > 
> > The machine is running Fedora 18 and audio goes over HDMI to a 5.1
> > receiver.  I'm not really sure what other info you might need, but
> > let me know if you need something else or have any patches you would
> > like me to test.
> 
> OK, it's the first time to get a bug report about this.
> Could you tell me how did you test it (i.e. which application, which
> sound backend)?  Can you confirm that it's reproduced via speaker-test
> program in alsa-utils package?

I originally noticed the problem when all of the dialog started coming
out of my rear left speaker in MythTV after the kernel update.  Then I
started using the Gnome 3 sound configuration gui in the system
settings which has a speaker test and I assume is using pulseaudio.
Running 'speaker-test -c6 -l1 -twav' also reproduces the problem.

For reference here are the versions of the various packages that I'm
running:

alsa-utils-1.0.26-1.fc18.x86_64
alsa-firmware-1.0.25-2.fc18.noarch
alsa-plugins-pulseaudio-1.0.26-2.fc18.x86_64
alsa-lib-devel-1.0.26-2.fc18.x86_64
alsa-lib-1.0.26-2.fc18.x86_64
alsa-tools-firmware-1.0.26.1-1.fc18.x86_64
pulseaudio-gdm-hooks-2.1-5.fc18.x86_64
pulseaudio-libs-2.1-5.fc18.x86_64
pulseaudio-libs-glib2-2.1-5.fc18.x86_64
pulseaudio-module-x11-2.1-5.fc18.x86_64
pulseaudio-module-bluetooth-2.1-5.fc18.x86_64
pulseaudio-2.1-5.fc18.x86_64
pulseaudio-utils-2.1-5.fc18.x86_64

> For further debugging, please give the following:
> - alsa-info.sh output while playing 5.1 sound

upload=true&script=true&cardinfo=
!!
!!ALSA Information Script v 0.4.60
!!

!!Script ran on: Tue Jan 29 02:39:13 UTC 2013


!!Linux Distribution
!!--

Fedora release 18 (Spherical Cow) Fedora release 18 (Spherical Cow) NAME=Fedora 
ID=fedora PRETTY_NAME="Fedora 18 (Spherical Cow)" 
CPE_NAME="cpe:/o:fedoraproject:fedora:18" Fedora release 18 (Spherical Cow) 
Fedora release 18 (Spherical Cow)


!!DMI Information
!!---

Manufacturer:  To Be Filled By O.E.M.
Product Name:  To Be Filled By O.E.M.
Product Version:   To Be Filled By O.E.M.


!!Kernel Information
!!--

Kernel release:3.7.2-204.fc18.x86_64
Operating System:  GNU/Linux
Architecture:  x86_64
Processor: x86_64
SMP Enabled:   Yes


!!ALSA Version
!!

Driver version: k3.7.2-204.fc18.x86_64
Library version:1.0.26
Utilities version:  1.0.26


!!Loaded ALSA modules
!!---

snd_hda_intel


!!Sound Servers on this system
!!

Pulseaudio:
  Installed - Yes (/usr/bin/pulseaudio)
  Running - Yes

Jack:
  Installed - Yes (/usr/bin/jackd)
  Running - No


!!Soundcards recognised by ALSA
!!-

 0 [PCH]: HDA-Intel - HDA Intel PCH
  HDA Intel PCH at 0xf7d1 irq 46


!!PCI Soundcards installed in the system
!!--

00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family 
High Definition Audio Controller (rev 04)


!!Advanced information - PCI Vendor/Device/Subsystem ID's
!!

00:1b.0 0403: 8086:1e20 (rev 04)
Subsystem: 1849:1898


!!Loaded sound module options
!!--

!!Module: snd_hda_intel
align_buffer_size : -1
bdl_pos_adj : 
1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
beep_mode : 
N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N
enable : Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y
enable_msi : -1
id : 
(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null)
index : 
-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
model : 
(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null


3.7 HDMI channel map regression

2013-01-27 Thread Shawn Bohrer
Hi Takashi,

I recently updated my HTPC from 3.6.11 to 3.7.2 and this caused my RL
and FC channels to swap, and my RR and LFE channels to swap for PCM
audio.  Doing a git bisect identified
d45e6889ee69456a4d5b1bbb32252f460cd48fa9 "ALSA: hda - Provide the
proper channel mapping for generic HDMI driver" as the commit that
caused my channels to swap.  The commit doesn't revert cleanly on
3.7.4, and I haven't really looked to see what the correct fix might
be.

Some info that may be relevant, the sound card is a:

00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset
Family High Definition Audio Controller (rev 04)

The machine is running Fedora 18 and audio goes over HDMI to a 5.1
receiver.  I'm not really sure what other info you might need, but
let me know if you need something else or have any patches you would
like me to test.

Thanks,
Shawn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched_rt: Use root_domain of rt_rq not current processor

2013-01-14 Thread Shawn Bohrer
When the system has multiple domains, do_sched_rt_period_timer() can run
on any CPU and may iterate over all rt_rq in cpu_online_mask.  This
means when balance_runtime() is run for a given rt_rq that rt_rq may be
in a different rd than the current processor.  Thus if we use
smp_processor_id() to get rd in do_balance_runtime() we may borrow
runtime from a rt_rq that is not part of our rd.

This changes do_balance_runtime to get the rd from the passed in rt_rq
ensuring that we borrow runtime only from the correct rd for the given
rt_rq.

This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime when we
try to reclaim runtime lent to other rt_rqs but the runtime has been lent
to an rt_rq in another rd.

Signed-off-by: Shawn Bohrer 
---
 kernel/sched/rt.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 418feb0..4f02b28 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -566,7 +566,7 @@ static inline struct rt_bandwidth 
*sched_rt_bandwidth(struct rt_rq *rt_rq)
 static int do_balance_runtime(struct rt_rq *rt_rq)
 {
struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
-   struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
+   struct root_domain *rd = rq_of_rt_rq(rt_rq)->rd;
int i, weight, more = 0;
u64 rt_period;
 
-- 
1.7.7.6


-- 

---
This email, along with any attachments, is confidential. If you 
believe you received this message in error, please contact the 
sender immediately and delete all copies of the message.  
Thank you.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-10 Thread Shawn Bohrer
On Thu, Jan 10, 2013 at 05:13:11AM +0100, Mike Galbraith wrote:
> On Tue, 2013-01-08 at 09:01 -0600, Shawn Bohrer wrote: 
> > On Tue, Jan 08, 2013 at 09:36:05AM -0500, Steven Rostedt wrote:
> > > > 
> > > > I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
> > > > is still present in the latest kernel.
> > > 
> > > Shawn,
> > > 
> > > Can you send me your .config file.
> > 
> > I've attached the 3.8.0-rc2 config that I used to reproduce this in an
> > 8 core kvm image.  Let me know if you need anything else.
> 
> I tried beating on my little Q6600 with no success.  I even tried
> setting the entire box rt, GUI and all, nada.
> 
> Hm, maybe re-installing systemd..

I don't know if Steve has had any success.  I can reproduce this easily
now so I'm happy to do some debugging if anyone has some things they
want me to try.

Here is some info on my setup at the moment.  I'm using an 8 core KVM
image now with an xfs file system.  We do use systemd if that is
relevant.  My cpuset controller is mounted on /cgroup/cpuset and we
use libcgroup-tools to move everything on the system that can be moved
into /cgroup/cpuset/sysdefault/.  I've also boosted all kworker threads
to run as SCHED_FIFO with a priority of 51.  From there I just drop
the three attached shell scripts (burn.sh, sched_domain_bug.sh and
sched_domain_burn.sh) in /root/ and run /root/sched_domain_bug.sh as
root.  Usually the bug triggers in less than a minute.  You may need
to tweak my shell scripts if your setup is different but they are very
rudimentary.
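
Something like this can watch each runqueue's rt_runtime while the
scripts run, which makes the borrowing between runqueues visible
(assumes CONFIG_SCHED_DEBUG=y):

# watch -n1 "grep -A4 'rt_rq\[' /proc/sched_debug"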

In order to try digging up some more info I applied the following
patch, and triggered the bug a few times.  The results are always
essentially the same:

---
 kernel/sched/rt.c |9 -
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 418feb0..fba7f01 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -650,6 +650,8 @@ static void __disable_runtime(struct rq *rq)
 * we lend and now have to reclaim.
 */
want = rt_b->rt_runtime - rt_rq->rt_runtime;
+   printk(KERN_INFO "Initial want: %lld rt_b->rt_runtime: %llu 
rt_rq->rt_runtime: %llu\n",
+  want, rt_b->rt_runtime, rt_rq->rt_runtime);
 
/*
 * Greedy reclaim, take back as much as we can.
@@ -684,7 +686,12 @@ static void __disable_runtime(struct rq *rq)
 * We cannot be left wanting - that would mean some runtime
 * leaked out of the system.
 */
-   BUG_ON(want);
+   if (want) {
+   printk(KERN_ERR "BUG triggered, want: %lld\n", want);
+   for_each_cpu(i, rd->span) {
+   print_rt_stats(NULL, i);
+   }
+   }
 balanced:
/*
 * Disable all the borrow logic by pretending we have inf
---

Here is the output:

[   81.278842] SysRq : Changing Loglevel
[   81.279027] Loglevel set to 9
[   83.285456] Initial want: 5000 rt_b->rt_runtime: 95000 
rt_rq->rt_runtime: 9
[   85.286452] Initial want: 5000 rt_b->rt_runtime: 95000 
rt_rq->rt_runtime: 9
[   85.289625] Initial want: 5000 rt_b->rt_runtime: 95000 
rt_rq->rt_runtime: 9
[   87.287435] Initial want: 1 rt_b->rt_runtime: 95000 
rt_rq->rt_runtime: 85000
[   87.290718] Initial want: 5000 rt_b->rt_runtime: 95000 
rt_rq->rt_runtime: 9
[   89.288469] Initial want: -5000 rt_b->rt_runtime: 95000 
rt_rq->rt_runtime: 10
[   89.291550] Initial want: 15000 rt_b->rt_runtime: 95000 
rt_rq->rt_runtime: 8
[   89.292940] Initial want: 1 rt_b->rt_runtime: 95000 
rt_rq->rt_runtime: 85000
[   89.294082] Initial want: 1 rt_b->rt_runtime: 95000 
rt_rq->rt_runtime: 85000
[   89.295194] Initial want: 5000 rt_b->rt_runtime: 95000 
rt_rq->rt_runtime: 9
[   89.296274] Initial want: 5000 rt_b->rt_runtime: 95000 
rt_rq->rt_runtime: 9
[   90.959004] [sched_delayed] sched: RT throttling activated
[   91.289470] Initial want: 2 rt_b->rt_runtime: 95000 
rt_rq->rt_runtime: 75000
[   91.292767] Initial want: 2 rt_b->rt_runtime: 95000 
rt_rq->rt_runtime: 75000
[   91.294037] Initial want: 2 rt_b->rt_runtime: 95000 
rt_rq->rt_runtime: 75000
[   91.295364] Initial want: 2 rt_b->rt_runtime: 95000 
rt_rq->rt_runtime: 75000
[   91.296355] BUG triggered, want: 2
[   91.296355] 
[   91.296355] rt_rq[7]:
[   91.296355]   .rt_nr_running : 0
[   91.296355]   .rt_throttled  : 0
[   91.296355]   .rt_time   : 0.00
[   91.296355]   .rt_runtime: 750.00
[   91.307332] Initial want: -5000 rt_b->rt_runtime: 95000 
rt_rq->rt_runtime: 10
[   91.308440] Initial want: -1 rt_b->rt_runtime: 95000 
rt_rq


Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-08 Thread Shawn Bohrer
On Tue, Jan 08, 2013 at 09:36:05AM -0500, Steven Rostedt wrote:
> > 
> > I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
> > is still present in the latest kernel.
> 
> Shawn,
> 
> Can you send me your .config file.

I've attached the 3.8.0-rc2 config that I used to reproduce this in an
8 core kvm image.  Let me know if you need anything else.

Thanks,
Shawn

-- 

---
This email, along with any attachments, is confidential. If you 
believe you received this message in error, please contact the 
sender immediately and delete all copies of the message.  
Thank you.


3.8.0-rc2.config.gz
Description: GNU Zip compressed data


Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-07 Thread Shawn Bohrer
On Mon, Jan 07, 2013 at 11:58:18AM -0600, Shawn Bohrer wrote:
> On Sat, Jan 05, 2013 at 11:46:32AM -0600, Shawn Bohrer wrote:
> > I've tried reproducing the issue, but so far I've been unsuccessful
> > but I believe that is because my RT tasks aren't using enough CPU
> > to cause borrowing from the other runqueues.  Normally our RT tasks use
> > very little CPU so I'm not entirely sure what conditions caused them
> > to run into throttling on the day that this happened.
> 
> I've managed to reproduce this a couple times now on 3.1.9 I'll give
> this a try later with a more recent kernel.  Here is what I've done to
> reproduce the issue.
> 
> 
> # Setup in shell 1
> root@berbox39:/cgroup/cpuset# mkdir package0
> root@berbox39:/cgroup/cpuset# echo 0 > package0/cpuset.mems
> root@berbox39:/cgroup/cpuset# echo 0,2,4,6 > package0/cpuset.cpus
> root@berbox39:/cgroup/cpuset# cat cpuset.sched_load_balance
> 1
> root@berbox39:/cgroup/cpuset# cat package0/cpuset.sched_load_balance
> 1
> root@berbox39:/cgroup/cpuset# cat sysdefault/cpuset.sched_load_balance
> 1
> root@berbox39:/cgroup/cpuset# echo 1,3,5,7 > sysdefault/cpuset.cpus
> root@berbox39:/cgroup/cpuset# echo 0 > sysdefault/cpuset.mems
> root@berbox39:/cgroup/cpuset# echo $$ > package0/tasks
> 
> # Setup in shell 2
> root@berbox39:~# cd /cgroup/cpuset/
> root@berbox39:/cgroup/cpuset# chrt -f -p 60 $$
> root@berbox39:/cgroup/cpuset# echo $$ > sysdefault/tasks
> 
> # In shell 1
> root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &
> root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &
> 
> # In shell 2
> root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
> root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance
> root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
> root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance
> 
> I haven't found the exact magic combination but I've been going back
> and forth adding/killing burn.sh processes and toggling
> cpuset.sched_load_balance and in a couple of minutes I can usually get
> the machine to trigger the bug.

I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
is still present in the latest kernel.

Also just re-reading my instructions above /root/burn.sh is just a
simple:

while true; do : ; done

I've also had to make the kworker threads SCHED_FIFO with a higher
priority than burn.sh or as expected I can lock up the system due to
some xfs threads getting starved.
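
Roughly, something along these lines does that boost (the exact
priority just needs to be above burn.sh's):

# for pid in $(pgrep kworker); do chrt -f -p 51 "$pid"; done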

Let me know if anyone needs any more information, or needs me to try
anything since it appears I can trigger this fairly easily now.

--
Shawn

-- 

---
This email, along with any attachments, is confidential. If you 
believe you received this message in error, please contact the 
sender immediately and delete all copies of the message.  
Thank you.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-07 Thread Shawn Bohrer
On Sat, Jan 05, 2013 at 11:46:32AM -0600, Shawn Bohrer wrote:
> I've tried reproducing the issue, but so far I've been unsuccessful
> but I believe that is because my RT tasks aren't using enough CPU
> to cause borrowing from the other runqueues.  Normally our RT tasks use
> very little CPU so I'm not entirely sure what conditions caused them
> to run into throttling on the day that this happened.

I've managed to reproduce this a couple times now on 3.1.9 I'll give
this a try later with a more recent kernel.  Here is what I've done to
reproduce the issue.


# Setup in shell 1
root@berbox39:/cgroup/cpuset# mkdir package0
root@berbox39:/cgroup/cpuset# echo 0 > package0/cpuset.mems
root@berbox39:/cgroup/cpuset# echo 0,2,4,6 > package0/cpuset.cpus
root@berbox39:/cgroup/cpuset# cat cpuset.sched_load_balance
1
root@berbox39:/cgroup/cpuset# cat package0/cpuset.sched_load_balance
1
root@berbox39:/cgroup/cpuset# cat sysdefault/cpuset.sched_load_balance
1
root@berbox39:/cgroup/cpuset# echo 1,3,5,7 > sysdefault/cpuset.cpus
root@berbox39:/cgroup/cpuset# echo 0 > sysdefault/cpuset.mems
root@berbox39:/cgroup/cpuset# echo $$ > package0/tasks

# Setup in shell 2
root@berbox39:~# cd /cgroup/cpuset/
root@berbox39:/cgroup/cpuset# chrt -f -p 60 $$
root@berbox39:/cgroup/cpuset# echo $$ > sysdefault/tasks

# In shell 1
root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &
root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &

# In shell 2
root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance
root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance

I haven't found the exact magic combination but I've been going back
and forth adding/killing burn.sh processes and toggling
cpuset.sched_load_balance and in a couple of minutes I can usually get
the machine to trigger the bug.

--
Shawn

-- 

---
This email, along with any attachments, is confidential. If you 
believe you received this message in error, please contact the 
sender immediately and delete all copies of the message.  
Thank you.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


kernel BUG at kernel/sched_rt.c:493!

2013-01-05 Thread Shawn Bohrer
We recently managed to crash 10 of our test machines at the same time.
Half of the machines were running a 3.1.9 kernel and half were running
3.4.9.  I realize that these are both fairly old kernels but I've
skimmed the list of fixes in the 3.4.* stable series and didn't see
anything that appeared to be relevant to this issue.

All we managed to get was some screenshots of the stacks from the
consoles. On one of the 3.1.9 machines you can see we hit the
BUG_ON(want) statement in __disable_runtime() at
kernel/sched_rt.c:493, and all of the machines had essentially the
same stack showing:

rq_offline_rt
rq_attach_root
cpu_attach_domain
partition_sched_domains
do_rebuild_sched_domains

Here is one of the screenshots of the 3.1.9 machines:

https://dl.dropbox.com/u/84066079/berbox38.png

And here is one from a 3.4.9 machine:

https://dl.dropbox.com/u/84066079/berbox18.png

Three of the five 3.4.9 machines also managed to print
"[sched_delayed] sched: RT throttling activated" ~7 minutes before the
machines locked up.

I've tried reproducing the issue, but so far I've been unsuccessful
but I believe that is because my RT tasks aren't using enough CPU
to cause borrowing from the other runqueues.  Normally our RT tasks use
very little CPU so I'm not entirely sure what conditions caused them
to run into throttling on the day that this happened.

The details that I do know about the workload that caused this are as
follows.

1) These are all dual socket 4 core X5460 systems with no
hyperthreading.  Thus there are 8 cores total in the system.
2) We use the cpuset cgroup to apply CPU affinity to various types of
processes.  Initially everything starts out in a single cpuset and the
top level cpuset has cpuset.sched_load_balance=1 thus there is only a
single scheduling domain.
3) In this case tasks were then placed into four non overlapping
cpusets.  1 containing a single core and single SCHED_FIFO task, 2
containing two cores, and multiple SCHED_FIFO tasks, and 1 containing
3 cores and everything else on the system running as SCHED_OTHER.
4) In the case of cpusets that contain SCHED_FIFO tasks, the tasks
start out as SCHED_OTHER are placed into the cpuset then change their
policy to SCHED_FIFO.
5) Once all tasks are placed into non overlapping cpusets the top
level cpuset.sched_load_balance is set to 0 to split the system into
four scheduling domains.
6) The system ran like this for some unknown amount of time.
7) All the processes are then sent a signal to exit, and at the same
time the top level cpuset.sched_load_balance is set back to 1.  This
is when the systems locked up.

Hopefully that is enough information to give someone more familiar
with the scheduler code an idea of where the bug is here.  I will
point out that in step #5 above there is a small window where the RT
tasks could encounter runtime limits but are still in a single big
scheduling domain.  I don't know if that is what happened or if it is
simply sufficient to hit the runtime limits while the system is split
into four domains.  For the curious we are using the default RT
runtime limits:

# grep . /proc/sys/kernel/sched_rt_*
/proc/sys/kernel/sched_rt_period_us:1000000
/proc/sys/kernel/sched_rt_runtime_us:950000

Let me know if anyone needs any more information about this issue.

Thanks,
Shawn

-- 

---
This email, along with any attachments, is confidential. If you 
believe you received this message in error, please contact the 
sender immediately and delete all copies of the message.  
Thank you.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mlx4_en_alloc_frag allocation failures

2012-09-28 Thread Shawn Bohrer
On Fri, Sep 28, 2012 at 05:50:08PM +0200, Eric Dumazet wrote:
> On Fri, 2012-09-28 at 10:14 -0500, Shawn Bohrer wrote:
> > We've got a new application that is receiving UDP multicast data using
> > AF_PACKET and writing out the packets in a custom format to disk.  The
> > packet rates are bursty, but it seems to be roughly 100 Mbps on
> > average for 1 minute periods.  With this application running all day
> > we get a lot of these messages:
> > 
> > [1298269.103034] kswapd1: page allocation failure: order:2, mode:0x4020
> > [1298269.103038] Pid: 80, comm: kswapd1 Not tainted 3.4.9-2.rgm.fc16.x86_64 
> > #1
> > [1298269.103040] Call Trace:
> > [1298269.103041][] warn_alloc_failed+0xf6/0x160
> > [1298269.103053]  [] ? skb_copy_bits+0x16d/0x2c0
> > [1298269.103058]  [] ? wakeup_kswapd+0x69/0x160
> > [1298269.103060]  [] __alloc_pages_nodemask+0x6e8/0x930
> > [1298269.103064]  [] alloc_pages_current+0xb6/0x120
> > [1298269.103070]  [] mlx4_en_alloc_frag+0x16b/0x1e0 
> > [mlx4_en]
> > [1298269.103073]  [] mlx4_en_complete_rx_desc+0x120/0x1d0 
> > [mlx4_en]
> > [1298269.103076]  [] mlx4_en_process_rx_cq+0x584/0x700 
> > [mlx4_en]
> > [1298269.103079]  [] mlx4_en_poll_rx_cq+0x3f/0x80 
> > [mlx4_en]
> > [1298269.103083]  [] net_rx_action+0x119/0x210
> > [1298269.103086]  [] __do_softirq+0xb0/0x220
> > [1298269.103090]  [] ? handle_irq_event+0x4d/0x70
> > [1298269.103095]  [] call_softirq+0x1c/0x30
> > [1298269.103100]  [] do_softirq+0x55/0x90
> > [1298269.103101]  [] irq_exit+0x75/0x80
> > [1298269.103103]  [] do_IRQ+0x63/0xe0
> > [1298269.103107]  [] common_interrupt+0x67/0x67
> > [1298269.103108][] ? 
> > _raw_spin_unlock_irqrestore+0xf/0x20
> > [1298269.103113]  [] compaction_alloc+0x361/0x3f0
> > [1298269.103115]  [] ? pagevec_lru_move_fn+0xd7/0xf0
> > [1298269.103118]  [] migrate_pages+0xa9/0x470
> > [1298269.103120]  [] ? 
> > perf_trace_mm_compaction_migratepages+0xd0/0xd0
> > [1298269.103122]  [] compact_zone+0x4cb/0x910
> > [1298269.103124]  [] __compact_pgdat+0x14b/0x190
> > [1298269.103125]  [] compact_pgdat+0x2d/0x30
> > [1298269.103129]  [] ? fragmentation_index+0x19/0x70
> > [1298269.103131]  [] balance_pgdat+0x6ef/0x710
> > [1298269.103133]  [] kswapd+0x14a/0x390
> > [1298269.103136]  [] ? add_wait_queue+0x60/0x60
> > [1298269.103138]  [] ? balance_pgdat+0x710/0x710
> > [1298269.103140]  [] kthread+0x93/0xa0
> > [1298269.103142]  [] kernel_thread_helper+0x4/0x10
> > [1298269.103144]  [] ? kthread_worker_fn+0x140/0x140
> > [1298269.103146]  [] ? gs_change+0xb/0xb
> > 
> > The kernel is based on a Fedora 16 kernel and actually has the 3.4.10
> > patches applied.  I can easily test patches or different kernels.
> > 
> > I'm mostly wondering if there is anything that can be done about these
> > failures?  It appears that these failures have to do with handling
> > fragmented IP frames, but the majority of the packets this machines
> > should not be fragmented (there are probably some that are).
> > 
> > From a memory management point of view the system has 48GB of RAM, and
> > typically 44GB of that is page cache.  The dirty pages seem to hover
> > around 5-6MB and the filesystem/disks don't seem to have any problems
> > keeping up with writing out the data.
> 
> What is the value of /proc/sys/vm/min_free_kbytes ?

$ cat /proc/sys/vm/min_free_kbytes
90112
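
(If raising it turns out to help, it can be bumped at runtime, e.g.:

  # sysctl -w vm.min_free_kbytes=262144

where the value above is only an example.)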

--
Shawn

-- 

---
This email, along with any attachments, is confidential. If you 
believe you received this message in error, please contact the 
sender immediately and delete all copies of the message.  
Thank you.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


mlx4_en_alloc_frag allocation failures

2012-09-28 Thread Shawn Bohrer
We've got a new application that is receiving UDP multicast data using
AF_PACKET and writing out the packets in a custom format to disk.  The
packet rates are bursty, but it seems to be roughly 100 Mbps on
average for 1 minute periods.  With this application running all day
we get a lot of these messages:

[1298269.103034] kswapd1: page allocation failure: order:2, mode:0x4020
[1298269.103038] Pid: 80, comm: kswapd1 Not tainted 3.4.9-2.rgm.fc16.x86_64 #1
[1298269.103040] Call Trace:
[1298269.103041]  IRQ  [810db746] warn_alloc_failed+0xf6/0x160
[1298269.103053]  [813c767d] ? skb_copy_bits+0x16d/0x2c0
[1298269.103058]  [810e83a9] ? wakeup_kswapd+0x69/0x160
[1298269.103060]  [810df188] __alloc_pages_nodemask+0x6e8/0x930
[1298269.103064]  [81114316] alloc_pages_current+0xb6/0x120
[1298269.103070]  [a00c142b] mlx4_en_alloc_frag+0x16b/0x1e0 [mlx4_en]
[1298269.103073]  [a00c18a0] mlx4_en_complete_rx_desc+0x120/0x1d0 
[mlx4_en]
[1298269.103076]  [a00c27d4] mlx4_en_process_rx_cq+0x584/0x700 
[mlx4_en]
[1298269.103079]  [a00c29ef] mlx4_en_poll_rx_cq+0x3f/0x80 [mlx4_en]
[1298269.103083]  [813d6569] net_rx_action+0x119/0x210
[1298269.103086]  [8103c690] __do_softirq+0xb0/0x220
[1298269.103090]  [8109911d] ? handle_irq_event+0x4d/0x70
[1298269.103095]  [8148e30c] call_softirq+0x1c/0x30
[1298269.103100]  [81003ef5] do_softirq+0x55/0x90
[1298269.103101]  [8103ca65] irq_exit+0x75/0x80
[1298269.103103]  [8148e853] do_IRQ+0x63/0xe0
[1298269.103107]  [81485667] common_interrupt+0x67/0x67
[1298269.103108]  EOI  [8148523f] ? 
_raw_spin_unlock_irqrestore+0xf/0x20
[1298269.103113]  [811184b1] compaction_alloc+0x361/0x3f0
[1298269.103115]  [810e29b7] ? pagevec_lru_move_fn+0xd7/0xf0
[1298269.103118]  [81123d19] migrate_pages+0xa9/0x470
[1298269.103120]  [81118150] ? 
perf_trace_mm_compaction_migratepages+0xd0/0xd0
[1298269.103122]  [81118abb] compact_zone+0x4cb/0x910
[1298269.103124]  [8111904b] __compact_pgdat+0x14b/0x190
[1298269.103125]  [8111931d] compact_pgdat+0x2d/0x30
[1298269.103129]  [810f32b9] ? fragmentation_index+0x19/0x70
[1298269.103131]  [810eb15f] balance_pgdat+0x6ef/0x710
[1298269.103133]  [810eb2ca] kswapd+0x14a/0x390
[1298269.103136]  [810567c0] ? add_wait_queue+0x60/0x60
[1298269.103138]  [810eb180] ? balance_pgdat+0x710/0x710
[1298269.103140]  [81055e93] kthread+0x93/0xa0
[1298269.103142]  [8148e214] kernel_thread_helper+0x4/0x10
[1298269.103144]  [81055e00] ? kthread_worker_fn+0x140/0x140
[1298269.103146]  [8148e210] ? gs_change+0xb/0xb

The kernel is based on a Fedora 16 kernel and actually has the 3.4.10
patches applied.  I can easily test patches or different kernels.

I'm mostly wondering if there is anything that can be done about these
failures?  It appears that these failures have to do with handling
fragmented IP frames, but the majority of the packets this machines
should not be fragmented (there are probably some that are).

From a memory management point of view the system has 48GB of RAM, and
typically 44GB of that is page cache.  The dirty pages seem to hover
around 5-6MB and the filesystem/disks don't seem to have any problems
keeping up with writing out the data.
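
For what it's worth, whether order-2 blocks are actually scarce when
the failures fire can be checked with something like:

  $ cat /proc/buddyinfo

The numeric columns are counts of free blocks per order for each zone,
so the third numeric column is order 2.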

--
Shawn

-- 

---
This email, along with any attachments, is confidential. If you 
believe you received this message in error, please contact the 
sender immediately and delete all copies of the message.  
Thank you.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] CodingStyle updates

2007-09-29 Thread Shawn Bohrer
On Fri, Sep 28, 2007 at 05:32:00PM -0400, Erez Zadok wrote:
> 1. Updates chapter 13 (printing kernel messages) to expand on the use of
>pr_debug()/pr_info(), what to avoid, and how to hook your debug code with
>kernel.h.
> 
> 2. New chapter 19, branch prediction optimizations, discusses the whole
>un/likely issue.
> 
> Cc: "Kok, Auke" <[EMAIL PROTECTED]>
> Cc: Kyle Moffett <[EMAIL PROTECTED]>
> Cc: Jan Engelhardt <[EMAIL PROTECTED]>
> Cc: Adrian Bunk <[EMAIL PROTECTED]>
> Cc: roel <[EMAIL PROTECTED]>
> 
> Signed-off-by: Erez Zadok <[EMAIL PROTECTED]>
> ---
>  Documentation/CodingStyle |   88 +++-
>  1 files changed, 86 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/CodingStyle b/Documentation/CodingStyle
> index 7f1730f..00b29e4 100644
> --- a/Documentation/CodingStyle
> +++ b/Documentation/CodingStyle
> @@ -643,8 +643,26 @@ Printing numbers in parentheses (%d) adds no value and 
> should be avoided.
>  There are a number of driver model diagnostic macros in <linux/device.h>
>  which you should use to make sure messages are matched to the right device
>  and driver, and are tagged with the right level:  dev_err(), dev_warn(),
> -dev_info(), and so forth.  For messages that aren't associated with a
> -particular device, <linux/kernel.h> defines pr_debug() and pr_info().
> +dev_info(), and so forth.
> +
> +A number of people often like to define their own debugging printf's,
> +wrapping printk's in #ifdef's that get turned on only when subsystem
> +debugging is compiled in (e.g., dprintk, Dprintk, DPRINTK, etc.).  Please
> +don't reinvent the wheel but use existing mechanisms.  For messages that
> +aren't associated with a particular device, <linux/kernel.h> defines
> +pr_debug() and pr_info(); the latter two translate to printk(KERN_DEBUG) and

The latter two?  Since there are only two presented I think there is no
reason to say "latter".

> +printk(KERN_INFO), respectively.  However, to get pr_debug() to actually
> +emit the message, you'll need to turn on DEBUG in your code, which can be
> +done as follows in your subsystem Makefile:
> +
> +ifeq ($(CONFIG_WHATEVER_DEBUG),y)
> +EXTRA_CFLAGS += -DDEBUG
> +endif
> +
> +In this way, you can create a Kconfig parameter to turn on debugging at
> +compile time, which will also turn on DEBUG, to enable pr_debug() to emit
> +actual messages; conversely, when CONFIG_WHATEVER_DEBUG is off, DEBUG is
> +off, and pr_debug() will display nothing.
>  
>  Coming up with good debugging messages can be quite a challenge; and once
>  you have them, they can be a huge help for remote troubleshooting.  Such
> @@ -779,6 +797,69 @@ includes markers for indentation and mode configuration. 
>  People may use their
>  own custom mode, or may have some other magic method for making indentation
>  work correctly.
>  
> + Chapter 19: branch prediction optimizations
> +
> +The kernel includes macros called likely() and unlikely(), which can be used
> +as hints to the compiler to optimize branch prediction.  They operate by
> +asking gcc to shuffle the code around so that the more favorable outcome
> +executes linearly, avoiding a JMP instruction; this can improve cache
> +pipeline efficiency.  For technical details how these macros work, see the
> +References section at the end of this document.
> +
> +An example use of this as as follows:

  ^^

> +
> + ptr = kmalloc(size, GFP_KERNEL);
> + if (unlikely(!ptr))
> + ...
> +
> +or
> + err = some_function(...);
> + if (likely(!err))
> + ...

--
Shawn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH][RFC] 4K stacks default, not a debug thing any more...?

2007-07-17 Thread Shawn Bohrer
On Tue, Jul 17, 2007 at 02:57:45AM +0200, Rene Herman wrote:
>  True enough. I'm rather wondering though why RHEL is shipping with it if 
>  it's a _real_ problem. Scribbling junk all over kernel memory would be the 
>  kind of thing I'd imagine you'd mightely piss-off enterprise customers with. 
>  But well, sure, that rather quickly becomes a self-referential argument I 
>  guess.

I can't speak for Fedora, but RHEL disables XFS in their kernel likely
because it is known to cause problems with 4K stacks.

>  Well, no. "oldconfig" works fine, and other than that, all failure modes 
>  I've heard about also in this thread are MD/LVM/XFS. This is extremely 
>  widely tested stuff in at least Fedora and RHEL.

Again don't assume that because Fedora and RHEL have 4K stacks means
that MD/LVM/XFS is widely tested.

Additionally I think I should point out that the problems pointed out so
far are not the only problem areas with 4K stacks.  There are out of
tree drivers to consider as well, and use cases like ndiswrapper.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

