Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3
On Mon, Aug 13, 2012 at 02:35:46PM -0600, Jim Schutt wrote:
> Hi Mel,
>
> On 08/12/2012 02:22 PM, Mel Gorman wrote:
> >
> > I went through the patch again but only found the following which is a
> > weak candidate. Still, can you retest with the following patch on top and
> > CONFIG_PROVE_LOCKING set please?
>
> I've gotten in several hours of testing on this patch with
> no issues at all, and no output from CONFIG_PROVE_LOCKING
> (I'm assuming it would show up on a serial console). So,
> it seems to me this patch has done the trick.

Super.

> CPU utilization is staying under control, and write-out rate
> is good.

Even better.

> You can add my Tested-by: as you see fit. If you work
> up any refinements and would like me to test, please
> let me know.

I'll be adding your Tested-by and I'll keep you cc'd on the series. It'll
look a little different because I expect to adjust it slightly to match
Andrew's tree, but there should be no major surprises, and my expectation
is that testing a -rc kernel after it gets merged is all that is necessary.

I'm planning to backport this to -stable but it remains to be seen whether
I can convince the relevant maintainers that it should be merged.

Thanks.

--
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3
Hi Mel,

On 08/12/2012 02:22 PM, Mel Gorman wrote:
> I went through the patch again but only found the following which is a
> weak candidate. Still, can you retest with the following patch on top and
> CONFIG_PROVE_LOCKING set please?

I've gotten in several hours of testing on this patch with
no issues at all, and no output from CONFIG_PROVE_LOCKING
(I'm assuming it would show up on a serial console). So,
it seems to me this patch has done the trick.

CPU utilization is staying under control, and write-out rate
is good.

You can add my Tested-by: as you see fit. If you work
up any refinements and would like me to test, please
let me know.

Thanks -- Jim
Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3
On Fri, Aug 10, 2012 at 11:20:07AM -0600, Jim Schutt wrote:
> On 08/10/2012 05:02 AM, Mel Gorman wrote:
> > On Thu, Aug 09, 2012 at 04:38:24PM -0600, Jim Schutt wrote:
> > > > Ok, this is an untested hack and I expect it would drop allocation
> > > > success rates again under load (but not as much). Can you test again
> > > > and see what effect, if any, it has please?
> > > >
> > > > ---8<---
> > > > mm: compaction: back out if contended
> > > >
> > > > ---
> > >
> > > Initial testing with this patch looks very good from
> > > my perspective; CPU utilization stays reasonable,
> > > write-out rate stays high, no signs of stress.
> > > Here's an example after ~10 minutes under my test load:
>
> Hmmm, I wonder if I should have tested this patch longer,
> in view of the trouble I ran into testing the new patch?
> See below.

The two patches are quite different in what they do. I think it's unlikely
they would share a common bug.

> > ---8<---
> > mm: compaction: Abort async compaction if locks are contended or taking
> > too long
>
> Hmmm, while testing this patch, a couple of my servers got
> stuck after ~30 minutes or so, like this:
>
> [ 2515.869936] INFO: task ceph-osd:30375 blocked for more than 120 seconds.
> [ 2515.876630] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2515.884447] ceph-osd D 0 30375 1 0x
> [ 2515.891531] 8802e1a99e38 0082 88056b38e298 8802e1a99fd8
> [ 2515.899013] 8802e1a98010 8802e1a98000 8802e1a98000 8802e1a98000
> [ 2515.906482] 8802e1a99fd8 8802e1a98000 880697d31700 8802e1a84500
> [ 2515.913968] Call Trace:
> [ 2515.916433] [] schedule+0x5d/0x60
> [ 2515.921417] [] rwsem_down_failed_common+0x105/0x140
> [ 2515.927938] [] rwsem_down_write_failed+0x13/0x20
> [ 2515.934195] [] call_rwsem_down_write_failed+0x13/0x20
> [ 2515.940934] [] ? down_write+0x45/0x50
> [ 2515.946244] [] sys_mprotect+0xd2/0x240
> [ 2515.951640] [] system_call_fastpath+0x16/0x1b
>
> I tried to capture a perf trace while this was going on, but it
> never completed. "ps" on this system reports lots of kernel threads
> and some user-space stuff, but hangs part way through - no ceph
> executables in the output, oddly.

ps is probably locking up because it's trying to access a proc file for a
process that is not releasing the mmap_sem.

> I can retest your earlier patch for a longer period, to
> see if it does the same thing, or I can do some other thing
> if you tell me what it is.
>
> Also, FWIW I sorted a little through SysRq-T output from such
> a system; these bits looked interesting:
>
> [ 3663.685097] INFO: rcu_sched self-detected stall on CPU { 17} (t=6 jiffies)
> [ 3663.685099] sending NMI to all CPUs:
> [ 3663.685101] NMI backtrace for cpu 0
> [ 3663.685102] CPU 0 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
> [ 3663.685138]
> [ 3663.685140] Pid: 100027, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
> [ 3663.685142] RIP: 0010:[] [] _raw_spin_lock_irqsave+0x45/0x60
> [ 3663.685148] RSP: 0018:880a08191898 EFLAGS: 0012
> [ 3663.685149] RAX: 88063fffcb00 RBX: 88063fffcb00 RCX: 00c5
> [ 3663.685149] RDX: 00bf RSI: 015a RDI: 88063fffcb00
> [ 3663.685150] RBP: 880a081918a8 R08: R09:
> [ 3663.685151] R10: 88063fffcb98 R11: 88063fffcc38 R12: 0246
> [ 3663.685152] R13: 88063fffcba8 R14: 88063fffcb90 R15: 88063fffc680
> [ 3663.685153] FS: 7fff90ae0700() GS:880627c0() knlGS:
> [ 3663.685154] CS: 0010 DS: ES: CR0: 8005003b
> [ 3663.685155] CR2: ff600400 CR3: 0002b8fbe000 CR4: 07f0
> [ 3663.685156] DR0: DR1: DR2:
> [ 3663.685157] DR3: DR6: 0ff0 DR7: 0400
> [ 3663.685158] Process ceph-osd (pid: 100027, threadinfo 880a0819, task 880a9a29ae00)
> [ 3663.685158] Stack:
> [ 3663.685159] 130a 880a08191948 8111a760
> [ 3663.685162] 81a13420 0009
Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3
On 08/10/2012 05:02 AM, Mel Gorman wrote:
> On Thu, Aug 09, 2012 at 04:38:24PM -0600, Jim Schutt wrote:
> > > Ok, this is an untested hack and I expect it would drop allocation
> > > success rates again under load (but not as much). Can you test again
> > > and see what effect, if any, it has please?
> > >
> > > ---8<---
> > > mm: compaction: back out if contended
> > >
> > > ---
> >
> > Initial testing with this patch looks very good from
> > my perspective; CPU utilization stays reasonable,
> > write-out rate stays high, no signs of stress.
> > Here's an example after ~10 minutes under my test load:

Hmmm, I wonder if I should have tested this patch longer,
in view of the trouble I ran into testing the new patch?
See below.

> Excellent, so it is contention that is the problem.
>
> > I'll continue testing tomorrow to be sure nothing
> > shows up after continued testing.
> >
> > If this passes your allocation success rate testing,
> > I'm happy with this performance for 3.6 - if not, I'll
> > be happy to test any further patches.
>
> It does impair allocation success rates as I expected (they're still ok
> but not as high as I'd like) so I implemented the following instead. It
> attempts to back off when contention is detected or compaction is taking
> too long. It does not back off as quickly as the first prototype did, so
> I'd like to see if it addresses your problem or not.
>
> > I really appreciate getting the chance to test out
> > your patchset.
>
> I appreciate that you have a workload that demonstrates the problem and
> will test patches. I will not abuse this and hope to keep the revisions
> to a minimum. Thanks.
>
> ---8<---
> mm: compaction: Abort async compaction if locks are contended or taking
> too long

Hmmm, while testing this patch, a couple of my servers got
stuck after ~30 minutes or so, like this:

[ 2515.869936] INFO: task ceph-osd:30375 blocked for more than 120 seconds.
[ 2515.876630] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2515.884447] ceph-osd D 0 30375 1 0x
[ 2515.891531] 8802e1a99e38 0082 88056b38e298 8802e1a99fd8
[ 2515.899013] 8802e1a98010 8802e1a98000 8802e1a98000 8802e1a98000
[ 2515.906482] 8802e1a99fd8 8802e1a98000 880697d31700 8802e1a84500
[ 2515.913968] Call Trace:
[ 2515.916433] [] schedule+0x5d/0x60
[ 2515.921417] [] rwsem_down_failed_common+0x105/0x140
[ 2515.927938] [] rwsem_down_write_failed+0x13/0x20
[ 2515.934195] [] call_rwsem_down_write_failed+0x13/0x20
[ 2515.940934] [] ? down_write+0x45/0x50
[ 2515.946244] [] sys_mprotect+0xd2/0x240
[ 2515.951640] [] system_call_fastpath+0x16/0x1b
[ 2515.957646] INFO: task ceph-osd:95698 blocked for more than 120 seconds.
[ 2515.964330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2515.972141] ceph-osd D 0 95698 1 0x
[ 2515.979223] 8802b049fe38 0082 88056b38e2a0 8802b049ffd8
[ 2515.986700] 8802b049e010 8802b049e000 8802b049e000 8802b049e000
[ 2515.994176] 8802b049ffd8 8802b049e000 8809832ddc00 880611592e00
[ 2516.001653] Call Trace:
[ 2516.004111] [] schedule+0x5d/0x60
[ 2516.009072] [] rwsem_down_failed_common+0x105/0x140
[ 2516.015589] [] rwsem_down_write_failed+0x13/0x20
[ 2516.021861] [] call_rwsem_down_write_failed+0x13/0x20
[ 2516.028555] [] ? down_write+0x45/0x50
[ 2516.033859] [] sys_mprotect+0xd2/0x240
[ 2516.039248] [] system_call_fastpath+0x16/0x1b
[ 2516.045248] INFO: task ceph-osd:95699 blocked for more than 120 seconds.
[ 2516.051934] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2516.059753] ceph-osd D 0 95699 1 0x
[ 2516.066832] 880c022d3dc8 0082 880c022d2000 880c022d3fd8
[ 2516.074302] 880c022d2010 880c022d2000 880c022d2000 880c022d2000
[ 2516.081784] 880c022d3fd8 880c022d2000 8806224cc500 88096b64dc00
[ 2516.089254] Call Trace:
[ 2516.091702] [] schedule+0x5d/0x60
[ 2516.096656] [] rwsem_down_failed_common+0x105/0x140
[ 2516.103176] [] rwsem_down_write_failed+0x13/0x20
[ 2516.109443] [] call_rwsem_down_write_failed+0x13/0x20
[ 2516.116134] [] ? down_write+0x45/0x50
[ 2516.121442] [] vm_mmap_pgoff+0x6e/0xb0
[ 2516.126861] [] sys_mmap_pgoff+0x18a/0x190
[ 2516.132552] [] ? trace_hardirqs_on_thunk+0x3a/0x3c
[ 2516.138985] [] sys_mmap+0x22/0x30
[ 2516.143945] [] system_call_fastpath+0x16/0x1b
[ 2516.149949] INFO: task ceph-osd:95816 blocked for more than 120 seconds.
[ 2516.156632] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2516.16] ceph-osd D 0 95816 1 0x
[ 2516.171521] 880332991e38 0082 880332991de8 880332991fd8
[ 2516.178992] 880332990010 88033299 88033299 88033299
[ 2516.186466] 880332991fd8 88033299 880697d31700 880a92c32e00
[
Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3
On Thu, Aug 09, 2012 at 04:38:24PM -0600, Jim Schutt wrote:
> >
> > My conclusion looking at the vmstat data is that everything is looking ok
> > until system CPU usage goes through the roof. I'm assuming that's what we
> > are all still looking at.
>
> I'm concerned about both the high CPU usage as well as the
> reduction in write-out rate, but I've been assuming the latter
> is caused by the former.

Almost certainly.

> >
> > Ok, this is an untested hack and I expect it would drop allocation success
> > rates again under load (but not as much). Can you test again and see what
> > effect, if any, it has please?
> >
> > ---8<---
> > mm: compaction: back out if contended
> >
> > ---
>
> Initial testing with this patch looks very good from
> my perspective; CPU utilization stays reasonable,
> write-out rate stays high, no signs of stress.
> Here's an example after ~10 minutes under my test load:

Excellent, so it is contention that is the problem.

> I'll continue testing tomorrow to be sure nothing
> shows up after continued testing.
>
> If this passes your allocation success rate testing,
> I'm happy with this performance for 3.6 - if not, I'll
> be happy to test any further patches.

It does impair allocation success rates as I expected (they're still ok
but not as high as I'd like) so I implemented the following instead. It
attempts to back off when contention is detected or compaction is taking
too long. It does not back off as quickly as the first prototype did, so
I'd like to see if it addresses your problem or not.

> I really appreciate getting the chance to test out
> your patchset.

I appreciate that you have a workload that demonstrates the problem and
will test patches. I will not abuse this and hope to keep the revisions
to a minimum. Thanks.

---8<---
mm: compaction: Abort async compaction if locks are contended or taking too long

Jim Schutt reported a problem that pointed at compaction contending
heavily on locks. The workload is straight-forward and in his own words:

    The systems in question have 24 SAS drives spread across 3 HBAs,
    running 24 Ceph OSD instances, one per drive. FWIW these servers
    are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160
    Ceph Linux clients doing dd simultaneously to a Ceph file system
    backed by 12 of these servers.

Early in the test everything looks fine:

    procs ---memory-- ---swap-- -io --system-- -cpu---
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    31 15 0 287216576 3860662800 2 11582 14 1 3 95 0 0
    27 15 0 225288576 385833840018 016 203357 134876 11 56 17 15 0
    28 17 0 219256576 385447360011 2305932 203141 146296 11 49 23 17 0
    6 18 0 215596576 3855287200 7 2363207 215264 166502 12 45 22 20 0
    22 18 0 226984576 3859640400 3 2445741 223114 179527 12 43 23 22 0

and then it goes to pot:

    procs ---memory-- ---swap-- -io --system-- -cpu---
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    163 8 0 464308576 367913680011 22210 866 536 3 13 79 4 0
    207 14 0 917752576 3618192800 712 1345376 134598 47367 7 90 1 2 0
    123 12 0 685516576 3629614800 429 1386615 158494 60077 8 84 5 3 0
    123 12 0 598572576 3633372800 1107 1233281 147542 62351 7 84 5 4 0
    622 7 0 660768576 3611826400 557 1345548 151394 59353 7 85 4 3 0
    223 11 0 283960576 364638680046 1107160 121846 33006 6 93 1 1 0

Note that system CPU usage is very high and blocks being written out
have dropped by 42%. He analysed this with perf and found:

    perf record -g -a sleep 10
    perf report --sort symbol --call-graph fractal,5

    34.63% [k] _raw_spin_lock_irqsave
    |
    |--97.30%-- isolate_freepages
    |          compaction_alloc
    |          unmap_and_move
    |          migrate_pages
    |          compact_zone
    |          compact_zone_order
    |          try_to_compact_pages
    |          __alloc_pages_direct_compact
    |          __alloc_pages_slowpath
    |          __alloc_pages_nodemask
    |          alloc_pages_vma
    |          do_huge_pmd_anonymous_page
    |          handle_mm_fault
    |          do_page_fault
    |          page_fault
    |          |
    |          |--87.39%-- skb_copy_datagram_iovec
    |          |
Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3
On 08/09/2012 02:46 PM, Mel Gorman wrote: On Thu, Aug 09, 2012 at 12:16:35PM -0600, Jim Schutt wrote: On 08/09/2012 07:49 AM, Mel Gorman wrote: Changelog since V2 o Capture !MIGRATE_MOVABLE pages where possible o Document the treatment of MIGRATE_MOVABLE pages while capturing o Expand changelogs Changelog since V1 o Dropped kswapd related patch, basically a no-op and regresses if fixed (minchan) o Expanded changelogs a little Allocation success rates have been far lower since 3.4 due to commit [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This commit was introduced for good reasons and it was known in advance that the success rates would suffer but it was justified on the grounds that the high allocation success rates were achieved by aggressive reclaim. Success rates are expected to suffer even more in 3.6 due to commit [7db8889a: mm: have order> 0 compaction start off where it left] which testing has shown to severely reduce allocation success rates under load - to 0% in one case. There is a proposed change to that patch in this series and it would be ideal if Jim Schutt could retest the workload that led to commit [7db8889a: mm: have order> 0 compaction start off where it left]. On my first test of this patch series on top of 3.5, I ran into an instance of what I think is the sort of thing that patch 4/5 was fixing. Here's what vmstat had to say during that period: My conclusion looking at the vmstat data is that everything is looking ok until system CPU usage goes through the roof. I'm assuming that's what we are all still looking at. I'm concerned about both the high CPU usage as well as the reduction in write-out rate, but I've been assuming the latter is caused by the former. Ok, this is an untested hack and I expect it would drop allocation success rates again under load (but not as much). Can you test again and see what effect, if any, it has please? 
> ---8<---
> mm: compaction: back out if contended
> ---

Initial testing with this patch looks very good from my perspective;
CPU utilization stays reasonable, write-out rate stays high, no signs
of stress. Here's an example after ~10 minutes under my test load:

2012-08-09 16:26:07.550-06:00
vmstat -w 4 16
procs ---memory-- ---swap-- -io --system-- -cpu---
 r b swpd free buff cache si so bi bo in cs us sy id wa st
21 19 0 351628576 378354400017 44394 1241 653 6 20 64 9 0
11 11 0 365520576 3789306000 124 2121508 203450 170957 12 46 25 17 0
13 16 0 359888576 379544560098 2185033 209473 171571 13 44 25 18 0
17 15 0 353728576 380105360089 2170971 208052 167988 13 43 26 18 0
17 16 0 349732576 3804828400 135 2217752 218754 174170 13 49 21 16 0
43 13 0 343280576 3804650000 153 2207135 217872 179519 13 47 23 18 0
26 13 0 350968576 3793718400 147 2189822 214276 176697 13 47 23 17 0
4 12 0 350080576 3795836400 226 2145212 207077 172163 12 44 24 20 0
15 13 0 353124576 3792104000 145 2078422 197231 166381 12 41 30 17 0
14 15 0 348964576 3794958800 107 2020853 188192 164064 12 39 30 20 0
21 9 0 354784576 3795122800 117 2148090 204307 165609 13 48 22 18 0
36 16 0 347368576 3798982400 166 2208681 216392 178114 13 47 24 16 0
28 15 0 300656576 3806091200 164 2181681 214618 175132 13 45 24 18 0
9 16 0 295484576 3809218400 153 2156909 218993 180289 13 43 27 17 0
17 16 0 346760576 3797900800 165 2124168 198730 173455 12 44 27 18 0
14 17 0 360988576 3795713600 142 2092248 197430 168199 12 42 29 17 0

I'll continue testing tomorrow to be sure nothing shows up after
continued testing. If this passes your allocation success rate testing,
I'm happy with this performance for 3.6 - if not, I'll be happy to test
any further patches.

I really appreciate getting the chance to test out your patchset.
Thanks

-- Jim

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3
On Thu, Aug 09, 2012 at 12:16:35PM -0600, Jim Schutt wrote:
> On 08/09/2012 07:49 AM, Mel Gorman wrote:
> > Changelog since V2
> > o Capture !MIGRATE_MOVABLE pages where possible
> > o Document the treatment of MIGRATE_MOVABLE pages while capturing
> > o Expand changelogs
> >
> > Changelog since V1
> > o Dropped kswapd related patch, basically a no-op and regresses if
> >   fixed (minchan)
> > o Expanded changelogs a little
> >
> > Allocation success rates have been far lower since 3.4 due to commit
> > [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
> > commit was introduced for good reasons and it was known in advance that
> > the success rates would suffer but it was justified on the grounds that
> > the high allocation success rates were achieved by aggressive reclaim.
> > Success rates are expected to suffer even more in 3.6 due to commit
> > [7db8889a: mm: have order > 0 compaction start off where it left] which
> > testing has shown to severely reduce allocation success rates under
> > load - to 0% in one case. There is a proposed change to that patch in
> > this series and it would be ideal if Jim Schutt could retest the
> > workload that led to commit [7db8889a: mm: have order > 0 compaction
> > start off where it left].
>
> On my first test of this patch series on top of 3.5, I ran into an
> instance of what I think is the sort of thing that patch 4/5 was
> fixing. Here's what vmstat had to say during that period:
>
> SNIP
>
> My conclusion looking at the vmstat data is that everything is looking
> ok until system CPU usage goes through the roof. I'm assuming that's
> what we are all still looking at.
I am still concerned that what patch 4/5 was actually doing was bypassing
compaction almost entirely in the contended case, which "works" but is not
exactly what was expected.

> And here's a perf report, captured/displayed with
>    perf record -g -a sleep 10
>    perf report --sort symbol --call-graph fractal,5
> sometime during that period just after 12:00:09, when
> the run queue was > 100.
>
> --
>
> Processed 0 events and LOST 1175296!
>
> #
> 34.63%  [k] _raw_spin_lock_irqsave
>         |
>         |--97.30%-- isolate_freepages
>         |          compaction_alloc
>         |          unmap_and_move
>         |          migrate_pages
>         |          compact_zone
>         |          compact_zone_order
>         |          try_to_compact_pages
>         |          __alloc_pages_direct_compact
>         |          __alloc_pages_slowpath
>         |          __alloc_pages_nodemask
>         |          alloc_pages_vma
>         |          do_huge_pmd_anonymous_page
>         |          handle_mm_fault
>         |          do_page_fault
>         |          page_fault
>         |          |
>         |          |--87.39%-- skb_copy_datagram_iovec
>         |          |          tcp_recvmsg
>         |          |          inet_recvmsg
>         |          |          sock_recvmsg
>         |          |          sys_recvfrom
>         |          |          system_call
>         |          |          __recv
>         |          |          |
>         |          |           --100.00%-- (nil)
>         |          |
>         |           --12.61%-- memcpy
>          --2.70%-- [...]

So lets just consider this. My interpretation of that is that we are
receiving data from the network and copying it into a buffer that is
faulted for the first time and backed by THP. All good so far *BUT* we
are contending like crazy on the zone lock and probably blocking normal
page allocations in the meantime.

> 14.31%  [k] _raw_spin_lock_irq
>         |
>         |--98.08%-- isolate_migratepages_range

This is a variation of the same problem but on the LRU lock this time.

> --
>
> If I understand what this is telling me, skb_copy_datagram_iovec
> is responsible for triggering the calls to isolate_freepages_block,
> isolate_migratepages_range, and isolate_freepages?

Sortof. I do not think it's the jumbo frames that are doing it, it's the
faulting of the buffer it copies to.

> FWIW, I'm using a Chelsio T4 NIC in these hosts, with jumbo frames
> and the Linux TCP stack (i.e., no stateful TCP offload).
Ok, this is an untested hack and I expect it would drop allocation
success rates again under load (but not as much). Can you test again
and see what effect, if any, it has please?

---8<---
mm: compaction: back out if contended
---
 include/linux/compaction.h |  4 ++--
 mm/compaction.c            | 45 ++--
 mm/internal.h              |  1 +
 mm/page_alloc.c            | 13 +
 4 files changed, 51 insertions(+), 12 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 5673459..9c94cba 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -22,7 +22,7 @@
Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3
On 08/09/2012 07:49 AM, Mel Gorman wrote:
> Changelog since V2
> o Capture !MIGRATE_MOVABLE pages where possible
> o Document the treatment of MIGRATE_MOVABLE pages while capturing
> o Expand changelogs
>
> Changelog since V1
> o Dropped kswapd related patch, basically a no-op and regresses if
>   fixed (minchan)
> o Expanded changelogs a little
>
> Allocation success rates have been far lower since 3.4 due to commit
> [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
> commit was introduced for good reasons and it was known in advance that
> the success rates would suffer but it was justified on the grounds that
> the high allocation success rates were achieved by aggressive reclaim.
> Success rates are expected to suffer even more in 3.6 due to commit
> [7db8889a: mm: have order > 0 compaction start off where it left] which
> testing has shown to severely reduce allocation success rates under
> load - to 0% in one case. There is a proposed change to that patch in
> this series and it would be ideal if Jim Schutt could retest the
> workload that led to commit [7db8889a: mm: have order > 0 compaction
> start off where it left].

On my first test of this patch series on top of 3.5, I ran into an
instance of what I think is the sort of thing that patch 4/5 was
fixing.
Here's what vmstat had to say during that period:

--

2012-08-09 11:58:04.107-06:00
vmstat -w 4 16
procs ---memory-- ---swap-- -io --system-- -cpu---
 r b swpd free buff cache si so bi bo in cs us sy id wa st
20 14 0 235884576 389160720012 17047 171 133 3 8 85 4 0
18 17 0 220272576 389559120086 2131838 200142 162956 12 38 31 19 0
17 9 0 244284576 389553280019 2179562 213775 167901 13 43 26 18 0
27 15 0 223036576 389526400024 2202816 217996 158390 14 47 25 15 0
17 16 0 233124576 3895990800 5 2268815 224647 165728 14 50 21 15 0
16 13 0 225840576 389957400052 2253829 216797 160551 14 47 23 16 0
22 13 0 260584576 389829080092 2196737 211694 140924 14 53 19 15 0
16 10 0 235784576 389171280022 2157466 210022 137630 14 54 19 14 0
12 13 0 214300576 389238480031 2187735 213862 142711 14 52 20 14 0
25 12 0 219528576 389195400011 2066523 205256 142080 13 49 23 15 0
26 14 0 229460576 389137040049 2108654 200692 135447 13 51 21 15 0
11 11 0 220376576 388624560045 2136419 207493 146813 13 49 22 16 0
36 12 0 229860576 3886978400 7 2163463 212223 151812 14 47 25 14 0
16 13 0 238356576 388914960067 2251650 221728 154429 14 52 20 14 0
65 15 0 211536576 389221080059 2237925 224237 156587 14 53 19 14 0
24 13 0 585024576 386340240037 2240929 229040 148192 15 61 14 10 0

2012-08-09 11:59:04.714-06:00
vmstat -w 4 16
procs ---memory-- ---swap-- -io --system-- -cpu---
 r b swpd free buff cache si so bi bo in cs us sy id wa st
43 8 0 794392576 383823160011 20491 576 420 3 10 82 4 0
127 6 0 579328576 384221560021 2006775 205582 119660 12 70 11 7 0
44 5 0 492860576 385123600046 1536525 173377 85320 10 78 7 4 0
218 9 0 585668576 382713200039 1257266 152869 64023 8 83 7 3 0
101 6 0 600168576 381281040010 1438705 160769 68374 9 84 5 3 0
62 5 0 597004576 380989720093 1376841 154012 63912 8 82 7 4 0
61 11 0 850396576 378087720046 1186816 145731 70453 7 78 9 6 0
124 7 0 437388576 381263200015 1208434 149736 57142 7 86 4 3 0
204 11 01105816576 373095320020 1327833 145979 52718 7 87 4 2 0
29 8 0 751020576 3736033200 8 1405474 169916 61982 9 85 4 2 0
38 7 0 626448576 373332440014 1328415 174665 74214 8 84 5 3 0
23 5 0 650040576 371342800028 1351209 179220 71631 8 85 5 2 0
40 10 0 610988576 3705429200 104 1272527 167530 73527 7 85 5 3 0
79 22 02076836576 3548734000 750 1249934 175420 70124 7 88 3 2 0
58 6
Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3
On Thu, Aug 09, 2012 at 08:36:12AM -0600, Jim Schutt wrote:
> Hi Mel,
>
> On 08/09/2012 07:49 AM, Mel Gorman wrote:
> > Changelog since V2
> > o Capture !MIGRATE_MOVABLE pages where possible
> > o Document the treatment of MIGRATE_MOVABLE pages while capturing
> > o Expand changelogs
> >
> > Changelog since V1
> > o Dropped kswapd related patch, basically a no-op and regresses if
> >   fixed (minchan)
> > o Expanded changelogs a little
> >
> > Allocation success rates have been far lower since 3.4 due to commit
> > [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
> > commit was introduced for good reasons and it was known in advance that
> > the success rates would suffer but it was justified on the grounds that
> > the high allocation success rates were achieved by aggressive reclaim.
> > Success rates are expected to suffer even more in 3.6 due to commit
> > [7db8889a: mm: have order > 0 compaction start off where it left] which
> > testing has shown to severely reduce allocation success rates under
> > load - to 0% in one case. There is a proposed change to that patch in
> > this series and it would be ideal if Jim Schutt could retest the
> > workload that led to commit [7db8889a: mm: have order > 0 compaction
> > start off where it left].
>
> I was successful at resolving my Ceph issue on 3.6-rc1, but ran
> into some other issue that isn't immediately obvious, and prevents
> me from testing your patch with 3.6-rc1. Today I will apply your
> patch series to 3.5 and test that way.
>
> Sorry for the delay.

No need to be sorry at all. I appreciate you taking the time and as
there were revisions since V1 you were better off waiting even if you
did not have the Ceph issue!

Thanks.

--
Mel Gorman
SUSE Labs
Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3
Hi Mel,

On 08/09/2012 07:49 AM, Mel Gorman wrote:
> Changelog since V2
> o Capture !MIGRATE_MOVABLE pages where possible
> o Document the treatment of MIGRATE_MOVABLE pages while capturing
> o Expand changelogs
>
> Changelog since V1
> o Dropped kswapd related patch, basically a no-op and regresses if
>   fixed (minchan)
> o Expanded changelogs a little
>
> Allocation success rates have been far lower since 3.4 due to commit
> [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
> commit was introduced for good reasons and it was known in advance that
> the success rates would suffer but it was justified on the grounds that
> the high allocation success rates were achieved by aggressive reclaim.
> Success rates are expected to suffer even more in 3.6 due to commit
> [7db8889a: mm: have order > 0 compaction start off where it left] which
> testing has shown to severely reduce allocation success rates under
> load - to 0% in one case. There is a proposed change to that patch in
> this series and it would be ideal if Jim Schutt could retest the
> workload that led to commit [7db8889a: mm: have order > 0 compaction
> start off where it left].

I was successful at resolving my Ceph issue on 3.6-rc1, but ran
into some other issue that isn't immediately obvious, and prevents
me from testing your patch with 3.6-rc1. Today I will apply your
patch series to 3.5 and test that way.

Sorry for the delay.

-- Jim