Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

2012-08-14 Thread Mel Gorman
On Mon, Aug 13, 2012 at 02:35:46PM -0600, Jim Schutt wrote:
> Hi Mel,
> 
> On 08/12/2012 02:22 PM, Mel Gorman wrote:
> 
> >
> >I went through the patch again but only found the following which is a
> >weak candidate. Still, can you retest with the following patch on top and
> >CONFIG_PROVE_LOCKING set please?
> >
> 
> I've gotten in several hours of testing on this patch with
> no issues at all, and no output from CONFIG_PROVE_LOCKING
> (I'm assuming it would show up on a serial console).  So,
> it seems to me this patch has done the trick.
> 

Super.

> CPU utilization is staying under control, and write-out rate
> is good.
> 

Even better.

> You can add my Tested-by: as you see fit.  If you work
> up any refinements and would like me to test, please
> let me know.
> 

I'll be adding your Tested-by and I'll keep you cc'd on the series. It'll
look a little different because I expect to adjust it slightly to match
Andrew's tree but there should be no major surprises and my expectation is
that testing a -rc kernel after it gets merged is all that is necessary. I'm
planning to backport this to -stable but it remains to be seen if I can
convince the relevant maintainers that it should be merged.

Thanks.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

2012-08-13 Thread Jim Schutt

Hi Mel,

On 08/12/2012 02:22 PM, Mel Gorman wrote:



I went through the patch again but only found the following which is a
weak candidate. Still, can you retest with the following patch on top and
CONFIG_PROVE_LOCKING set please?



I've gotten in several hours of testing on this patch with
no issues at all, and no output from CONFIG_PROVE_LOCKING
(I'm assuming it would show up on a serial console).  So,
it seems to me this patch has done the trick.

CPU utilization is staying under control, and write-out rate
is good.

You can add my Tested-by: as you see fit.  If you work
up any refinements and would like me to test, please
let me know.

Thanks -- Jim



Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

2012-08-12 Thread Mel Gorman
On Fri, Aug 10, 2012 at 11:20:07AM -0600, Jim Schutt wrote:
> On 08/10/2012 05:02 AM, Mel Gorman wrote:
> >On Thu, Aug 09, 2012 at 04:38:24PM -0600, Jim Schutt wrote:
> 
> >>>
> >>>Ok, this is an untested hack and I expect it would drop allocation success
> >>>rates again under load (but not as much). Can you test again and see what
> >>>effect, if any, it has please?
> >>>
> >>>---8<---
> >>>mm: compaction: back out if contended
> >>>
> >>>---
> >>
> >>
> >>
> >>Initial testing with this patch looks very good from
> >>my perspective; CPU utilization stays reasonable,
> >>write-out rate stays high, no signs of stress.
> >>Here's an example after ~10 minutes under my test load:
> >>
> 
> Hmmm, I wonder if I should have tested this patch longer,
> in view of the trouble I ran into testing the new patch?
> See below.
> 

The two patches are quite different in what they do. I think it's
unlikely they would share a common bug.

> > 
> >---8<---
> >mm: compaction: Abort async compaction if locks are contended or taking too 
> >long
> 
> 
> Hmmm, while testing this patch, a couple of my servers got
> stuck after ~30 minutes or so, like this:
> 
> [ 2515.869936] INFO: task ceph-osd:30375 blocked for more than 120 seconds.
> [ 2515.876630] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [ 2515.884447] ceph-osdD  0 30375  1 
> 0x
> [ 2515.891531]  8802e1a99e38 0082 88056b38e298 
> 8802e1a99fd8
> [ 2515.899013]  8802e1a98010 8802e1a98000 8802e1a98000 
> 8802e1a98000
> [ 2515.906482]  8802e1a99fd8 8802e1a98000 880697d31700 
> 8802e1a84500
> [ 2515.913968] Call Trace:
> [ 2515.916433]  [8147fded] schedule+0x5d/0x60
> [ 2515.921417]  [81480b25] rwsem_down_failed_common+0x105/0x140
> [ 2515.927938]  [81480b73] rwsem_down_write_failed+0x13/0x20
> [ 2515.934195]  [8124bcd3] call_rwsem_down_write_failed+0x13/0x20
> [ 2515.940934]  [8147edc5] ? down_write+0x45/0x50
> [ 2515.946244]  [81127b62] sys_mprotect+0xd2/0x240
> [ 2515.951640]  [81489412] system_call_fastpath+0x16/0x1b
> 
> 
> I tried to capture a perf trace while this was going on, but it
> never completed.  "ps" on this system reports lots of kernel threads
> and some user-space stuff, but hangs part way through - no ceph
> executables in the output, oddly.
> 

ps is probably locking up because it's trying to access a proc file for
a process that is not releasing the mmap_sem.

> I can retest your earlier patch for a longer period, to
> see if it does the same thing, or I can do some other thing
> if you tell me what it is.
> 
> Also, FWIW I sorted a little through SysRq-T output from such
> a system; these bits looked interesting:
> 
> [ 3663.685097] INFO: rcu_sched self-detected stall on CPU { 17}  (t=6 
> jiffies)
> [ 3663.685099] sending NMI to all CPUs:
> [ 3663.685101] NMI backtrace for cpu 0
> [ 3663.685102] CPU 0 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm 
> ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 
> dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net 
> macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm 
> crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode 
> serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en 
> mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 
> i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod 
> ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb 
> dca e1000 [last unloaded: scsi_wait_scan]
> [ 3663.685138]
> [ 3663.685140] Pid: 100027, comm: ceph-osd Not tainted 3.5.0-00019-g472719a 
> #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
> [ 3663.685142] RIP: 0010:[]  [] 
> _raw_spin_lock_irqsave+0x45/0x60
> [ 3663.685148] RSP: 0018:880a08191898  EFLAGS: 0012
> [ 3663.685149] RAX: 88063fffcb00 RBX: 88063fffcb00 RCX: 
> 00c5
> [ 3663.685149] RDX: 00bf RSI: 015a RDI: 
> 88063fffcb00
> [ 3663.685150] RBP: 880a081918a8 R08:  R09: 
> 
> [ 3663.685151] R10: 88063fffcb98 R11: 88063fffcc38 R12: 
> 0246
> [ 3663.685152] R13: 88063fffcba8 R14: 88063fffcb90 R15: 
> 88063fffc680
> [ 3663.685153] FS:  7fff90ae0700() GS:880627c0() 
> knlGS:
> [ 3663.685154] CS:  0010 DS:  ES:  CR0: 8005003b
> [ 3663.685155] CR2: ff600400 CR3: 0002b8fbe000 CR4: 
> 07f0
> [ 3663.685156] DR0:  DR1:  DR2: 
> 
> [ 3663.685157] DR3:  DR6: 0ff0 DR7: 
> 0400
> [ 3663.685158] Process ceph-osd (pid: 100027, threadinfo 880a0819, 
> task 880a9a29ae00)
> [ 3663.685158] Stack:
> [ 3663.685159]  130a  880a08191948 
> 8111a760
> [ 3663.685162]  81a13420 0009 


Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

2012-08-10 Thread Jim Schutt

On 08/10/2012 05:02 AM, Mel Gorman wrote:

On Thu, Aug 09, 2012 at 04:38:24PM -0600, Jim Schutt wrote:




Ok, this is an untested hack and I expect it would drop allocation success
rates again under load (but not as much). Can you test again and see what
effect, if any, it has please?

---8<---
mm: compaction: back out if contended

---




Initial testing with this patch looks very good from
my perspective; CPU utilization stays reasonable,
write-out rate stays high, no signs of stress.
Here's an example after ~10 minutes under my test load:



Hmmm, I wonder if I should have tested this patch longer,
in view of the trouble I ran into testing the new patch?
See below.



Excellent, so it is contention that is the problem.



I'll continue testing tomorrow to be sure nothing
shows up after continued testing.

If this passes your allocation success rate testing,
I'm happy with this performance for 3.6 - if not, I'll
be happy to test any further patches.



It does impair allocation success rates as I expected (they're still ok
but not as high as I'd like) so I implemented the following instead. It
attempts to back off when contention is detected or compaction is taking
too long. It does not back off as quickly as the first prototype did so
I'd like to see if it addresses your problem or not.


I really appreciate getting the chance to test out
your patchset.



I appreciate that you have a workload that demonstrates the problem and
will test patches. I will not abuse this and hope to keep the revisions
to a minimum.

Thanks.

---8<---
mm: compaction: Abort async compaction if locks are contended or taking too long



Hmmm, while testing this patch, a couple of my servers got
stuck after ~30 minutes or so, like this:

[ 2515.869936] INFO: task ceph-osd:30375 blocked for more than 120 seconds.
[ 2515.876630] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 2515.884447] ceph-osdD  0 30375  1 0x
[ 2515.891531]  8802e1a99e38 0082 88056b38e298 
8802e1a99fd8
[ 2515.899013]  8802e1a98010 8802e1a98000 8802e1a98000 
8802e1a98000
[ 2515.906482]  8802e1a99fd8 8802e1a98000 880697d31700 
8802e1a84500
[ 2515.913968] Call Trace:
[ 2515.916433]  [8147fded] schedule+0x5d/0x60
[ 2515.921417]  [81480b25] rwsem_down_failed_common+0x105/0x140
[ 2515.927938]  [81480b73] rwsem_down_write_failed+0x13/0x20
[ 2515.934195]  [8124bcd3] call_rwsem_down_write_failed+0x13/0x20
[ 2515.940934]  [8147edc5] ? down_write+0x45/0x50
[ 2515.946244]  [81127b62] sys_mprotect+0xd2/0x240
[ 2515.951640]  [81489412] system_call_fastpath+0x16/0x1b
[ 2515.957646] INFO: task ceph-osd:95698 blocked for more than 120 seconds.
[ 2515.964330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 2515.972141] ceph-osdD  0 95698  1 0x
[ 2515.979223]  8802b049fe38 0082 88056b38e2a0 
8802b049ffd8
[ 2515.986700]  8802b049e010 8802b049e000 8802b049e000 
8802b049e000
[ 2515.994176]  8802b049ffd8 8802b049e000 8809832ddc00 
880611592e00
[ 2516.001653] Call Trace:
[ 2516.004111]  [8147fded] schedule+0x5d/0x60
[ 2516.009072]  [81480b25] rwsem_down_failed_common+0x105/0x140
[ 2516.015589]  [81480b73] rwsem_down_write_failed+0x13/0x20
[ 2516.021861]  [8124bcd3] call_rwsem_down_write_failed+0x13/0x20
[ 2516.028555]  [8147edc5] ? down_write+0x45/0x50
[ 2516.033859]  [81127b62] sys_mprotect+0xd2/0x240
[ 2516.039248]  [81489412] system_call_fastpath+0x16/0x1b
[ 2516.045248] INFO: task ceph-osd:95699 blocked for more than 120 seconds.
[ 2516.051934] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 2516.059753] ceph-osdD  0 95699  1 0x
[ 2516.066832]  880c022d3dc8 0082 880c022d2000 
880c022d3fd8
[ 2516.074302]  880c022d2010 880c022d2000 880c022d2000 
880c022d2000
[ 2516.081784]  880c022d3fd8 880c022d2000 8806224cc500 
88096b64dc00
[ 2516.089254] Call Trace:
[ 2516.091702]  [8147fded] schedule+0x5d/0x60
[ 2516.096656]  [81480b25] rwsem_down_failed_common+0x105/0x140
[ 2516.103176]  [81480b73] rwsem_down_write_failed+0x13/0x20
[ 2516.109443]  [8124bcd3] call_rwsem_down_write_failed+0x13/0x20
[ 2516.116134]  [8147edc5] ? down_write+0x45/0x50
[ 2516.121442]  [8111362e] vm_mmap_pgoff+0x6e/0xb0
[ 2516.126861]  [8112486a] sys_mmap_pgoff+0x18a/0x190
[ 2516.132552]  [8124bd6e] ? trace_hardirqs_on_thunk+0x3a/0x3c
[ 2516.138985]  [81006b22] sys_mmap+0x22/0x30
[ 2516.143945]  [81489412] system_call_fastpath+0x16/0x1b
[ 2516.149949] INFO: task ceph-osd:95816 blocked for more than 120 seconds.
[ 2516.156632] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 2516.16] ceph-osdD  0 95816  1 0x
[ 2516.171521]  880332991e38 0082 880332991de8 
880332991fd8
[ 2516.178992]  880332990010 88033299 88033299 
88033299
[ 2516.186466]  880332991fd8 88033299 880697d31700 
880a92c32e00
[ 

Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3
2012-08-10 Thread Mel Gorman
On Thu, Aug 09, 2012 at 04:38:24PM -0600, Jim Schutt wrote:
> >>
> >
> >My conclusion looking at the vmstat data is that everything is looking ok
> >until system CPU usage goes through the roof. I'm assuming that's what we
> >are all still looking at.
> 
> I'm concerned about both the high CPU usage as well as the
> reduction in write-out rate, but I've been assuming the latter
> is caused by the former.
> 

Almost certainly.

> 
> 
> >
> >Ok, this is an untested hack and I expect it would drop allocation success
> >rates again under load (but not as much). Can you test again and see what
> >effect, if any, it has please?
> >
> >---8<---
> >mm: compaction: back out if contended
> >
> >---
> 
> 
> 
> Initial testing with this patch looks very good from
> my perspective; CPU utilization stays reasonable,
> write-out rate stays high, no signs of stress.
> Here's an example after ~10 minutes under my test load:
> 

Excellent, so it is contention that is the problem.

> 
> I'll continue testing tomorrow to be sure nothing
> shows up after continued testing.
> 
> If this passes your allocation success rate testing,
> I'm happy with this performance for 3.6 - if not, I'll
> be happy to test any further patches.
> 

It does impair allocation success rates as I expected (they're still ok
but not as high as I'd like) so I implemented the following instead. It
attempts to back off when contention is detected or compaction is taking
too long. It does not back off as quickly as the first prototype did so
I'd like to see if it addresses your problem or not.

> I really appreciate getting the chance to test out
> your patchset.
> 

I appreciate that you have a workload that demonstrates the problem and
will test patches. I will not abuse this and hope to keep the revisions
to a minimum.

Thanks.

---8<---
mm: compaction: Abort async compaction if locks are contended or taking too long

Jim Schutt reported a problem that pointed at compaction contending
heavily on locks. The workload is straightforward and, in his own words:

The systems in question have 24 SAS drives spread across 3 HBAs,
running 24 Ceph OSD instances, one per drive.  FWIW these servers
are dual-socket Intel 5675 Xeons w/48 GB memory.  I've got ~160
Ceph Linux clients doing dd simultaneously to a Ceph file system
backed by 12 of these servers.

Early in the test everything looks fine

procs ---memory-- ---swap-- -io 
--system-- -cpu---
 r  b   swpd   free   buff  cache   si   sobibo   in   
cs  us sy  id wa st
31 15  0 287216576   3860662800 2  11582   
14   1  3  95  0  0
27 15  0 225288576   385833840018 016 
203357 134876  11 56  17 15  0
28 17  0 219256576   385447360011 2305932 
203141 146296  11 49  23 17  0
 6 18  0 215596576   3855287200 7 2363207 
215264 166502  12 45  22 20  0
22 18  0 226984576   3859640400 3 2445741 
223114 179527  12 43  23 22  0

and then it goes to pot

procs ---memory-- ---swap-- -io 
--system-- -cpu---
 r  b   swpd   free   buff  cache   si   sobibo   in   
cs  us sy  id wa st
163  8  0 464308576   367913680011 22210  866  
536   3 13  79  4  0
207 14  0 917752576   3618192800   712 1345376 
134598 47367   7 90   1  2  0
123 12  0 685516576   3629614800   429 1386615 
158494 60077   8 84   5  3  0
123 12  0 598572576   3633372800  1107 1233281 
147542 62351   7 84   5  4  0
622  7  0 660768576   3611826400   557 1345548 
151394 59353   7 85   4  3  0
223 11  0 283960576   364638680046 1107160 
121846 33006   6 93   1  1  0

Note that system CPU usage is very high and blocks being written out have
dropped by 42%. He analysed this with perf and found:

  perf record -g -a sleep 10
  perf report --sort symbol --call-graph fractal,5
34.63%  [k] _raw_spin_lock_irqsave
|
|--97.30%-- isolate_freepages
|  compaction_alloc
|  unmap_and_move
|  migrate_pages
|  compact_zone
|  compact_zone_order
|  try_to_compact_pages
|  __alloc_pages_direct_compact
|  __alloc_pages_slowpath
|  __alloc_pages_nodemask
|  alloc_pages_vma
|  do_huge_pmd_anonymous_page
|  handle_mm_fault
|  do_page_fault
|  page_fault
|  |
|  |--87.39%-- skb_copy_datagram_iovec
|  |  tcp_recvmsg


Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

2012-08-09 Thread Jim Schutt

On 08/09/2012 02:46 PM, Mel Gorman wrote:
> On Thu, Aug 09, 2012 at 12:16:35PM -0600, Jim Schutt wrote:
>> On 08/09/2012 07:49 AM, Mel Gorman wrote:
>>> Changelog since V2
>>> o Capture !MIGRATE_MOVABLE pages where possible
>>> o Document the treatment of MIGRATE_MOVABLE pages while capturing
>>> o Expand changelogs
>>>
>>> Changelog since V1
>>> o Dropped kswapd related patch, basically a no-op and regresses if fixed (minchan)
>>> o Expanded changelogs a little
>>>
>>> Allocation success rates have been far lower since 3.4 due to commit
>>> [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
>>> commit was introduced for good reasons and it was known in advance that
>>> the success rates would suffer but it was justified on the grounds that
>>> the high allocation success rates were achieved by aggressive reclaim.
>>> Success rates are expected to suffer even more in 3.6 due to commit
>>> [7db8889a: mm: have order > 0 compaction start off where it left] which
>>> testing has shown to severely reduce allocation success rates under load -
>>> to 0% in one case.  There is a proposed change to that patch in this series
>>> and it would be ideal if Jim Schutt could retest the workload that led to
>>> commit [7db8889a: mm: have order > 0 compaction start off where it left].
>>
>> On my first test of this patch series on top of 3.5, I ran into an
>> instance of what I think is the sort of thing that patch 4/5 was
>> fixing.  Here's what vmstat had to say during that period:
>>
>> <SNIP>
>
> My conclusion looking at the vmstat data is that everything is looking ok
> until system CPU usage goes through the roof. I'm assuming that's what we
> are all still looking at.

I'm concerned about both the high CPU usage as well as the
reduction in write-out rate, but I've been assuming the latter
is caused by the former.

> <SNIP>
>
> Ok, this is an untested hack and I expect it would drop allocation success
> rates again under load (but not as much). Can you test again and see what
> effect, if any, it has please?
>
> ---8<---
> mm: compaction: back out if contended
>
> ---

<SNIP>

Initial testing with this patch looks very good from
my perspective; CPU utilization stays reasonable,
write-out rate stays high, no signs of stress.
Here's an example after ~10 minutes under my test load:

2012-08-09 16:26:07.550-06:00
vmstat -w 4 16
procs -----------memory---------- ---swap-- -----io---- --system-- ------cpu-----
  r  b   swpd    free  buff     cache   si   so    bi      bo     in     cs  us sy  id wa st
 21 19      0  351628   576  37835440    0    0    17   44394   1241    653   6 20  64  9  0
 11 11      0  365520   576  37893060    0    0   124 2121508 203450 170957  12 46  25 17  0
 13 16      0  359888   576  37954456    0    0    98 2185033 209473 171571  13 44  25 18  0
 17 15      0  353728   576  38010536    0    0    89 2170971 208052 167988  13 43  26 18  0
 17 16      0  349732   576  38048284    0    0   135 2217752 218754 174170  13 49  21 16  0
 43 13      0  343280   576  38046500    0    0   153 2207135 217872 179519  13 47  23 18  0
 26 13      0  350968   576  37937184    0    0   147 2189822 214276 176697  13 47  23 17  0
  4 12      0  350080   576  37958364    0    0   226 2145212 207077 172163  12 44  24 20  0
 15 13      0  353124   576  37921040    0    0   145 2078422 197231 166381  12 41  30 17  0
 14 15      0  348964   576  37949588    0    0   107 2020853 188192 164064  12 39  30 20  0
 21  9      0  354784   576  37951228    0    0   117 2148090 204307 165609  13 48  22 18  0
 36 16      0  347368   576  37989824    0    0   166 2208681 216392 178114  13 47  24 16  0
 28 15      0  300656   576  38060912    0    0   164 2181681 214618 175132  13 45  24 18  0
  9 16      0  295484   576  38092184    0    0   153 2156909 218993 180289  13 43  27 17  0
 17 16      0  346760   576  37979008    0    0   165 2124168 198730 173455  12 44  27 18  0
 14 17      0  360988   576  37957136    0    0   142 2092248 197430 168199  12 42  29 17  0

I'll continue testing tomorrow to be sure nothing
shows up after continued testing.

If this passes your allocation success rate testing,
I'm happy with this performance for 3.6 - if not, I'll
be happy to test any further patches.

I really appreciate getting the chance to test out
your patchset.

Thanks -- Jim

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

2012-08-09 Thread Mel Gorman
On Thu, Aug 09, 2012 at 12:16:35PM -0600, Jim Schutt wrote:
> On 08/09/2012 07:49 AM, Mel Gorman wrote:
> >Changelog since V2
> >o Capture !MIGRATE_MOVABLE pages where possible
> >o Document the treatment of MIGRATE_MOVABLE pages while capturing
> >o Expand changelogs
> >
> >Changelog since V1
> >o Dropped kswapd related patch, basically a no-op and regresses if fixed 
> >(minchan)
> >o Expanded changelogs a little
> >
> >Allocation success rates have been far lower since 3.4 due to commit
> >[fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
> >commit was introduced for good reasons and it was known in advance that
> >the success rates would suffer but it was justified on the grounds that
> >the high allocation success rates were achieved by aggressive reclaim.
> >Success rates are expected to suffer even more in 3.6 due to commit
> >[7db8889a: mm: have order > 0 compaction start off where it left] which
> >testing has shown to severely reduce allocation success rates under load -
> >to 0% in one case.  There is a proposed change to that patch in this series
> >and it would be ideal if Jim Schutt could retest the workload that led to
> >commit [7db8889a: mm: have order > 0 compaction start off where it left].
> 
> On my first test of this patch series on top of 3.5, I ran into an
> instance of what I think is the sort of thing that patch 4/5 was
> fixing.  Here's what vmstat had to say during that period:
> 
> <SNIP>

My conclusion looking at the vmstat data is that everything is looking ok
until system CPU usage goes through the roof. I'm assuming that's what we
are all still looking at.

I am still concerned that what patch 4/5 was actually doing was bypassing
compaction almost entirely in the contended case, which "works" but is not
exactly the expected behaviour.

> And here's a perf report, captured/displayed with
>   perf record -g -a sleep 10
>   perf report --sort symbol --call-graph fractal,5
> sometime during that period just after 12:00:09, when
> the run queue was > 100.
> 
> --
> 
> Processed 0 events and LOST 1175296!
> 
> <SNIP>
> #
> 34.63%  [k] _raw_spin_lock_irqsave
> |
> |--97.30%-- isolate_freepages
> |  compaction_alloc
> |  unmap_and_move
> |  migrate_pages
> |  compact_zone
> |  compact_zone_order
> |  try_to_compact_pages
> |  __alloc_pages_direct_compact
> |  __alloc_pages_slowpath
> |  __alloc_pages_nodemask
> |  alloc_pages_vma
> |  do_huge_pmd_anonymous_page
> |  handle_mm_fault
> |  do_page_fault
> |  page_fault
> |  |
> |  |--87.39%-- skb_copy_datagram_iovec
> |  |  tcp_recvmsg
> |  |  inet_recvmsg
> |  |  sock_recvmsg
> |  |  sys_recvfrom
> |  |  system_call
> |  |  __recv
> |  |  |
> |  |   --100.00%-- (nil)
> |  |
> |   --12.61%-- memcpy
>  --2.70%-- [...]

So lets just consider this. My interpretation of that is that we are
receiving data from the network and copying it into a buffer that is
faulted for the first time and backed by THP.

All good so far *BUT* we are contending like crazy on the zone lock and
probably blocking normal page allocations in the meantime.

> 
> 14.31%  [k] _raw_spin_lock_irq
> |
> |--98.08%-- isolate_migratepages_range

This is a variation of the same problem but on the LRU lock this time.

> <SNIP>
> 
> --
> 
> If I understand what this is telling me, skb_copy_datagram_iovec
> is responsible for triggering the calls to isolate_freepages_block,
> isolate_migratepages_range, and isolate_freepages?
> 

Sortof. I do not think it's the jumbo frames that are doing it, it's the
faulting of the buffer it copies to.

> FWIW, I'm using a Chelsio T4 NIC in these hosts, with jumbo frames
> and the Linux TCP stack (i.e., no stateful TCP offload).
> 

Ok, this is an untested hack and I expect it would drop allocation success
rates again under load (but not as much). Can you test again and see what
effect, if any, it has please?

---8<---
mm: compaction: back out if contended

---
 include/linux/compaction.h |    4 ++--
 mm/compaction.c            |   45 ++--
 mm/internal.h              |    1 +
 mm/page_alloc.c            |   13 +
 4 files changed, 51 insertions(+), 12 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 5673459..9c94cba 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -22,7 +22,7 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
 extern int 

Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

2012-08-09 Thread Jim Schutt

On 08/09/2012 07:49 AM, Mel Gorman wrote:
> Changelog since V2
> o Capture !MIGRATE_MOVABLE pages where possible
> o Document the treatment of MIGRATE_MOVABLE pages while capturing
> o Expand changelogs
>
> Changelog since V1
> o Dropped kswapd related patch, basically a no-op and regresses if fixed (minchan)
> o Expanded changelogs a little
>
> Allocation success rates have been far lower since 3.4 due to commit
> [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
> commit was introduced for good reasons and it was known in advance that
> the success rates would suffer but it was justified on the grounds that
> the high allocation success rates were achieved by aggressive reclaim.
> Success rates are expected to suffer even more in 3.6 due to commit
> [7db8889a: mm: have order > 0 compaction start off where it left] which
> testing has shown to severely reduce allocation success rates under load -
> to 0% in one case.  There is a proposed change to that patch in this series
> and it would be ideal if Jim Schutt could retest the workload that led to
> commit [7db8889a: mm: have order > 0 compaction start off where it left].

On my first test of this patch series on top of 3.5, I ran into an
instance of what I think is the sort of thing that patch 4/5 was
fixing.  Here's what vmstat had to say during that period:

--

2012-08-09 11:58:04.107-06:00
vmstat -w 4 16
procs -----------memory---------- ---swap-- -----io---- --system-- ------cpu-----
  r  b   swpd    free  buff     cache   si   so    bi      bo     in     cs  us sy  id wa st
 20 14      0  235884   576  38916072    0    0    12   17047    171    133   3  8  85  4  0
 18 17      0  220272   576  38955912    0    0    86 2131838 200142 162956  12 38  31 19  0
 17  9      0  244284   576  38955328    0    0    19 2179562 213775 167901  13 43  26 18  0
 27 15      0  223036   576  38952640    0    0    24 2202816 217996 158390  14 47  25 15  0
 17 16      0  233124   576  38959908    0    0     5 2268815 224647 165728  14 50  21 15  0
 16 13      0  225840   576  38995740    0    0    52 2253829 216797 160551  14 47  23 16  0
 22 13      0  260584   576  38982908    0    0    92 2196737 211694 140924  14 53  19 15  0
 16 10      0  235784   576  38917128    0    0    22 2157466 210022 137630  14 54  19 14  0
 12 13      0  214300   576  38923848    0    0    31 2187735 213862 142711  14 52  20 14  0
 25 12      0  219528   576  38919540    0    0    11 2066523 205256 142080  13 49  23 15  0
 26 14      0  229460   576  38913704    0    0    49 2108654 200692 135447  13 51  21 15  0
 11 11      0  220376   576  38862456    0    0    45 2136419 207493 146813  13 49  22 16  0
 36 12      0  229860   576  38869784    0    0     7 2163463 212223 151812  14 47  25 14  0
 16 13      0  238356   576  38891496    0    0    67 2251650 221728 154429  14 52  20 14  0
 65 15      0  211536   576  38922108    0    0    59 2237925 224237 156587  14 53  19 14  0
 24 13      0  585024   576  38634024    0    0    37 2240929 229040 148192  15 61  14 10  0

2012-08-09 11:59:04.714-06:00
vmstat -w 4 16
procs -----------memory---------- ---swap-- -----io---- --system-- ------cpu-----
  r  b   swpd    free  buff     cache   si   so    bi      bo     in     cs  us sy  id wa st
 43  8      0  794392   576  38382316    0    0    11   20491    576    420   3 10  82  4  0
127  6      0  579328   576  38422156    0    0    21 2006775 205582 119660  12 70  11  7  0
 44  5      0  492860   576  38512360    0    0    46 1536525 173377  85320  10 78   7  4  0
218  9      0  585668   576  38271320    0    0    39 1257266 152869  64023   8 83   7  3  0
101  6      0  600168   576  38128104    0    0    10 1438705 160769  68374   9 84   5  3  0
 62  5      0  597004   576  38098972    0    0    93 1376841 154012  63912   8 82   7  4  0
 61 11      0  850396   576  37808772    0    0    46 1186816 145731  70453   7 78   9  6  0
124  7      0  437388   576  38126320    0    0    15 1208434 149736  57142   7 86   4  3  0
204 11      0 1105816   576  37309532    0    0    20 1327833 145979  52718   7 87   4  2  0
 29  8      0  751020   576  37360332    0    0     8 1405474 169916  61982   9 85   4  2  0
 38  7      0  626448   576  37333244    0    0    14 1328415 174665  74214   8 84   5  3  0
 23  5      0  650040   576  37134280    0    0    28 1351209 179220  71631   8 85   5  2  0
 40 10      0  610988   576  37054292    0    0   104 1272527 167530  73527   7 85   5  3  0
 79 22      0 2076836   576  35487340    0    0   750 1249934 175420  70124   7 88   3  2  0
 58  6

Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

2012-08-09 Thread Mel Gorman
On Thu, Aug 09, 2012 at 08:36:12AM -0600, Jim Schutt wrote:
> Hi Mel,
> 
> On 08/09/2012 07:49 AM, Mel Gorman wrote:
> >Changelog since V2
> >o Capture !MIGRATE_MOVABLE pages where possible
> >o Document the treatment of MIGRATE_MOVABLE pages while capturing
> >o Expand changelogs
> >
> >Changelog since V1
> >o Dropped kswapd related patch, basically a no-op and regresses if fixed 
> >(minchan)
> >o Expanded changelogs a little
> >
> >Allocation success rates have been far lower since 3.4 due to commit
> >[fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
> >commit was introduced for good reasons and it was known in advance that
> >the success rates would suffer but it was justified on the grounds that
> >the high allocation success rates were achieved by aggressive reclaim.
> >Success rates are expected to suffer even more in 3.6 due to commit
> >[7db8889a: mm: have order > 0 compaction start off where it left] which
> >testing has shown to severely reduce allocation success rates under load -
> >to 0% in one case.  There is a proposed change to that patch in this series
> >and it would be ideal if Jim Schutt could retest the workload that led to
> >commit [7db8889a: mm: have order > 0 compaction start off where it left].
> 
> I was successful at resolving my Ceph issue on 3.6-rc1, but ran
> into some other issue that isn't immediately obvious, and prevents
> me from testing your patch with 3.6-rc1.  Today I will apply your
> patch series to 3.5 and test that way.
> 
> Sorry for the delay.
> 

No need to be sorry at all. I appreciate you taking the time and as
there were revisions since V1 you were better off waiting even if you
did not have the Ceph issue!

Thanks.

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

2012-08-09 Thread Jim Schutt

Hi Mel,

On 08/09/2012 07:49 AM, Mel Gorman wrote:
> Changelog since V2
> o Capture !MIGRATE_MOVABLE pages where possible
> o Document the treatment of MIGRATE_MOVABLE pages while capturing
> o Expand changelogs
>
> Changelog since V1
> o Dropped kswapd related patch, basically a no-op and regresses if fixed (minchan)
> o Expanded changelogs a little
>
> Allocation success rates have been far lower since 3.4 due to commit
> [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
> commit was introduced for good reasons and it was known in advance that
> the success rates would suffer but it was justified on the grounds that
> the high allocation success rates were achieved by aggressive reclaim.
> Success rates are expected to suffer even more in 3.6 due to commit
> [7db8889a: mm: have order > 0 compaction start off where it left] which
> testing has shown to severely reduce allocation success rates under load -
> to 0% in one case.  There is a proposed change to that patch in this series
> and it would be ideal if Jim Schutt could retest the workload that led to
> commit [7db8889a: mm: have order > 0 compaction start off where it left].

I was successful at resolving my Ceph issue on 3.6-rc1, but ran
into some other issue that isn't immediately obvious, and prevents
me from testing your patch with 3.6-rc1.  Today I will apply your
patch series to 3.5 and test that way.

Sorry for the delay.

-- Jim


