Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-12-05 Thread Srivatsa S. Bhat
Hi Jianguo,

On 12/04/2012 04:21 PM, wujianguo wrote:
> Hi Srivatsa,
> 
> I applied this patchset and ran the genload test (from LTP): numactl --membind=1 ./genload -m 100,
> then got a "general protection fault", and the system rebooted.
> 
> If I revert [RFC PATCH 7/8] and run this test again, genload will be killed due to OOM,
> but the system is OK, no coredump.
> 

Sorry for the delay in replying. Thanks a lot for testing and for the bug report!
I could recreate the issue on one of my machines using the LTP test you mentioned.
I'll try to dig in and find out what is going wrong.

Regards,
Srivatsa S. Bhat

> ps: node1 has 8G memory.
> 
> [...]

Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-12-04 Thread wujianguo
Hi Srivatsa,

I applied this patchset and ran the genload test (from LTP): numactl --membind=1 ./genload -m 100,
then got a "general protection fault", and the system rebooted.

If I revert [RFC PATCH 7/8] and run this test again, genload will be killed due to OOM,
but the system is OK, no coredump.

ps: node1 has 8G memory.

[ 3647.020666] general protection fault:  [#1] SMP
[ 3647.026232] Modules linked in: edd cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse vfat fat loop dm_mod coretemp kvm crc32c_intel ixgbe ipv6 i7core_edac igb iTCO_wdt i2c_i801 iTCO_vendor_support ioatdma edac_core tpm_tis joydev lpc_ich i2c_core microcode mfd_core rtc_cmos pcspkr sr_mod tpm sg dca hid_generic mdio tpm_bios cdrom button ext3 jbd mbcache usbhid hid uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif processor thermal_sys hwmon scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh_emc scsi_dh ata_generic ata_piix libata megaraid_sas scsi_mod
[ 3647.084565] CPU 19
[ 3647.086709] Pid: 33708, comm: genload Not tainted 3.7.0-rc7-mem-region+ #11 QCI QSSC-S4R/QSSC-S4R
[ 3647.096799] RIP: 0010:[8110979c]  [8110979c] add_to_freelist+0x8c/0x100
[ 3647.106125] RSP: :880a7f6c3e58  EFLAGS: 00010086
[ 3647.112042] RAX: dead00200200 RBX: 0001 RCX: 
[ 3647.119990] RDX: ea001211a3a0 RSI: ea001211ffa0 RDI: 0001
[ 3647.127936] RBP: 880a7f6c3e58 R08: 88067ff6d240 R09: 88067ff6b180
[ 3647.135884] R10: 0002 R11: 0001 R12: 07fe
[ 3647.143831] R13: 0001 R14: 0001 R15: ea001211ff80
[ 3647.151778] FS:  7f0b2a674700() GS:880a7f6c() knlGS:0000
[ 3647.160790] CS:  0010 DS:  ES:  CR0: 8005003b
[ 3647.167188] CR2: 7f0b1a00 CR3: 000484723000 CR4: 07e0
[ 3647.175136] DR0:  DR1:  DR2: 
[ 3647.183083] DR3:  DR6: 0ff0 DR7: 0400
[ 3647.191030] Process genload (pid: 33708, threadinfo 8806852bc000, task ffff880688288000)
[ 3647.200428] Stack:
[ 3647.202667]  880a7f6c3f08 8110e9c0 88067ff66100 07fe
[ 3647.210954]  880a7f6d5bb0 0030 2030 88067ff66168
[ 3647.219244]  0002 880a7f6d5b78 000e88288000 88067ff66100
[ 3647.227530] Call Trace:
[ 3647.230252]  <IRQ>
[ 3647.232394]  [8110e9c0] free_pcppages_bulk+0x350/0x450
[ 3647.239297]  [8110f0d0] ? drain_pages+0xd0/0xd0
[ 3647.245313]  [8110f0c3] drain_pages+0xc3/0xd0
[ 3647.251135]  [8110f0e6] drain_local_pages+0x16/0x20
[ 3647.257540]  [810a3bce] generic_smp_call_function_interrupt+0xae/0x260
[ 3647.265783]  [810282c7] smp_call_function_interrupt+0x27/0x40
[ 3647.273156]  [8147f272] call_function_interrupt+0x72/0x80
[ 3647.280136]  <EOI>
[ 3647.282278]  [81077936] ? mutex_spin_on_owner+0x76/0xa0
[ 3647.289292]  [81473116] __mutex_lock_slowpath+0x66/0x180
[ 3647.296181]  [8113afe7] ? try_to_unmap_one+0x277/0x440
[ 3647.302872]  [81472b93] mutex_lock+0x23/0x40
[ 3647.308595]  [8113b657] rmap_walk+0x137/0x240
[ 3647.314417]  [8115c230] ? get_page+0x40/0x40
[ 3647.320133]  [8115d036] move_to_new_page+0xb6/0x110
[ 3647.326526]  [8115d452] __unmap_and_move+0x192/0x230
[ 3647.333023]  [8115d612] unmap_and_move+0x122/0x140
[ 3647.339328]  [8115d6c9] migrate_pages+0x99/0x150
[ 3647.345433]  [81129f10] ? isolate_freepages+0x220/0x220
[ 3647.352220]  [8112ace2] compact_zone+0x2f2/0x5d0
[ 3647.358332]  [8112b4a0] try_to_compact_pages+0x180/0x240
[ 3647.365218]  [8110f1e7] __alloc_pages_direct_compact+0x97/0x200
[ 3647.372780]  [810a45a3] ? on_each_cpu_mask+0x63/0xb0
[ 3647.379279]  [8110f84f] __alloc_pages_slowpath+0x4ff/0x780
[ 3647.386349]  [8110fbf1] __alloc_pages_nodemask+0x121/0x180
[ 3647.393430]  [811500d6] alloc_pages_vma+0xd6/0x170
[ 3647.399737]  [81162198] do_huge_pmd_anonymous_page+0x148/0x210
[ 3647.407203]  [81132f6b] handle_mm_fault+0x33b/0x340
[ 3647.413609]  [814799d3] __do_page_fault+0x2a3/0x4e0
[ 3647.420017]  [8126316a] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3647.427290]  [81479c1e] do_page_fault+0xe/0x10
[ 3647.433208]  [81475f68] page_fault+0x28/0x30
[ 3647.438921] Code: 8d 78 01 48 89 f8 48 c1 e0 04 49 8d 04 00 48 8b 50 08 48 83 40 10 01 48 85 d2 74 1b 48 8b 42 08 48 89 72 08 48 89 16 48 89 46 08 <48> 89 30 c9 c3 0f 1f 80 00 00 00 00 4d 3b 00 74 4b 83 e9 01 79
[ 3647.460607] RIP  [8110979c] add_to_freelist+0x8c/0x100
[ 3647.467308]  RSP 880a7f6c3e58
[0.00] Linux version 3.7.0-rc7-mem-region+ (root@linux-intel) (gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #11 SMP Tue Dec 4 15:23:15 CST 2012
.

Thanks,
Jianguo Wu

On 2012-11-7 3:52, Srivatsa S. Bhat wrote:
> Hi,
> 
> This is an alternative design for Memory Power Management, developed based on
> some of the suggestions[1] received during the review of the earlier patchset
> ("Hierarchy" design) on Memory Power Management[2]. This alters the 
> buddy-lists
> to keep them region-sorted, and is hence identified as the "Sorted-buddy" 
> design.
> 
> One of the key aspects of this design is that it avoids the zone-fragmentation
> problem that was present in the earlier 

Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-16 Thread Srivatsa S. Bhat
On 11/09/2012 10:22 PM, Srivatsa S. Bhat wrote:
> On 11/09/2012 10:13 PM, Srivatsa S. Bhat wrote:
>> On 11/09/2012 10:04 PM, Srivatsa S. Bhat wrote:
>>> On 11/09/2012 09:43 PM, Dave Hansen wrote:
 On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote:
> FWIW, kernbench is actually (and surprisingly) showing a slight 
> performance
> *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
> my other email to Dave.
>
> https://lkml.org/lkml/2012/11/7/428
>
> I don't think I can dismiss it as an experimental error, because I am 
> seeing
> those results consistently.. I'm trying to find out what's behind that.

 The only numbers in that link are in the date. :)  Let's see the
 numbers, please.

>>>
>>> Sure :) The reason I didn't post the numbers very eagerly was that I didn't
>>> want it to look ridiculous if it later turned out to be really an error in 
>>> the
>>> experiment ;) But since I have seen it happening consistently I think I can
>>> post the numbers here with some non-zero confidence.
>>>
 If you really have performance improvement to the memory allocator (or
 something else) here, then surely it can be pared out of your patches
 and merged quickly by itself.  Those kinds of optimizations are hard to
 come by!

>>>
>>> :-)
>>>
>>> Anyway, here it goes:
>>>
>>> Test setup:
>>> --
>>> x86 2-socket quad-core machine. (CONFIG_NUMA=n because I figured that my
>>> patchset might not handle NUMA properly). Mem region size = 512 MB.
>>>
>>
>> For CONFIG_NUMA=y on the same machine, the difference between the 2 kernels
>> was much smaller, but nevertheless this patchset performed better. I wouldn't
>> vouch that my patchset handles NUMA correctly, but here are the numbers from
>> that run anyway (at least to show that I really found the results to be
>> repeatable):
>>

I fixed up the NUMA case (I'll post the updated patch for that soon) and
ran a fresh set of kernbench runs. The difference between mainline and this
patchset is quite tiny; so we can't really say that this patchset shows a
performance improvement over mainline. However, I can safely conclude that
this patchset doesn't show any performance _degradation_ w.r.t mainline
in kernbench.

Results from one of the recent kernbench runs:
-

Kernbench log for Vanilla 3.7-rc3
=
Kernel: 3.7.0-rc3
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 330.39 (0.746257)
User Time 4283.63 (3.39617)
System Time 604.783 (2.72629)
Percent CPU 1479 (3.60555)
Context Switches 845634 (6031.22)
Sleeps 833655 (6652.17)


Kernbench log for Sorted-buddy
==
Kernel: 3.7.0-rc3-sorted-buddy
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 329.967 (2.76789)
User Time 4230.02 (2.15324)
System Time 599.793 (1.09988)
Percent CPU 1463.33 (11.3725)
Context Switches 840530 (1646.75)
Sleeps 833732 (2227.68)
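
(For scale: the elapsed-time difference here is 330.39 - 329.967 ≈ 0.4 s, i.e. roughly 0.1%, well within the reported standard deviations of 0.75 and 2.77.)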

Regards,
Srivatsa S. Bhat



Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-12 Thread Srivatsa S. Bhat
Hi Srinivas,

It looks like your email did not get delivered to the mailing
lists (and the people in the CC list) properly. So quoting your
entire mail as-it-is here. And thanks a lot for taking a look
at this patchset!

Regards,
Srivatsa S. Bhat

On 11/09/2012 10:18 PM, SrinivasPandruvada wrote:
> I do like this implementation and think it is valuable.
> I am experimenting with it on one of our HW platforms. This type of partitioning
> does help in saving power. We believe we can save up to 1 W of power per DIMM
> with the help of some HW/BIOS changes. We are only talking about
> content-preserving memory, so we don't have to be 100% correct.
> In my experiments, I tried two methods:
> - Similar to the approach suggested by Mel Gorman: I have a special sticky
> migrate type like CMA.
> - Buddy buckets: buddies are organized into memory-region-aware buckets.
> During allocation it prefers higher-order buckets. I made sure that there is
> no effect from my change if there are no power-saving memory DIMMs. The advantage
> of these buckets is that I can keep the memory in close proximity for related
> task groups by direct hashing to a bucket. The free list is organized as a
> two-dimensional array with bucket and migrate type for each order.
> 
> In both methods, reclaim is currently targeted to be done via a sysfs interface,
> similar to per-node memory compaction, allowing user space to initiate reclaim.
> 
> Thanks,
> Srinivas Pandruvada
> Open Source Technology Center,
> Intel Corp.
> 
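
Purely as an illustration of the two-dimensional free-list layout described above (this is not Srinivas's actual code; NR_BUCKETS, NR_MIGRATE_TYPES and task_to_bucket() are made-up placeholders), the data structure might be declared along these lines:

/*
 * Illustration only -- a per-order free area indexed by
 * (bucket, migratetype) instead of migratetype alone.
 */
#define MAX_ORDER         11   /* buddy orders, as in the mainline allocator */
#define NR_BUCKETS         8   /* assumption: one bucket per memory-region group */
#define NR_MIGRATE_TYPES   5   /* assumption: number of migrate types */

struct list_node {
    struct list_node *next, *prev;
};

struct bucketed_free_area {
    struct list_node free_list[NR_BUCKETS][NR_MIGRATE_TYPES];
    unsigned long    nr_free;
};

struct bucketed_zone_freelists {
    struct bucketed_free_area area[MAX_ORDER];
};

/* Assumed direct hash from a task group to its preferred bucket, so that
 * pages of related tasks stay in close proximity. */
static inline unsigned int task_to_bucket(unsigned long tgid)
{
    return (unsigned int)(tgid % NR_BUCKETS);
}

An allocation would look in free_list[task_to_bucket(tgid)][migratetype] for the requested order first, and consult other buckets only as a fallback.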



Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-09 Thread Srivatsa S. Bhat
On 11/09/2012 10:13 PM, Srivatsa S. Bhat wrote:
> On 11/09/2012 10:04 PM, Srivatsa S. Bhat wrote:
>> On 11/09/2012 09:43 PM, Dave Hansen wrote:
>>> On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote:
 FWIW, kernbench is actually (and surprisingly) showing a slight performance
 *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
 my other email to Dave.

 https://lkml.org/lkml/2012/11/7/428

 I don't think I can dismiss it as an experimental error, because I am 
 seeing
 those results consistently.. I'm trying to find out what's behind that.
>>>
>>> The only numbers in that link are in the date. :)  Let's see the
>>> numbers, please.
>>>
>>
>> Sure :) The reason I didn't post the numbers very eagerly was that I didn't
>> want it to look ridiculous if it later turned out to be really an error in 
>> the
>> experiment ;) But since I have seen it happening consistently I think I can
>> post the numbers here with some non-zero confidence.
>>
>>> If you really have performance improvement to the memory allocator (or
>>> something else) here, then surely it can be pared out of your patches
>>> and merged quickly by itself.  Those kinds of optimizations are hard to
>>> come by!
>>>
>>
>> :-)
>>
>> Anyway, here it goes:
>>
>> Test setup:
>> --
>> x86 2-socket quad-core machine. (CONFIG_NUMA=n because I figured that my
>> patchset might not handle NUMA properly). Mem region size = 512 MB.
>>
> 
> For CONFIG_NUMA=y on the same machine, the difference between the 2 kernels
> was much smaller, but nevertheless this patchset performed better. I wouldn't
> vouch that my patchset handles NUMA correctly, but here are the numbers from
> that run anyway (at least to show that I really found the results to be
> repeatable):
> 
> Kernbench log for Vanilla 3.7-rc3
> =
> Kernel: 3.7.0-rc3-vanilla-numa-default
> Average Optimal load -j 32 Run (std deviation):
> Elapsed Time 589.058 (0.596171)
> User Time 7461.26 (1.69702)
> System Time 1072.03 (1.54704)
> Percent CPU 1448.2 (1.30384)
> Context Switches 2.14322e+06 (4042.97)
> Sleeps 1847230 (2614.96)
> 
> Kernbench log for Vanilla 3.7-rc3
> =

Oops, that title must have been "for sorted-buddy patchset" of course..

> Kernel: 3.7.0-rc3-sorted-buddy-numa-default
> Average Optimal load -j 32 Run (std deviation):
> Elapsed Time 577.182 (0.713772)
> User Time 7315.43 (3.87226)
> System Time 1043 (1.12855)
> Percent CPU 1447.6 (2.19089)
> Context Switches 2117022 (3810.15)
> Sleeps 1.82966e+06 (4149.82)
> 
> 

Regards,
Srivatsa S. Bhat



Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-09 Thread Srivatsa S. Bhat
On 11/09/2012 10:04 PM, Srivatsa S. Bhat wrote:
> On 11/09/2012 09:43 PM, Dave Hansen wrote:
>> On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote:
>>> FWIW, kernbench is actually (and surprisingly) showing a slight performance
>>> *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
>>> my other email to Dave.
>>>
>>> https://lkml.org/lkml/2012/11/7/428
>>>
>>> I don't think I can dismiss it as an experimental error, because I am seeing
>>> those results consistently.. I'm trying to find out what's behind that.
>>
>> The only numbers in that link are in the date. :)  Let's see the
>> numbers, please.
>>
> 
> Sure :) The reason I didn't post the numbers very eagerly was that I didn't
> want it to look ridiculous if it later turned out to be really an error in the
> experiment ;) But since I have seen it happening consistently I think I can
> post the numbers here with some non-zero confidence.
> 
>> If you really have performance improvement to the memory allocator (or
>> something else) here, then surely it can be pared out of your patches
>> and merged quickly by itself.  Those kinds of optimizations are hard to
>> come by!
>>
> 
> :-)
> 
> Anyway, here it goes:
> 
> Test setup:
> --
> x86 2-socket quad-core machine. (CONFIG_NUMA=n because I figured that my
> patchset might not handle NUMA properly). Mem region size = 512 MB.
> 

For CONFIG_NUMA=y on the same machine, the difference between the 2 kernels
was much smaller, but nevertheless this patchset performed better. I wouldn't
vouch that my patchset handles NUMA correctly, but here are the numbers from
that run anyway (at least to show that I really found the results to be
repeatable):

Kernbench log for Vanilla 3.7-rc3
=
Kernel: 3.7.0-rc3-vanilla-numa-default
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 589.058 (0.596171)
User Time 7461.26 (1.69702)
System Time 1072.03 (1.54704)
Percent CPU 1448.2 (1.30384)
Context Switches 2.14322e+06 (4042.97)
Sleeps 1847230 (2614.96)

Kernbench log for Vanilla 3.7-rc3
=
Kernel: 3.7.0-rc3-sorted-buddy-numa-default
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 577.182 (0.713772)
User Time 7315.43 (3.87226)
System Time 1043 (1.12855)
Percent CPU 1447.6 (2.19089)
Context Switches 2117022 (3810.15)
Sleeps 1.82966e+06 (4149.82)
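
(For scale: the elapsed-time delta between the two NUMA runs above is 589.058 - 577.182 ≈ 11.9 s, i.e. roughly 2%.)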


Regards,
Srivatsa S. Bhat

> Kernbench log for Vanilla 3.7-rc3
> =
> 
> Kernel: 3.7.0-rc3-vanilla-default
> Average Optimal load -j 32 Run (std deviation):
> Elapsed Time 650.742 (2.49774)
> User Time 8213.08 (17.6347)
> System Time 1273.91 (6.00643)
> Percent CPU 1457.4 (3.64692)
> Context Switches 2250203 (3846.61)
> Sleeps 1.8781e+06 (5310.33)
> 
> Kernbench log for this sorted-buddy patchset
> 
> 
> Kernel: 3.7.0-rc3-sorted-buddy-default
> Average Optimal load -j 32 Run (std deviation):
> Elapsed Time 591.696 (0.660969)
> User Time 7511.97 (1.08313)
> System Time 1062.99 (1.1109)
> Percent CPU 1448.6 (1.94936)
> Context Switches 2.1496e+06 (3507.12)
> Sleeps 1.84305e+06 (3092.67)
> 



Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-09 Thread Srivatsa S. Bhat
On 11/09/2012 09:43 PM, Dave Hansen wrote:
> On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote:
>> FWIW, kernbench is actually (and surprisingly) showing a slight performance
>> *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
>> my other email to Dave.
>>
>> https://lkml.org/lkml/2012/11/7/428
>>
>> I don't think I can dismiss it as an experimental error, because I am seeing
>> those results consistently.. I'm trying to find out what's behind that.
> 
> The only numbers in that link are in the date. :)  Let's see the
> numbers, please.
> 

Sure :) The reason I didn't post the numbers very eagerly was that I didn't
want it to look ridiculous if it later turned out to be really an error in the
experiment ;) But since I have seen it happening consistently I think I can
post the numbers here with some non-zero confidence.

> If you really have performance improvement to the memory allocator (or
> something else) here, then surely it can be pared out of your patches
> and merged quickly by itself.  Those kinds of optimizations are hard to
> come by!
> 

:-)

Anyway, here it goes:

Test setup:
--
x86 2-socket quad-core machine. (CONFIG_NUMA=n because I figured that my
patchset might not handle NUMA properly). Mem region size = 512 MB.

Kernbench log for Vanilla 3.7-rc3
=

Kernel: 3.7.0-rc3-vanilla-default
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 650.742 (2.49774)
User Time 8213.08 (17.6347)
System Time 1273.91 (6.00643)
Percent CPU 1457.4 (3.64692)
Context Switches 2250203 (3846.61)
Sleeps 1.8781e+06 (5310.33)

Kernbench log for this sorted-buddy patchset


Kernel: 3.7.0-rc3-sorted-buddy-default
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 591.696 (0.660969)
User Time 7511.97 (1.08313)
System Time 1062.99 (1.1109)
Percent CPU 1448.6 (1.94936)
Context Switches 2.1496e+06 (3507.12)
Sleeps 1.84305e+06 (3092.67)
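
(For a rough sense of scale: the elapsed-time delta between the two runs above is 650.742 - 591.696 ≈ 59 s, i.e. the sorted-buddy kernel completed kernbench about 9% faster in this CONFIG_NUMA=n setup.)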

Regards,
Srivatsa S. Bhat



Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-09 Thread Dave Hansen
On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote:
> FWIW, kernbench is actually (and surprisingly) showing a slight performance
> *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
> my other email to Dave.
> 
> https://lkml.org/lkml/2012/11/7/428
> 
> I don't think I can dismiss it as an experimental error, because I am seeing
> those results consistently.. I'm trying to find out what's behind that.

The only numbers in that link are in the date. :)  Let's see the
numbers, please.

If you really have performance improvement to the memory allocator (or
something else) here, then surely it can be pared out of your patches
and merged quickly by itself.  Those kinds of optimizations are hard to
come by!



Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-09 Thread Arjan van de Ven
On 11/8/2012 9:14 PM, Vaidyanathan Srinivasan wrote:
> * Mel Gorman  [2012-11-08 18:02:57]:
> 
>> On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
>>> 
> 
> Hi Mel,
> 
> Thanks for detailed review and comments.  The goal of this patch
> series is to brainstorm on ideas that enable Linux VM to record and
> exploit memory region boundaries.
> 
> The first approach that we had last year (hierarchy) has more runtime
> overhead.  This approach of sorted-buddy was one of the alternative
> discussed earlier and we are trying to find out if simple requirements
> of biasing memory allocations can be achieved with this approach.
> 
> Smart reclaim based on this approach is a key piece we still need to
> design.  Ideas from compaction will certainly help.

Reclaim may be needed for the embedded use case,
but at least we are also looking at memory power savings that come from
content-preserving power states.
For that, Linux should *statistically* not be actively using (e.g. reading from
or writing to) a percentage of memory...
and statistical clustering is quite sufficient for that.

(For example, if you don't use a DIMM for a certain amount of time,
the link and other pieces can go to a lower power state,
even on today's server systems.
In a many-DIMM system, if each app is, on a per-app basis,
preferring one DIMM for its allocations, the process scheduler will
help us naturally keep the other DIMMs "dark".)

If you have to actually free the memory, it is a much, much harder problem,
increasingly so if the region you MUST free is quite large.

If one solution can solve both cases, great, but let's not make both fail to happen
because one of the cases is hard...
(And please let's not use moving or freeing of pages as a solution for at least
the content-preserving case.)


Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-09 Thread Srivatsa S. Bhat
On 11/09/2012 08:21 PM, Srivatsa S. Bhat wrote:
> On 11/09/2012 02:30 PM, Mel Gorman wrote:
>> On Fri, Nov 09, 2012 at 10:44:16AM +0530, Vaidyanathan Srinivasan wrote:
>>> * Mel Gorman  [2012-11-08 18:02:57]:
[...]
> Short description of the "Sorted-buddy" design:
> ---
>
> In this design, the memory region boundaries are captured in a parallel
> data-structure instead of fitting regions between nodes and zones in the
> hierarchy. Further, the buddy allocator is altered, such that we maintain 
> the
> zones' freelists in region-sorted-order and thus do page allocation in the
> order of increasing memory regions.

 Implying that this sorting has to happen in either the alloc or free
 fast path.
>>>
>>> Yes, in the free path. This optimization can be actually be delayed in
>>> the free fast path and completely avoided if our memory is full and we
>>> are doing direct reclaim during allocations.
>>>
>>
>> Hurting the free fast path is a bad idea as there are workloads that depend
>> on it (buffer allocation and free) even though many workloads do *not*
>> notice it because the bulk of the cost is incurred at exit time. As
>> memory low-power usage has many caveats (it may be impossible if a page
>> table is allocated in the region, for example) while CPU usage has fewer
>> restrictions, it is more important that the CPU usage be kept low.
>>
>> That means, little or no modification to the fastpath. Sorting or linear
>> searches should be minimised or avoided.
>>
> 
> Right. For example, in the previous "hierarchy" design[1], there was no 
> overhead
> in any of the fast paths. Because it split up the zones themselves, so that
> they fit on memory region boundaries. But that design had other problems, like
> zone fragmentation (too many zones), which kind of outweighed the benefit
> obtained from zero overhead in the fast-paths. So one of the suggested
> alternatives during that review[2], was to explore modifying the buddy 
> allocator
> to be aware of memory region boundaries, which this "sorted-buddy" design
> implements.
> 
> [1]. http://lwn.net/Articles/445045/
>  http://thread.gmane.org/gmane.linux.kernel.mm/63840
>  http://thread.gmane.org/gmane.linux.kernel.mm/89202
> 
> [2]. http://article.gmane.org/gmane.linux.power-management.general/24862
>  http://article.gmane.org/gmane.linux.power-management.general/25061
>  http://article.gmane.org/gmane.linux.kernel.mm/64689 
> 
> In this patchset, I have tried to minimize the overhead on the fastpaths.
> For example, I have used a special 'next_region' data-structure to keep the
> alloc path fast. Also, in the free path, we don't need to keep the free
> lists fully address sorted; having them region-sorted is sufficient. Of course
> we could explore more ways of avoiding overhead in the fast paths, or even a
> different design that promises to be much better overall. I'm all ears for
> any suggestions :-)
> 
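
To make the region-sorted free-list idea quoted above concrete, here is a rough, self-contained sketch in plain userspace C. It is illustrative only and is not the patchset's actual add_to_freelist(): MAX_REGIONS, region_of() and the list layout are assumptions made up for the example.

#include <stdio.h>
#include <stdlib.h>

#define MAX_REGIONS 4                        /* illustrative number of memory regions */

struct page {                                /* stand-in for the kernel's struct page */
    unsigned long pfn;
    struct page *prev, *next;
};

struct region_sorted_freelist {
    struct page head;                        /* circular list kept region-sorted */
    struct page *next_region[MAX_REGIONS];   /* hint: first free page of each region */
};

/* Illustrative pfn -> memory region mapping (e.g. 1024 pages per region). */
static int region_of(unsigned long pfn)
{
    return (int)((pfn / 1024) % MAX_REGIONS);
}

static void list_add_before(struct page *new, struct page *pos)
{
    new->prev = pos->prev;
    new->next = pos;
    pos->prev->next = new;
    pos->prev = new;
}

/*
 * Free path: insert the page in front of the first entry of its own
 * region (or of the next populated region), so pages stay grouped by
 * region in increasing region order, without being fully address-sorted.
 */
static void add_to_freelist_sketch(struct region_sorted_freelist *fl, struct page *page)
{
    int r = region_of(page->pfn);
    struct page *pos = &fl->head;            /* default: append at the tail */

    for (int i = r; i < MAX_REGIONS; i++) {
        if (fl->next_region[i]) {
            pos = fl->next_region[i];
            break;
        }
    }
    list_add_before(page, pos);
    fl->next_region[r] = page;               /* page is now its region's first entry */
}

int main(void)
{
    static struct region_sorted_freelist fl; /* zero-initialized */
    unsigned long pfns[] = { 3000, 100, 2100, 150 };

    fl.head.prev = fl.head.next = &fl.head;
    for (int i = 0; i < 4; i++) {
        struct page *p = calloc(1, sizeof(*p));
        p->pfn = pfns[i];
        add_to_freelist_sketch(&fl, p);
    }
    /* Allocation in increasing region order = walk from the list head. */
    for (struct page *p = fl.head.next; p != &fl.head; p = p->next)
        printf("pfn %lu (region %d)\n", p->pfn, region_of(p->pfn));
    return 0;
}

The allocation path can then hand out pages from the head of the list (lowest regions first), and the next_region[] hints bound the cost of a free-path insertion by the number of regions rather than the length of the list.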

FWIW, kernbench is actually (and surprisingly) showing a slight performance
*improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
my other email to Dave.

https://lkml.org/lkml/2012/11/7/428

I don't think I can dismiss it as an experimental error, because I am seeing
those results consistently.. I'm trying to find out what's behind that.

Regards,
Srivatsa S. Bhat



Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-09 Thread Srivatsa S. Bhat
On 11/09/2012 02:30 PM, Mel Gorman wrote:
> On Fri, Nov 09, 2012 at 10:44:16AM +0530, Vaidyanathan Srinivasan wrote:
>> * Mel Gorman  [2012-11-08 18:02:57]:
>>
[...]
>>> How much power is saved?
>>
>> On embedded platform the savings could be around 5% as discussed in
>> the earlier thread: http://article.gmane.org/gmane.linux.kernel.mm/65935
>>
>> On larger servers with large amounts of memory the savings could be
>> more.  We do not yet have all the pieces together to evaluate.
>>
> 
> Ok, it's something to keep an eye on because if memory power savings
> require large amounts of CPU (for smart placement or migration) or more
> disk accesses (due to reclaim) then the savings will be offset by
> increased power usage elsewhere.
> 

True.

 ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
 the firmware can expose information regarding the boundaries of such memory
 power management domains to the OS in a standard way.

>>>
>>> I'm not familiar with the ACPI spec but is there support for parsing of
>>> MPST and interpreting the associated ACPI events? For example, if ACPI
>>> fires an event indicating that a memory power node is to enter a low
>>> state then presumably the OS should actively migrate pages away -- even
>>> if it's going into a state where the contents are still refreshed
>>> as exiting that state could take a long time.
>>>
>>> I did not look closely at the patchset at all because it looked like the
>>> actual support to use it and measure the benefit is missing.
>>
>> Correct.  The platform interface part is not included in this patch
>> set mainly because there is not much design required there.  Each
>> platform can have code to collect the memory region boundaries from
>> BIOS/firmware and load it into the Linux VM.  The goal of this patch
>> is to brainstorm on the idea of how the core VM should use the region
>> information.
>>  
> 
> Ok. It does mean that the patches should not be merged until there is
> some platform support that can take advantage of them.
>

That's right, but the development of the VM algorithms and the platform
support for different platforms can go on in parallel. And once we have all
the pieces designed, we can fit them together and merge them.
 
 How can Linux VM help memory power savings?

 o Consolidate memory allocations and/or references such that they are
 not spread across the entire memory address space.  Basically, areas of memory
 that are not being referenced can reside in a low power state.

>>>
>>> Which the series does not appear to do.
>>
>> Correct.  We need to design the correct reclaim strategy for this to
>> work.  However having buddy list sorted by region address could get us
>> one step closer to shaping the allocations.
>>
> 
> If you reclaim, it means that the information is going to disk and will
> have to be refaulted in sooner rather than later. If you concentrate on
> reclaiming low memory regions and memory is almost full, it will lead to
> a situation where you almost always reclaim newer pages and increase
> faulting. You will save a few milliwatts on memory and lose way more
> than that on increased disk traffic and CPU usage.
> 

Yes, we should ensure that our reclaim strategy won't back-fire like that.
We definitely need to depend on LRU ordering for reclaim for the most part,
but try to opportunistically reclaim from within the required region boundaries
while doing that. We definitely need to think more about this...

But the point of making the free lists sorted region-wise in this patchset
was to exploit the shaping of page allocations the way we want (ie.,
constrained to lesser number of regions).

 o Support targeted memory reclaim, where certain areas of memory that can 
 be
 easily freed can be offlined, allowing those areas of memory to be put into
 lower power states.

>>>
>>> Which the series does not appear to do judging from this;
>>>
>>>   include/linux/mm.h |   38 +++
>>>   include/linux/mmzone.h |   52 +
>>>   mm/compaction.c|8 +
>>>   mm/page_alloc.c|  263 
>>> 
>>>   mm/vmstat.c|   59 ++-
>>>
>>> This does not appear to be doing anything with reclaim and not enough with
>>> compaction to indicate that the series actively manages memory placement
>>> in response to ACPI events.
>>
>> Correct.  Evaluating different ideas for reclaim will be next step
>> before getting into the platform interface parts.
>>
[...]
>>
>> This patch is roughly based on the idea that ACPI MPST will give us
>> memory region boundaries.  It is not designed to implement all options
>> defined in the spec. 
> 
> Ok, but as it is the only potential consumer of this interface that you
> mentioned then it should at least be able to handle it. The spec talks about
> overlapping memory regions where the regions potentially have different
> power states. This 

Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-09 Thread Mel Gorman
On Fri, Nov 09, 2012 at 10:44:16AM +0530, Vaidyanathan Srinivasan wrote:
> * Mel Gorman  [2012-11-08 18:02:57]:
> 
> > On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
> > > 
> 
> Hi Mel,
> 
> Thanks for detailed review and comments.  The goal of this patch
> series is to brainstorm on ideas that enable Linux VM to record and
> exploit memory region boundaries.
> 

I see.

> The first approach that we had last year (hierarchy) has more runtime
> overhead.  This approach of sorted-buddy was one of the alternative
> discussed earlier and we are trying to find out if simple requirements
> of biasing memory allocations can be achieved with this approach.
> 
> Smart reclaim based on this approach is a key piece we still need to
> design.  Ideas from compaction will certainly help.
> 
> > > Today's memory subsystems offer a wide range of capabilities for managing
> > > memory power consumption. As a quick example, if a block of memory is not
> > > referenced for a threshold amount of time, the memory controller can decide to
> > > put that chunk into a low-power content-preserving state. And the next
> > > reference to that memory chunk would bring it back to full power for read/write.
> > > With this capability in place, it becomes important for the OS to understand
> > > the boundaries of such power-manageable chunks of memory and to ensure that
> > > references are consolidated to a minimum number of such memory power management
> > > domains.
> > > 
> > 
> > How much power is saved?
> 
> On embedded platform the savings could be around 5% as discussed in
> the earlier thread: http://article.gmane.org/gmane.linux.kernel.mm/65935
> 
> On larger servers with large amounts of memory the savings could be
> more.  We do not yet have all the pieces together to evaluate.
> 

Ok, it's something to keep an eye on because if memory power savings
require large amounts of CPU (for smart placement or migration) or more
disk accesses (due to reclaim) then the savings will be offset by
increased power usage elsewhere.

> > > ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so 
> > > that
> > > the firmware can expose information regarding the boundaries of such 
> > > memory
> > > power management domains to the OS in a standard way.
> > > 
> > 
> > I'm not familiar with the ACPI spec but is there support for parsing of
> > MPST and interpreting the associated ACPI events? For example, if ACPI
> > fires an event indicating that a memory power node is to enter a low
> > state then presumably the OS should actively migrate pages away -- even
> > if it's going into a state where the contents are still refreshed
> > as exiting that state could take a long time.
> > 
> > I did not look closely at the patchset at all because it looked like the
> > actual support to use it and measure the benefit is missing.
> 
> Correct.  The platform interface part is not included in this patch
> set mainly because there is not much design required there.  Each
> platform can have code to collect the memory region boundaries from
> BIOS/firmware and load it into the Linux VM.  The goal of this patch
> is to brainstorm on the idea of how the core VM should use the region
> information.
>  

Ok. It does mean that the patches should not be merged until there is
some platform support that can take advantage of them.

> > > How can Linux VM help memory power savings?
> > > 
> > > o Consolidate memory allocations and/or references such that they are
> > > not spread across the entire memory address space.  Basically, areas of memory
> > > that are not being referenced can reside in a low power state.
> > > 
> > 
> > Which the series does not appear to do.
> 
> Correct.  We need to design the correct reclaim strategy for this to
> work.  However having buddy list sorted by region address could get us
> one step closer to shaping the allocations.
> 

If you reclaim, it means that the information is going to disk and will
have to be refaulted in sooner rather than later. If you concentrate on
reclaiming low memory regions and memory is almost full, it will lead to
a situation where you almost always reclaim newer pages and increase
faulting. You will save a few milliwatts on memory and lose way more
than that on increased disk traffic and CPU usage.

> > > o Support targeted memory reclaim, where certain areas of memory that can 
> > > be
> > > easily freed can be offlined, allowing those areas of memory to be put 
> > > into
> > > lower power states.
> > > 
> > 
> > Which the series does not appear to do judging from this;
> > 
> >   include/linux/mm.h |   38 +++
> >   include/linux/mmzone.h |   52 +
> >   mm/compaction.c|8 +
> >   mm/page_alloc.c|  263 
> > 
> >   mm/vmstat.c|   59 ++-
> > 
> > This does not 

Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-09 Thread Mel Gorman
On Fri, Nov 09, 2012 at 10:44:16AM +0530, Vaidyanathan Srinivasan wrote:
 * Mel Gorman mgor...@suse.de [2012-11-08 18:02:57]:
 
  On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
   
 
 Hi Mel,
 
 Thanks for detailed review and comments.  The goal of this patch
 series is to brainstorm on ideas that enable Linux VM to record and
 exploit memory region boundaries.
 

I see.

 The first approach that we had last year (hierarchy) has more runtime
 overhead.  This approach of sorted-buddy was one of the alternative
 discussed earlier and we are trying to find out if simple requirements
 of biasing memory allocations can be achieved with this approach.
 
 Smart reclaim based on this approach is a key piece we still need to
 design.  Ideas from compaction will certainly help.
 
   Today memory subsystems are offer a wide range of capabilities for 
   managing
   memory power consumption. As a quick example, if a block of memory is not
   referenced for a threshold amount of time, the memory controller can 
   decide to
   put that chunk into a low-power content-preserving state. And the next
   reference to that memory chunk would bring it back to full power for 
   read/write.
   With this capability in place, it becomes important for the OS to 
   understand
   the boundaries of such power-manageable chunks of memory and to ensure 
   that
   references are consolidated to a minimum number of such memory power 
   management
   domains.
   
  
  How much power is saved?
 
 On embedded platform the savings could be around 5% as discussed in
 the earlier thread: http://article.gmane.org/gmane.linux.kernel.mm/65935
 
 On larger servers with large amounts of memory the savings could be
 more.  We do not yet have all the pieces together to evaluate.
 

Ok, it's something to keep an eye on because if memory power savings
require large amounts of CPU (for smart placement or migration) or more
disk accesses (due to reclaim) then the savings will be offset by
increased power usage elsehwere.

   ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so 
   that
   the firmware can expose information regarding the boundaries of such 
   memory
   power management domains to the OS in a standard way.
   
  
  I'm not familiar with the ACPI spec but is there support for parsing of
  MPST and interpreting the associated ACPI events? For example, if ACPI
  fires an event indicating that a memory power node is to enter a low
  state then presumably the OS should actively migrate pages away -- even
  if it's going into a state where the contents are still refreshed
  as exiting that state could take a long time.
  
  I did not look closely at the patchset at all because it looked like the
  actual support to use it and measure the benefit is missing.
 
 Correct.  The platform interface part is not included in this patch
 set mainly because there is not much design required there.  Each
 platform can have code to collect the memory region boundaries from
 BIOS/firmware and load it into the Linux VM.  The goal of this patch
 is to brainstorm on the idea of how the core VM should use the region
 information.
  

Ok. It does mean that the patches should not be merged until there is
some platform support that can take advantage of them.

   How can Linux VM help memory power savings?
   
   o Consolidate memory allocations and/or references such that they are
   not spread across the entire memory address space.  Basically, an area of
   memory that is not being referenced can reside in a low power state.
   
  
  Which the series does not appear to do.
 
 Correct.  We need to design the correct reclaim strategy for this to
 work.  However, having the buddy list sorted by region address could get us
 one step closer to shaping the allocations.
 

If you reclaim, it means that the information is going to disk and will
have to be refaulted in sooner rather than later. If you concentrate on
reclaiming low memory regions and memory is almost full, it will lead to
a situation where you almost always reclaim newer pages and increase
faulting. You will save a few milliwatts on memory and lose way more
 than that on increased disk traffic and CPU usage.

   o Support targeted memory reclaim, where certain areas of memory that can 
   be
   easily freed can be offlined, allowing those areas of memory to be put 
   into
   lower power states.
   
  
  Which the series does not appear to do judging from this;
  
include/linux/mm.h |   38 +++
include/linux/mmzone.h |   52 +
mm/compaction.c|8 +
mm/page_alloc.c|  263 
  
mm/vmstat.c|   59 ++-
  
  This does not appear to be doing anything with reclaim and not enough with
  compaction to indicate that the series actively manages memory placement
  in response to ACPI events.
 
 Correct.  

Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-09 Thread Srivatsa S. Bhat
On 11/09/2012 02:30 PM, Mel Gorman wrote:
 On Fri, Nov 09, 2012 at 10:44:16AM +0530, Vaidyanathan Srinivasan wrote:
 * Mel Gorman mgor...@suse.de [2012-11-08 18:02:57]:

[...]
 How much power is saved?

 On embedded platform the savings could be around 5% as discussed in
 the earlier thread: http://article.gmane.org/gmane.linux.kernel.mm/65935

 On larger servers with large amounts of memory the savings could be
 more.  We do not yet have all the pieces together to evaluate.

 
 Ok, it's something to keep an eye on because if memory power savings
 require large amounts of CPU (for smart placement or migration) or more
 disk accesses (due to reclaim) then the savings will be offset by
 increased power usage elsewhere.
 

True.

 ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
 the firmware can expose information regarding the boundaries of such memory
 power management domains to the OS in a standard way.


 I'm not familiar with the ACPI spec but is there support for parsing of
 MPST and interpreting the associated ACPI events? For example, if ACPI
 fires an event indicating that a memory power node is to enter a low
 state then presumably the OS should actively migrate pages away -- even
 if it's going into a state where the contents are still refreshed
 as exiting that state could take a long time.

 I did not look closely at the patchset at all because it looked like the
 actual support to use it and measure the benefit is missing.

 Correct.  The platform interface part is not included in this patch
 set mainly because there is not much design required there.  Each
 platform can have code to collect the memory region boundaries from
 BIOS/firmware and load it into the Linux VM.  The goal of this patch
 is to brainstorm on the idea of how the core VM should use the region
 information.
  
 
 Ok. It does mean that the patches should not be merged until there is
 some platform support that can take advantage of them.


That's right, but the development of the VM algorithms and the platform
support for different platforms can go on in parallel. And once we have all
the pieces designed, we can fit them together and merge them.
 
 How can Linux VM help memory power savings?

 o Consolidate memory allocations and/or references such that they are
 not spread across the entire memory address space.  Basically, an area of
 memory that is not being referenced can reside in a low power state.


 Which the series does not appear to do.

 Correct.  We need to design the correct reclaim strategy for this to
 work.  However, having the buddy list sorted by region address could get us
 one step closer to shaping the allocations.

 
 If you reclaim, it means that the information is going to disk and will
 have to be refaulted in sooner rather than later. If you concentrate on
 reclaiming low memory regions and memory is almost full, it will lead to
 a situation where you almost always reclaim newer pages and increase
 faulting. You will save a few milliwatts on memory and lose way more
 than that on increased disk traffic and CPU usage.
 

Yes, we should ensure that our reclaim strategy won't back-fire like that.
We definitely need to depend on LRU ordering for reclaim for the most part,
but try to opportunistically reclaim from within the required region boundaries
while doing that. We definitely need to think more about this...

But the point of making the free lists sorted region-wise in this patchset
was to exploit the shaping of page allocations the way we want (i.e.,
constrained to a smaller number of regions).

 o Support targeted memory reclaim, where certain areas of memory that can 
 be
 easily freed can be offlined, allowing those areas of memory to be put into
 lower power states.


 Which the series does not appear to do judging from this;

   include/linux/mm.h |   38 +++
   include/linux/mmzone.h |   52 +
   mm/compaction.c|8 +
   mm/page_alloc.c|  263 
 
   mm/vmstat.c|   59 ++-

 This does not appear to be doing anything with reclaim and not enough with
 compaction to indicate that the series actively manages memory placement
 in response to ACPI events.

 Correct.  Evaluating different ideas for reclaim will be next step
 before getting into the platform interface parts.

[...]

 This patch is roughly based on the idea that ACPI MPST will give us
 memory region boundaries.  It is not designed to implement all options
 defined in the spec. 
 
 Ok, but as it is the only potential consumer of this interface that you
 mentioned, it should at least be able to handle it. The spec talks about
 overlapping memory regions where the regions potentially have different
 power states. This is pretty damn remarkable and hard to see how it could
 be interpreted in a sensible way, but it forces your implementation to take
 it into account.


Well, sorry for not mentioning in the cover-letter, but the 

Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-09 Thread Srivatsa S. Bhat
On 11/09/2012 08:21 PM, Srivatsa S. Bhat wrote:
 On 11/09/2012 02:30 PM, Mel Gorman wrote:
 On Fri, Nov 09, 2012 at 10:44:16AM +0530, Vaidyanathan Srinivasan wrote:
 * Mel Gorman mgor...@suse.de [2012-11-08 18:02:57]:
[...]
 Short description of the Sorted-buddy design:
 ---

 In this design, the memory region boundaries are captured in a parallel
 data-structure instead of fitting regions between nodes and zones in the
 hierarchy. Further, the buddy allocator is altered, such that we maintain 
 the
 zones' freelists in region-sorted-order and thus do page allocation in the
 order of increasing memory regions.

 Implying that this sorting has to happen in either the alloc or free
 fast path.

 Yes, in the free path. This optimization can actually be delayed in
 the free fast path and completely avoided if our memory is full and we
 are doing direct reclaim during allocations.


 Hurting the free fast path is a bad idea as there are workloads that depend
 on it (buffer allocation and free) even though many workloads do *not*
 notice it because the bulk of the cost is incurred at exit time. As
 memory low-power usage has many caveats (it may be impossible if a page
 table is allocated in the region, for example) while CPU usage has fewer
 restrictions, it is more important that CPU usage be kept low.

 That means little or no modification to the fastpath. Sorting or linear
 searches should be minimised or avoided.

 
 Right. For example, in the previous hierarchy design[1], there was no
 overhead in any of the fast paths, because it split up the zones themselves
 so that they fit on memory region boundaries. But that design had other
 problems, like zone fragmentation (too many zones), which kind of outweighed
 the benefit obtained from zero overhead in the fast-paths. So one of the
 suggested alternatives during that review[2] was to explore modifying the
 buddy allocator to be aware of memory region boundaries, which this
 sorted-buddy design implements.
 
 [1]. http://lwn.net/Articles/445045/
  http://thread.gmane.org/gmane.linux.kernel.mm/63840
  http://thread.gmane.org/gmane.linux.kernel.mm/89202
 
 [2]. http://article.gmane.org/gmane.linux.power-management.general/24862
  http://article.gmane.org/gmane.linux.power-management.general/25061
  http://article.gmane.org/gmane.linux.kernel.mm/64689 
 
 In this patchset, I have tried to minimize the overhead on the fastpaths.
 For example, I have used a special 'next_region' data-structure to keep the
 alloc path fast. Also, in the free path, we don't need to keep the free
 lists fully address sorted; having them region-sorted is sufficient. Of course
 we could explore more ways of avoiding overhead in the fast paths, or even a
 different design that promises to be much better overall. I'm all ears for
 any suggestions :-)
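
To make the free-path part above a bit more concrete, here is a heavily
simplified sketch of insertion into a region-sorted freelist. The helper
page_region_id() and the naive linear walk are purely illustrative -- this is
not the patchset's code, which tries to keep this cost down (for example by
delaying the sorting, as mentioned above):

#include <linux/list.h>
#include <linux/mm.h>

#define MEM_REGION_SHIFT	29	/* 512 MB regions, as hard-coded in this RFC */

/* Hypothetical helper: the memory-region number a page belongs to. */
static inline int page_region_id(struct page *page)
{
	return page_to_pfn(page) >> (MEM_REGION_SHIFT - PAGE_SHIFT);
}

/*
 * Insert a freed page so the freelist stays sorted by region number only
 * (not fully address-sorted).
 */
static void add_to_freelist_region_sorted(struct page *page,
					  struct list_head *free_list)
{
	struct page *cur;

	/* Walk until we hit a page belonging to a higher region... */
	list_for_each_entry(cur, free_list, lru) {
		if (page_region_id(cur) > page_region_id(page)) {
			/* ...and insert the freed page just before it. */
			list_add_tail(&page->lru, &cur->lru);
			return;
		}
	}

	/* Everything already queued belongs to lower-or-equal regions. */
	list_add_tail(&page->lru, free_list);
}

Allocation then takes pages from the head of the list as usual, which is what
biases allocations towards the lower-numbered regions.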
 

FWIW, kernbench is actually (and surprisingly) showing a slight performance
*improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
my other email to Dave.

https://lkml.org/lkml/2012/11/7/428

I don't think I can dismiss it as an experimental error, because I am seeing
those results consistently.. I'm trying to find out what's behind that.

Regards,
Srivatsa S. Bhat



Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-09 Thread Arjan van de Ven
On 11/8/2012 9:14 PM, Vaidyanathan Srinivasan wrote:
 * Mel Gorman mgor...@suse.de [2012-11-08 18:02:57]:
 
 On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
 
 
 Hi Mel,
 
 Thanks for detailed review and comments.  The goal of this patch
 series is to brainstorm on ideas that enable Linux VM to record and
 exploit memory region boundaries.
 
 The first approach that we had last year (hierarchy) has more runtime
 overhead.  This approach of sorted-buddy was one of the alternative
 discussed earlier and we are trying to find out if simple requirements
 of biasing memory allocations can be achieved with this approach.
 
 Smart reclaim based on this approach is a key piece we still need to
 design.  Ideas from compaction will certainly help.

reclaim may be needed for the embedded use case,
but at least we are also looking at memory power savings that come from
content-preserving power states.
For that, Linux should *statistically* not be actively using (e.g. reading
from or writing to) a percentage of memory...
and statistical clustering is quite sufficient for that.

(for example, if you don't use a DIMM for a certain amount of time,
the link and other pieces can go to a lower power state,
even on today's server systems.
In a many-DIMM system, if each app is, on a per-app basis,
preferring one DIMM for its allocations, the process scheduler will
naturally help us keep the other DIMMs dark)

If you have to actually free the memory, it is a much much harder problem,
increasingly so if the region you MUST free is quite large.

if one solution can solve both cases, great, but let's not prevent both from
happening because one of the cases is hard...
(and please let's not use moving or freeing of pages as a solution for at
least the content-preserving case)


Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-09 Thread Dave Hansen
On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote:
 FWIW, kernbench is actually (and surprisingly) showing a slight performance
 *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
 my other email to Dave.
 
 https://lkml.org/lkml/2012/11/7/428
 
 I don't think I can dismiss it as an experimental error, because I am seeing
 those results consistently.. I'm trying to find out what's behind that.

The only numbers in that link are in the date. :)  Let's see the
numbers, please.

If you really have performance improvement to the memory allocator (or
something else) here, then surely it can be pared out of your patches
and merged quickly by itself.  Those kinds of optimizations are hard to
come by!



Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-09 Thread Srivatsa S. Bhat
On 11/09/2012 09:43 PM, Dave Hansen wrote:
 On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote:
 FWIW, kernbench is actually (and surprisingly) showing a slight performance
 *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
 my other email to Dave.

 https://lkml.org/lkml/2012/11/7/428

 I don't think I can dismiss it as an experimental error, because I am seeing
 those results consistently.. I'm trying to find out what's behind that.
 
 The only numbers in that link are in the date. :)  Let's see the
 numbers, please.
 

Sure :) The reason I didn't post the numbers very eagerly was that I didn't
want it to look ridiculous if it later turned out to be really an error in the
experiment ;) But since I have seen it happening consistently I think I can
post the numbers here with some non-zero confidence.

 If you really have performance improvement to the memory allocator (or
 something else) here, then surely it can be pared out of your patches
 and merged quickly by itself.  Those kinds of optimizations are hard to
 come by!
 

:-)

Anyway, here it goes:

Test setup:
--
x86 2-socket quad-core machine. (CONFIG_NUMA=n because I figured that my
patchset might not handle NUMA properly). Mem region size = 512 MB.

Kernbench log for Vanilla 3.7-rc3
=

Kernel: 3.7.0-rc3-vanilla-default
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 650.742 (2.49774)
User Time 8213.08 (17.6347)
System Time 1273.91 (6.00643)
Percent CPU 1457.4 (3.64692)
Context Switches 2250203 (3846.61)
Sleeps 1.8781e+06 (5310.33)

Kernbench log for this sorted-buddy patchset


Kernel: 3.7.0-rc3-sorted-buddy-default
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 591.696 (0.660969)
User Time 7511.97 (1.08313)
System Time 1062.99 (1.1109)
Percent CPU 1448.6 (1.94936)
Context Switches 2.1496e+06 (3507.12)
Sleeps 1.84305e+06 (3092.67)

Regards,
Srivatsa S. Bhat



Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-09 Thread Srivatsa S. Bhat
On 11/09/2012 10:04 PM, Srivatsa S. Bhat wrote:
 On 11/09/2012 09:43 PM, Dave Hansen wrote:
 On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote:
 FWIW, kernbench is actually (and surprisingly) showing a slight performance
 *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
 my other email to Dave.

 https://lkml.org/lkml/2012/11/7/428

 I don't think I can dismiss it as an experimental error, because I am seeing
 those results consistently.. I'm trying to find out what's behind that.

 The only numbers in that link are in the date. :)  Let's see the
 numbers, please.

 
 Sure :) The reason I didn't post the numbers very eagerly was that I didn't
 want it to look ridiculous if it later turned out to be really an error in the
 experiment ;) But since I have seen it happening consistently I think I can
 post the numbers here with some non-zero confidence.
 
 If you really have performance improvement to the memory allocator (or
 something else) here, then surely it can be pared out of your patches
 and merged quickly by itself.  Those kinds of optimizations are hard to
 come by!

 
 :-)
 
 Anyway, here it goes:
 
 Test setup:
 --
 x86 2-socket quad-core machine. (CONFIG_NUMA=n because I figured that my
 patchset might not handle NUMA properly). Mem region size = 512 MB.
 

For CONFIG_NUMA=y on the same machine, the difference between the 2 kernels
was much smaller, but nevertheless, this patchset performed better. I wouldn't
vouch that my patchset handles NUMA correctly, but here are the numbers from
that run anyway (at least to show that I really found the results to be
repeatable):

Kernbench log for Vanilla 3.7-rc3
=
Kernel: 3.7.0-rc3-vanilla-numa-default
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 589.058 (0.596171)
User Time 7461.26 (1.69702)
System Time 1072.03 (1.54704)
Percent CPU 1448.2 (1.30384)
Context Switches 2.14322e+06 (4042.97)
Sleeps 1847230 (2614.96)

Kernbench log for Vanilla 3.7-rc3
=
Kernel: 3.7.0-rc3-sorted-buddy-numa-default
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 577.182 (0.713772)
User Time 7315.43 (3.87226)
System Time 1043 (1.12855)
Percent CPU 1447.6 (2.19089)
Context Switches 2117022 (3810.15)
Sleeps 1.82966e+06 (4149.82)


Regards,
Srivatsa S. Bhat

 Kernbench log for Vanilla 3.7-rc3
 =
 
 Kernel: 3.7.0-rc3-vanilla-default
 Average Optimal load -j 32 Run (std deviation):
 Elapsed Time 650.742 (2.49774)
 User Time 8213.08 (17.6347)
 System Time 1273.91 (6.00643)
 Percent CPU 1457.4 (3.64692)
 Context Switches 2250203 (3846.61)
 Sleeps 1.8781e+06 (5310.33)
 
 Kernbench log for this sorted-buddy patchset
 
 
 Kernel: 3.7.0-rc3-sorted-buddy-default
 Average Optimal load -j 32 Run (std deviation):
 Elapsed Time 591.696 (0.660969)
 User Time 7511.97 (1.08313)
 System Time 1062.99 (1.1109)
 Percent CPU 1448.6 (1.94936)
 Context Switches 2.1496e+06 (3507.12)
 Sleeps 1.84305e+06 (3092.67)
 



Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-09 Thread Srivatsa S. Bhat
On 11/09/2012 10:13 PM, Srivatsa S. Bhat wrote:
 On 11/09/2012 10:04 PM, Srivatsa S. Bhat wrote:
 On 11/09/2012 09:43 PM, Dave Hansen wrote:
 On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote:
 FWIW, kernbench is actually (and surprisingly) showing a slight performance
 *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
 my other email to Dave.

 https://lkml.org/lkml/2012/11/7/428

 I don't think I can dismiss it as an experimental error, because I am 
 seeing
 those results consistently.. I'm trying to find out what's behind that.

 The only numbers in that link are in the date. :)  Let's see the
 numbers, please.


 Sure :) The reason I didn't post the numbers very eagerly was that I didn't
 want it to look ridiculous if it later turned out to be really an error in 
 the
 experiment ;) But since I have seen it happening consistently I think I can
 post the numbers here with some non-zero confidence.

 If you really have performance improvement to the memory allocator (or
 something else) here, then surely it can be pared out of your patches
 and merged quickly by itself.  Those kinds of optimizations are hard to
 come by!


 :-)

 Anyway, here it goes:

 Test setup:
 --
 x86 2-socket quad-core machine. (CONFIG_NUMA=n because I figured that my
 patchset might not handle NUMA properly). Mem region size = 512 MB.

 
 For CONFIG_NUMA=y on the same machine, the difference between the 2 kernels
 was much smaller, but nevertheless, this patchset performed better. I wouldn't
 vouch that my patchset handles NUMA correctly, but here are the numbers from
 that run anyway (at least to show that I really found the results to be
 repeatable):
 
 Kernbench log for Vanilla 3.7-rc3
 =
 Kernel: 3.7.0-rc3-vanilla-numa-default
 Average Optimal load -j 32 Run (std deviation):
 Elapsed Time 589.058 (0.596171)
 User Time 7461.26 (1.69702)
 System Time 1072.03 (1.54704)
 Percent CPU 1448.2 (1.30384)
 Context Switches 2.14322e+06 (4042.97)
 Sleeps 1847230 (2614.96)
 
 Kernbench log for Vanilla 3.7-rc3
 =

Oops, that title must have been for the sorted-buddy patchset, of course.

 Kernel: 3.7.0-rc3-sorted-buddy-numa-default
 Average Optimal load -j 32 Run (std deviation):
 Elapsed Time 577.182 (0.713772)
 User Time 7315.43 (3.87226)
 System Time 1043 (1.12855)
 Percent CPU 1447.6 (2.19089)
 Context Switches 2117022 (3810.15)
 Sleeps 1.82966e+06 (4149.82)
 
 

Regards,
Srivatsa S. Bhat



Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-08 Thread Vaidyanathan Srinivasan
* Mel Gorman  [2012-11-08 18:02:57]:

> On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
> > 

Hi Mel,

Thanks for detailed review and comments.  The goal of this patch
series is to brainstorm on ideas that enable Linux VM to record and
exploit memory region boundaries.

The first approach that we had last year (hierarchy) has more runtime
overhead.  This approach of sorted-buddy was one of the alternatives
discussed earlier and we are trying to find out if simple requirements
of biasing memory allocations can be achieved with this approach.

Smart reclaim based on this approach is a key piece we still need to
design.  Ideas from compaction will certainly help.

> > Today memory subsystems offer a wide range of capabilities for managing
> > memory power consumption. As a quick example, if a block of memory is not
> > referenced for a threshold amount of time, the memory controller can decide 
> > to
> > put that chunk into a low-power content-preserving state. And the next
> > reference to that memory chunk would bring it back to full power for 
> > read/write.
> > With this capability in place, it becomes important for the OS to understand
> > the boundaries of such power-manageable chunks of memory and to ensure that
> > references are consolidated to a minimum number of such memory power 
> > management
> > domains.
> > 
> 
> How much power is saved?

On embedded platform the savings could be around 5% as discussed in
the earlier thread: http://article.gmane.org/gmane.linux.kernel.mm/65935

On larger servers with large amounts of memory the savings could be
more.  We do not yet have all the pieces together to evaluate.

> > ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
> > the firmware can expose information regarding the boundaries of such memory
> > power management domains to the OS in a standard way.
> > 
> 
> I'm not familiar with the ACPI spec but is there support for parsing of
> MPST and interpreting the associated ACPI events? For example, if ACPI
> fires an event indicating that a memory power node is to enter a low
> state then presumably the OS should actively migrate pages away -- even
> if it's going into a state where the contents are still refreshed
> as exiting that state could take a long time.
> 
> I did not look closely at the patchset at all because it looked like the
> actual support to use it and measure the benefit is missing.

Correct.  The platform interface part is not included in this patch
set mainly because there is not much design required there.  Each
platform can have code to collect the memory region boundaries from
BIOS/firmware and load it into the Linux VM.  The goal of this patch
is to brainstorm on the idea of how the core VM should use the region
information.
 
> > How can Linux VM help memory power savings?
> > 
> > o Consolidate memory allocations and/or references such that they are
> > not spread across the entire memory address space.  Basically, an area of memory
> > that is not being referenced can reside in a low power state.
> > 
> 
> Which the series does not appear to do.

Correct.  We need to design the correct reclaim strategy for this to
work.  However, having the buddy list sorted by region address could get us
one step closer to shaping the allocations.

> > o Support targeted memory reclaim, where certain areas of memory that can be
> > easily freed can be offlined, allowing those areas of memory to be put into
> > lower power states.
> > 
> 
> Which the series does not appear to do judging from this;
> 
>   include/linux/mm.h |   38 +++
>   include/linux/mmzone.h |   52 +
>   mm/compaction.c|8 +
>   mm/page_alloc.c|  263 
> 
>   mm/vmstat.c|   59 ++-
> 
> This does not appear to be doing anything with reclaim and not enough with
> compaction to indicate that the series actively manages memory placement
> in response to ACPI events.

Correct.  Evaluating different ideas for reclaim will be next step
before getting into the platform interface parts.

> Further in section 5.2.21.4 the spec says that power node regions can
> overlap (but are not hierarchical for some reason) but have no gaps, yet the
> structure you use to represent them assumes there can be gaps and there are
> no overlaps. Again, this is just glancing at the spec and a quick skim of
> the patches so maybe I missed something that explains why this structure
> is suitable.

This patch is roughly based on the idea that ACPI MPST will give us
memory region boundaries.  It is not designed to implement all options
defined in the spec.  We have taken the general case where regions do not
overlap, while the memory addresses themselves can be discontiguous.
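
As an illustration of that general case (names invented for this sketch, not
the patch code), the region lookup could simply be a binary search over a
boundary table that permits holes between regions:

/*
 * Illustrative only: regions never overlap, but the physical address space
 * may have holes, so find a pfn's region by binary search over a table
 * sorted by start_pfn.
 */
struct mem_region {
	unsigned long start_pfn;
	unsigned long end_pfn;		/* exclusive */
};

static int pfn_to_mem_region(const struct mem_region *table, int nr_regions,
			     unsigned long pfn)
{
	int lo = 0, hi = nr_regions - 1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (pfn < table[mid].start_pfn)
			hi = mid - 1;
		else if (pfn >= table[mid].end_pfn)
			lo = mid + 1;
		else
			return mid;	/* pfn falls inside this region */
	}

	return -1;	/* pfn lies in a hole between regions */
}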

> It seems to me that superficially the VM implementation for the support
> would have
> 
> a) Involved a tree that managed the overlapping regions (even 

Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-08 Thread Srivatsa S. Bhat
On 11/08/2012 11:32 PM, Mel Gorman wrote:
> On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
>> 
>>
>> Today memory subsystems offer a wide range of capabilities for managing
>> memory power consumption. As a quick example, if a block of memory is not
>> referenced for a threshold amount of time, the memory controller can decide 
>> to
>> put that chunk into a low-power content-preserving state. And the next
>> reference to that memory chunk would bring it back to full power for 
>> read/write.
>> With this capability in place, it becomes important for the OS to understand
>> the boundaries of such power-manageable chunks of memory and to ensure that
>> references are consolidated to a minimum number of such memory power 
>> management
>> domains.
>>
> 
> How much power is saved?

Last year, Amit had evaluated the "Hierarchy" patchset on a Samsung Exynos (ARM)
board and reported that it could save up to 6.3% relative to total system power.
(This was when he allowed only 1 GB out of the total 2 GB RAM to enter low
power states).

Below is the link to his post, as mentioned in the references section in the
cover letter.
http://article.gmane.org/gmane.linux.kernel.mm/65935

Of course, the power savings depends on the characteristics of the particular
hardware memory subsystem used, and the amount of memory present in the system.

> 
>> ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
>> the firmware can expose information regarding the boundaries of such memory
>> power management domains to the OS in a standard way.
>>
> 
> I'm not familiar with the ACPI spec but is there support for parsing of
> MPST and interpreting the associated ACPI events?

Sorry I should have been clearer when I mentioned ACPI 5.0. I mentioned ACPI 5.0
just to make a point that support for getting the memory power management
boundaries from the firmware is not far away. I didn't mean to say that that's
the only target for memory power management. Like I mentioned above, last year
the power-savings benefit was measured on ARM boards. The aim of this patchset
is to propose and evaluate some of the core VM algorithms that we will need
to efficiently exploit the power management features offered by the memory
subsystems.

IOW, info regarding memory power domain boundaries made available by ACPI 5.0
or even just with some help from the bootloader on some platforms is only the
input to the VM subsystem to understand at what granularity it should manage
things. *How* it manages is the choice of the algorithm/design at the VM level,
which is what this patchset is trying to propose, by exploring several different
designs of doing it and its costs/benefits.

That's the reason I just hard-coded the mem region size to 512 MB in this patchset
and focussed on the VM algorithm to explore what we can do, once we have that
size/boundary info.
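
Just to illustrate what that input could eventually look like, platform code
might hand the VM its region boundaries through something like the hypothetical
hook below (memory_region_register() is an invented name; the current RFC
simply hard-codes 512 MB regions instead):

#include <linux/init.h>

/*
 * Hypothetical platform hook (not part of this patchset): boot/firmware code
 * reports each power-manageable physical range to the VM.
 */
extern int memory_region_register(unsigned long start_pfn,
				  unsigned long end_pfn);

static int __init example_platform_init(void)
{
	/* e.g. two 512 MB regions reported by the BIOS/bootloader */
	memory_region_register(0x00000, 0x20000);	/* pfns for 0..512 MB   */
	memory_region_register(0x20000, 0x40000);	/* pfns for 512 MB..1 GB */
	return 0;
}
early_initcall(example_platform_init);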

> For example, if ACPI
> fires an event indicating that a memory power node is to enter a low
> state then presumably the OS should actively migrate pages away -- even
> if it's going into a state where the contents are still refreshed
> as exiting that state could take a long time.
> 

We are not really looking at ACPI event notifications here. All we expect from
the firmware (at a first level) is info regarding the boundaries, so that the
VM can be intelligent about how it consolidates references. Many of the memory
subsystems can do power-management automatically - like for example, if a
particular chunk of memory is not referenced for a given threshold time, it can
put it into low-power (content preserving) state without the OS telling it to
do it.

> I did not look closely at the patchset at all because it looked like the
> actual support to use it and measure the benefit is missing.
> 

Right, we are focussing on the core VM algorithms for now. The input (ACPI or
other methods) can come later and then we can measure the numbers.

>> How can Linux VM help memory power savings?
>>
>> o Consolidate memory allocations and/or references such that they are
>> not spread across the entire memory address space.  Basically, an area of memory
>> that is not being referenced can reside in a low power state.
>>
> 
> Which the series does not appear to do.
> 

Well, it influences page-allocation to be memory-region aware. So it does
attempt to consolidate allocations (and thereby references). As I mentioned,
the hardware transition to a low-power state can be automatic. The VM must be
intelligent enough to help with that (or at least smart enough not to disrupt
it!), by avoiding spreading allocations everywhere.

>> o Support targeted memory reclaim, where certain areas of memory that can be
>> easily freed can be offlined, allowing those areas of memory to be put into
>> lower power states.
>>
> 
> Which the series does not appear to do judging from this;
> 

Yes, that is one of the items in the TODO list.

>   

Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-08 Thread Mel Gorman
On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
> 
> 
> Today memory subsystems offer a wide range of capabilities for managing
> memory power consumption. As a quick example, if a block of memory is not
> referenced for a threshold amount of time, the memory controller can decide to
> put that chunk into a low-power content-preserving state. And the next
> reference to that memory chunk would bring it back to full power for 
> read/write.
> With this capability in place, it becomes important for the OS to understand
> the boundaries of such power-manageable chunks of memory and to ensure that
> references are consolidated to a minimum number of such memory power 
> management
> domains.
> 

How much power is saved?

> ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
> the firmware can expose information regarding the boundaries of such memory
> power management domains to the OS in a standard way.
> 

I'm not familiar with the ACPI spec but is there support for parsing of
MPST and interpreting the associated ACPI events? For example, if ACPI
fires an event indicating that a memory power node is to enter a low
state then presumably the OS should actively migrate pages away -- even
if it's going into a state where the contents are still refreshed
as exiting that state could take a long time.

I did not look closely at the patchset at all because it looked like the
actual support to use it and measure the benefit is missing.

> How can Linux VM help memory power savings?
> 
> o Consolidate memory allocations and/or references such that they are
> not spread across the entire memory address space.  Basically, an area of memory
> that is not being referenced can reside in a low power state.
> 

Which the series does not appear to do.

> o Support targeted memory reclaim, where certain areas of memory that can be
> easily freed can be offlined, allowing those areas of memory to be put into
> lower power states.
> 

Which the series does not appear to do judging from this;

  include/linux/mm.h |   38 +++
  include/linux/mmzone.h |   52 +
  mm/compaction.c|8 +
  mm/page_alloc.c|  263 
  mm/vmstat.c|   59 ++-

This does not appear to be doing anything with reclaim and not enough with
compaction to indicate that the series actively manages memory placement
in response to ACPI events.

Further in section 5.2.21.4 the spec says that power node regions can
overlap (but are not hierarchical for some reason) but have no gaps, yet the
structure you use to represent them assumes there can be gaps and there are
no overlaps. Again, this is just glancing at the spec and a quick skim of
the patches so maybe I missed something that explains why this structure
is suitable.

It seems to me that superficially the VM implementation for the support
would have

a) Involved a tree that managed the overlapping regions (even if it's
   not hierarchical it feels more sensible) and picked the highest-power-state
   common denominator in the tree. This would only be allocated if support
   for MPST is available.
b) Leave memory allocations and reclaim as they are in the active state.
c) Use a "sticky" migrate list MIGRATE_LOWPOWER for regions that are in lower
   power but still usable with a latency penalty. This might be a single
   migrate type but could also be a parallel set of free_area called
   free_area_lowpower that is only used when free_area is depleted and in
   the very slow path of the allocator (a rough sketch follows below).
d) Use memory hot-remove for power states where the refresh rates were
   not constant

and only did anything expensive in response to an ACPI event -- none of
the fast paths should be touched.
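
To make suggestion (c) a little more concrete, here is a rough sketch of the
parallel free_area variant. All structure and function names are invented for
illustration; this is not code from the series:

#include <linux/mmzone.h>
#include <linux/mm_types.h>
#include <linux/list.h>

/*
 * A parallel set of free areas holding pages from low-power regions,
 * consulted only after the normal free areas are exhausted, i.e. in the
 * allocator slow path.
 */
struct lowpower_areas {
	struct free_area free_area_lowpower[MAX_ORDER];
};

static struct page *try_lowpower_freelist(struct lowpower_areas *lp,
					  unsigned int order, int migratetype)
{
	unsigned int current_order;

	for (current_order = order; current_order < MAX_ORDER; current_order++) {
		struct free_area *area = &lp->free_area_lowpower[current_order];
		struct page *page;

		if (list_empty(&area->free_list[migratetype]))
			continue;

		/* Waking the region up is accepted as a latency penalty. */
		page = list_first_entry(&area->free_list[migratetype],
					struct page, lru);
		list_del(&page->lru);
		area->nr_free--;

		/* Splitting of higher-order pages, zone accounting etc. omitted. */
		return page;
	}

	return NULL;	/* caller falls back to reclaim/compaction as usual */
}

Only this slow-path lookup would ever touch the low-power areas, so the fast
paths stay as they are.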

When transitioning to the low power state, memory should be migrated in
a vaguely similar fashion to what CMA does. For low-power, migration
failure is acceptable. If contents are not preserved, ACPI needs to know
if the migration failed because it cannot enter that power state.

For any of this to be worthwhile, low power states would need to be achieved
for long periods of time because that migration is not free.

> Memory Regions:
> ---
> 
> "Memory Regions" is a way of capturing the boundaries of power-manageable
> chunks of memory, within the MM subsystem.
> 
> Short description of the "Sorted-buddy" design:
> ---
> 
> In this design, the memory region boundaries are captured in a parallel
> data-structure instead of fitting regions between nodes and zones in the
> hierarchy. Further, the buddy allocator is altered, such that we maintain the
> zones' freelists in region-sorted-order and thus do page allocation in the
> order of increasing memory regions.

Implying that this sorting has to happen in either the alloc or free
fast path.

> (The freelists need not be fully
> 

Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-08 Thread Mel Gorman
On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
 
 
 Today memory subsystems offer a wide range of capabilities for managing
 memory power consumption. As a quick example, if a block of memory is not
 referenced for a threshold amount of time, the memory controller can decide to
 put that chunk into a low-power content-preserving state. And the next
 reference to that memory chunk would bring it back to full power for 
 read/write.
 With this capability in place, it becomes important for the OS to understand
 the boundaries of such power-manageable chunks of memory and to ensure that
 references are consolidated to a minimum number of such memory power 
 management
 domains.
 

How much power is saved?

 ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
 the firmware can expose information regarding the boundaries of such memory
 power management domains to the OS in a standard way.
 

I'm not familiar with the ACPI spec but is there support for parsing of
MPST and interpreting the associated ACPI events? For example, if ACPI
fires an event indicating that a memory power node is to enter a low
state then presumably the OS should actively migrate pages away -- even
if it's going into a state where the contents are still refreshed
as exiting that state could take a long time.

I did not look closely at the patchset at all because it looked like the
actual support to use it and measure the benefit is missing.

 How can Linux VM help memory power savings?
 
 o Consolidate memory allocations and/or references such that they are
 not spread across the entire memory address space.  Basically, an area of memory
 that is not being referenced can reside in a low power state.
 

Which the series does not appear to do.

 o Support targeted memory reclaim, where certain areas of memory that can be
 easily freed can be offlined, allowing those areas of memory to be put into
 lower power states.
 

Which the series does not appear to do judging from this;

  include/linux/mm.h |   38 +++
  include/linux/mmzone.h |   52 +
  mm/compaction.c|8 +
  mm/page_alloc.c|  263 
  mm/vmstat.c|   59 ++-

This does not appear to be doing anything with reclaim and not enough with
compaction to indicate that the series actively manages memory placement
in response to ACPI events.

Further in section 5.2.21.4 the spec says that power node regions can
overlap (but are not hierarchical for some reason) but have no gaps, yet the
structure you use to represent them assumes there can be gaps and there are
no overlaps. Again, this is just glancing at the spec and a quick skim of
the patches so maybe I missed something that explains why this structure
is suitable.

It seems to me that superficially the VM implementation for the support
would have

a) Involved a tree that managed the overlapping regions (even if it's
   not hierarchical it feels more sensible) and picked the highest-power-state
   common denominator in the tree. This would only be allocated if support
   for MPST is available.
b) Leave memory allocations and reclaim as they are in the active state.
c) Use a sticky migrate list MIGRATE_LOWPOWER for regions that are in lower
   power but still usable with a latency penalty. This might be a single
   migrate type but could also be a parallel set of free_area called
   free_area_lowpower that is only used when free_area is depleted and in
   the very slow path of the allocator.
d) Use memory hot-remove for power states where the refresh rates were
   not constant

and only did anything expensive in response to an ACPI event -- none of
the fast paths should be touched.

When transitioning to the low power state, memory should be migrated in
a vaguely similar fashion to what CMA does. For low-power, migration
failure is acceptable. If contents are not preserved, ACPI needs to know
if the migration failed because it cannot enter that power state.

For any of this to be worthwhile, low power states would need to be achieved
for long periods of time because that migration is not free.

 Memory Regions:
 ---
 
 Memory Regions is a way of capturing the boundaries of power-manageable
 chunks of memory, within the MM subsystem.
 
 Short description of the Sorted-buddy design:
 ---
 
 In this design, the memory region boundaries are captured in a parallel
 data-structure instead of fitting regions between nodes and zones in the
 hierarchy. Further, the buddy allocator is altered, such that we maintain the
 zones' freelists in region-sorted-order and thus do page allocation in the
 order of increasing memory regions.

Implying that this sorting has to happen in either the alloc or free
fast path.

 (The freelists need not be fully
 address-sorted, they just need to be region-sorted. Patch 

Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-08 Thread Srivatsa S. Bhat
On 11/08/2012 11:32 PM, Mel Gorman wrote:
 On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
 

 Today memory subsystems offer a wide range of capabilities for managing
 memory power consumption. As a quick example, if a block of memory is not
 referenced for a threshold amount of time, the memory controller can decide 
 to
 put that chunk into a low-power content-preserving state. And the next
 reference to that memory chunk would bring it back to full power for 
 read/write.
 With this capability in place, it becomes important for the OS to understand
 the boundaries of such power-manageable chunks of memory and to ensure that
 references are consolidated to a minimum number of such memory power 
 management
 domains.

 
 How much power is saved?

Last year, Amit had evaluated the Hierarchy patchset on a Samsung Exynos (ARM)
board and reported that it could save up to 6.3% relative to total system power.
(This was when he allowed only 1 GB out of the total 2 GB RAM to enter low
power states).

Below is the link to his post, as mentioned in the references section in the
cover letter.
http://article.gmane.org/gmane.linux.kernel.mm/65935

Of course, the power savings depends on the characteristics of the particular
hardware memory subsystem used, and the amount of memory present in the system.

 
 ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
 the firmware can expose information regarding the boundaries of such memory
 power management domains to the OS in a standard way.

 
 I'm not familiar with the ACPI spec but is there support for parsing of
 MPST and interpreting the associated ACPI events?

Sorry I should have been clearer when I mentioned ACPI 5.0. I mentioned ACPI 5.0
just to make a point that support for getting the memory power management
boundaries from the firmware is not far away. I didn't mean to say that that's
the only target for memory power management. Like I mentioned above, last year
the power-savings benefit was measured on ARM boards. The aim of this patchset
is to propose and evaluate some of the core VM algorithms that we will need
to efficiently exploit the power management features offered by the memory
subsystems.

IOW, info regarding memory power domain boundaries made available by ACPI 5.0
or even just with some help from the bootloader on some platforms is only the
input to the VM subsystem to understand at what granularity it should manage
things. *How* it manages is the choice of the algorithm/design at the VM level,
which is what this patchset is trying to propose, by exploring several different
designs of doing it and its costs/benefits.

That's the reason I just hard-coded the mem region size to 512 MB in this patchset
and focussed on the VM algorithm to explore what we can do, once we have that
size/boundary info.

 For example, if ACPI
 fires an event indicating that a memory power node is to enter a low
 state then presumably the OS should actively migrate pages away -- even
 if it's going into a state where the contents are still refreshed
 as exiting that state could take a long time.
 

We are not really looking at ACPI event notifications here. All we expect from
the firmware (at a first level) is info regarding the boundaries, so that the
VM can be intelligent about how it consolidates references. Many of the memory
subsystems can do power-management automatically - like for example, if a
particular chunk of memory is not referenced for a given threshold time, it can
put it into low-power (content preserving) state without the OS telling it to
do it.

 I did not look closely at the patchset at all because it looked like the
 actual support to use it and measure the benefit is missing.
 

Right, we are focussing on the core VM algorithms for now. The input (ACPI or
other methods) can come later and then we can measure the numbers.

 How can Linux VM help memory power savings?

 o Consolidate memory allocations and/or references such that they are
 not spread across the entire memory address space.  Basically, an area of memory
 that is not being referenced can reside in a low power state.

 
 Which the series does not appear to do.
 

Well, it influences page-allocation to be memory-region aware. So it does
attempt to consolidate allocations (and thereby references). As I mentioned,
the hardware transition to a low-power state can be automatic. The VM must be
intelligent enough to help with that (or at least smart enough not to disrupt
it!), by avoiding spreading allocations everywhere.

 o Support targeted memory reclaim, where certain areas of memory that can be
 easily freed can be offlined, allowing those areas of memory to be put into
 lower power states.

 
 Which the series does not appear to do judging from this;
 

Yes, that is one of the items in the TODO list.

   include/linux/mm.h |   38 +++
   include/linux/mmzone.h |   52 +
   

Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-08 Thread Vaidyanathan Srinivasan
* Mel Gorman mgor...@suse.de [2012-11-08 18:02:57]:

 On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
  

Hi Mel,

Thanks for detailed review and comments.  The goal of this patch
series is to brainstorm on ideas that enable Linux VM to record and
exploit memory region boundaries.

The first approach that we had last year (hierarchy) has more runtime
overhead.  This approach of sorted-buddy was one of the alternatives
discussed earlier and we are trying to find out if simple requirements
of biasing memory allocations can be achieved with this approach.

Smart reclaim based on this approach is a key piece we still need to
design.  Ideas from compaction will certainly help.

  Today memory subsystems offer a wide range of capabilities for managing
  memory power consumption. As a quick example, if a block of memory is not
  referenced for a threshold amount of time, the memory controller can decide 
  to
  put that chunk into a low-power content-preserving state. And the next
  reference to that memory chunk would bring it back to full power for 
  read/write.
  With this capability in place, it becomes important for the OS to understand
  the boundaries of such power-manageable chunks of memory and to ensure that
  references are consolidated to a minimum number of such memory power 
  management
  domains.
  
 
 How much power is saved?

On embedded platform the savings could be around 5% as discussed in
the earlier thread: http://article.gmane.org/gmane.linux.kernel.mm/65935

On larger servers with large amounts of memory the savings could be
more.  We do not yet have all the pieces together to evaluate.

  ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
  the firmware can expose information regarding the boundaries of such memory
  power management domains to the OS in a standard way.
  
 
 I'm not familiar with the ACPI spec but is there support for parsing of
 MPST and interpreting the associated ACPI events? For example, if ACPI
 fires an event indicating that a memory power node is to enter a low
 state then presumably the OS should actively migrate pages away -- even
 if it's going into a state where the contents are still refreshed
 as exiting that state could take a long time.
 
 I did not look closely at the patchset at all because it looked like the
 actual support to use it and measure the benefit is missing.

Correct.  The platform interface part is not included in this patch
set mainly because there is not much design required there.  Each
platform can have code to collect the memory region boundaries from
BIOS/firmware and load it into the Linux VM.  The goal of this patch
is to brainstorm on the idea of how the core VM should use the region
information.
 
  How can Linux VM help memory power savings?
  
  o Consolidate memory allocations and/or references such that they are
  not spread across the entire memory address space.  Basically, an area of memory
  that is not being referenced can reside in a low power state.
  
 
 Which the series does not appear to do.

Correct.  We need to design the correct reclaim strategy for this to
work.  However, having the buddy list sorted by region address could get us
one step closer to shaping the allocations.

  o Support targeted memory reclaim, where certain areas of memory that can be
  easily freed can be offlined, allowing those areas of memory to be put into
  lower power states.
  
 
 Which the series does not appear to do judging from this;
 
   include/linux/mm.h |   38 +++
   include/linux/mmzone.h |   52 +
   mm/compaction.c|8 +
   mm/page_alloc.c|  263 
 
   mm/vmstat.c|   59 ++-
 
 This does not appear to be doing anything with reclaim and not enough with
 compaction to indicate that the series actively manages memory placement
 in response to ACPI events.

Correct.  Evaluating different ideas for reclaim will be next step
before getting into the platform interface parts.

 Further in section 5.2.21.4 the spec says that power node regions can
 overlap (but are not hierarchical for some reason) but have no gaps, yet the
 structure you use to represent them assumes there can be gaps and there are
 no overlaps. Again, this is just glancing at the spec and a quick skim of
 the patches so maybe I missed something that explains why this structure
 is suitable.

This patch is roughly based on the idea that ACPI MPST will give us
memory region boundaries.  It is not designed to implement all options
defined in the spec.  We have taken the general case where regions do not
overlap, while the memory addresses themselves can be discontiguous.

 It seems to me that superficially the VM implementation for the support
 would have
 
 a) Involved a tree that managed the overlapping regions (even if it's
   not hierarchical it feels more sensible) and picked the