Re: [PATCH] powerpc/time: use get_tb instead of get_vtb in running_clock

2017-07-13 Thread hejianet

Hi Ben
I added some printk logs in watchdog_timer_fn in the guest:
[   16.025222] get_vtb=8236291881, get_tb=13756711357, get_timestamp=4
[   20.025624] get_vtb=9745285807, get_tb=15804711283, get_timestamp=7
[   24.025042] get_vtb=11518119641, get_tb=17852711085, get_timestamp=10
[   28.024074] get_vtb=13192704319, get_tb=19900711071, get_timestamp=13
[   32.024086] get_vtb=14856516982, get_tb=21948711066, get_timestamp=16
[   36.024075] get_vtb=16569127618, get_tb=23996711078, get_timestamp=20
[   40.024138] get_vtb=17008865823, get_tb=26044718418, get_timestamp=20
[   44.023993] get_vtb=17020637241, get_tb=28092716383, get_timestamp=20
[   48.023996] get_vtb=17022857170, get_tb=30140718472, get_timestamp=20
[   52.023996] get_vtb=17024268541, get_tb=32188718432, get_timestamp=20
[   56.023996] get_vtb=17036577783, get_tb=34236718077, get_timestamp=20
[   60.023996] get_vtb=17037829743, get_tb=36284718437, get_timestamp=20
[   64.023992] get_vtb=17039846747, get_tb=38332716609, get_timestamp=20
[   68.023991] get_vtb=17041448345, get_tb=40380715903, get_timestamp=20

get_timestamp() (which uses get_vtb(); the unit is seconds) falls behind
the printk timestamps. You can also clearly see that the get_vtb
increment is much smaller than the get_tb increment.

Without this patch, I suspect some softlockup warnings could be missed
in the guest.

-Jia

On 13/07/2017 6:45 AM, Benjamin Herrenschmidt wrote:

On Wed, 2017-07-12 at 23:01 +0800, Jia He wrote:

The virtual time base (VTB) is a register which advances only while in
the guest. Any exit from guest to host stops the VTB (it is saved and
restored by KVM). But if an I/O operation causes the guest to exit to
the host, the guest's watchdog
(watchdog_timer_fn -> is_softlockup -> get_timestamp -> running_clock)
needs to also include the time elapsed in the host, so get_vtb is not
correct in this case.

Also, TB_OFFSET is properly saved and restored by QEMU since commit [1],
so we can use get_tb here.


That completely defeats the purpose here... This was done specifically
to exploit the VTB which doesn't count in hypervisor mode.



[1] http://git.qemu.org/?p=qemu.git;a=commit;h=42043e4f1

Signed-off-by: Jia He 
---
  arch/powerpc/kernel/time.c | 7 +++
  1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index fe6f3a2..c542dd3 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -695,16 +695,15 @@ notrace unsigned long long sched_clock(void)
  unsigned long long running_clock(void)
  {
/*
-* Don't read the VTB as a host since KVM does not switch in host
-* timebase into the VTB when it takes a guest off the CPU, reading the
-* VTB would result in reading 'last switched out' guest VTB.
+* Use get_tb instead of get_vtb for guest since the TB_OFFSET has been
+* well saved/restored when qemu does suspend/resume.
 *
 * Host kernels are often compiled with CONFIG_PPC_PSERIES checked, it
 * would be unsafe to rely only on the #ifdef above.
 */
if (firmware_has_feature(FW_FEATURE_LPAR) &&
cpu_has_feature(CPU_FTR_ARCH_207S))
-   return mulhdu(get_vtb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift;
+   return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift;
  
  	/*

 * This is a next best approximation without a VTB.




Re: [PATCH v2] mm/vmscan: fix high cpu usage of kswapd if there are no reclaimable pages

2017-02-27 Thread hejianet

Hi Johannes

I have another concern:
kswapd -> balance_pgdat -> age_active_anon
This code path does some background work to age the anon list; will this
patch affect it if the retry count exceeds 16 and kswapd is not woken
up?

B.R.
Jia

On 28/02/2017 1:06 AM, Johannes Weiner wrote:

On Mon, Feb 27, 2017 at 09:50:24AM +0100, Michal Hocko wrote:

On Fri 24-02-17 11:51:05, Johannes Weiner wrote:
[...]

>From 29fefdca148e28830e0934d4e6cceb95ed2ee36e Mon Sep 17 00:00:00 2001
From: Johannes Weiner 
Date: Fri, 24 Feb 2017 10:56:32 -0500
Subject: [PATCH] mm: vmscan: disable kswapd on unreclaimable nodes

Jia He reports a problem with kswapd spinning at 100% CPU when
requesting more hugepages than memory available in the system:

$ echo 4000 >/proc/sys/vm/nr_hugepages

top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  31371520 total, 30915136 used,   456384 free,  320 buffers
KiB Swap:  6284224 total,   115712 used,  6168512 free.48192 cached Mem

  PID USER  PR  NIVIRTRESSHR S  %CPU  %MEM TIME+ COMMAND
   76 root  20   0   0  0  0 R 100.0 0.000 217:17.29 kswapd3

At that time, there are no reclaimable pages left in the node, but as
kswapd fails to restore the high watermarks it refuses to go to sleep.

Kswapd needs to back away from nodes that fail to balance. Up until
1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
kswapd had such a mechanism. It considered zones whose theoretically
reclaimable pages it had reclaimed six times over as unreclaimable and
backed away from them. This guard was erroneously removed as the patch
changed the definition of a balanced node.

However, simply restoring this code wouldn't help in the case reported
here: there *are* no reclaimable pages that could be scanned until the
threshold is met. Kswapd would stay awake anyway.

Introduce a new and much simpler way of backing off. If kswapd runs
through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
page, make it back off from the node. This is the same number of shots
direct reclaim takes before declaring OOM. Kswapd will go to sleep on
that node until a direct reclaimer manages to reclaim some pages, thus
proving the node reclaimable again.


Yes, this looks nice. I would just be worried about [1] a bit.
Maybe that is worth a separate patch, though.

[1] http://lkml.kernel.org/r/20170223111609.hlncnvokhq3qu...@dhcp22.suse.cz


I think I'd prefer the simplicity of keeping this contained inside
vmscan.c, as an interaction between direct reclaimers and kswapd, as
well as leaving the wakeup tied to actually seeing reclaimable pages
rather than merely producing free pages (e.g. should we also add a
kick to a large munmap() for example?).

OOM kills come with such high latencies that I cannot imagine a
slightly quicker kswapd restart would matter in practice.


Reported-by: Jia He 
Signed-off-by: Johannes Weiner 


Acked-by: Michal Hocko 


Thanks!


I would have just one more suggestion. Please move MAX_RECLAIM_RETRIES
to mm/internal.h. This is MM internal thing and there is no need to make
it visible.


Good point, I'll move it.
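The backoff scheme described in the patch can be sketched as a small state machine. This is a hypothetical simplification of the vmscan.c changes, with names mirroring the patch's kswapd_failures field:

```python
MAX_RECLAIM_RETRIES = 16  # same cap direct reclaim uses before declaring OOM

class Node:
    """Stand-in for pglist_data, tracking consecutive failed balance runs."""
    def __init__(self):
        self.kswapd_failures = 0

def balance_run(node, pages_reclaimed):
    """Model one balance_pgdat() pass; returns True while kswapd stays awake."""
    if pages_reclaimed == 0:
        node.kswapd_failures += 1      # another 'reclaimed == 0' run
    else:
        node.kswapd_failures = 0       # progress resets the counter
    return node.kswapd_failures < MAX_RECLAIM_RETRIES

def direct_reclaim_success(node):
    # A direct reclaimer freeing pages proves the node reclaimable again,
    # so kswapd is allowed to resume.
    node.kswapd_failures = 0
```

After 16 consecutive runs that reclaim nothing, balance_run returns False and kswapd sleeps on that node until a direct reclaimer makes progress.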



Re: [PATCH v2] mm/vmscan: fix high cpu usage of kswapd if there are no reclaimable pages

2017-02-26 Thread hejianet


Hi
Tested-by: Jia He 

cat /proc/meminfo
[...]
CmaFree:   0 kB
HugePages_Total:1831
HugePages_Free: 1831
HugePages_Rsvd:0
HugePages_Surp:0
Hugepagesize:  16384 kB

top - 06:50:29 up  1:26,  1 user,  load average: 0.00, 0.00, 0.00
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.2 sy,  0.0 ni, 99.6 id,  0.2 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  31371520 total, 30577664 used,   793856 free,  256 buffers
KiB Swap:  6284224 total,  128 used,  6284096 free.   281280 cached Mem

  PID USER  PR  NIVIRTRESSHR S  %CPU  %MEM TIME+ COMMAND
   79 root  20   0   0  0  0 S 0.000 0.000   0:00.00 kswapd3


On 25/02/2017 12:51 AM, Johannes Weiner wrote:

On Fri, Feb 24, 2017 at 09:49:50AM +0100, Michal Hocko wrote:

I believe we should pursue the proposal from Johannes which is more
generic and copes with corner cases much better.


Jia, can you try this? I'll put the cleanups in follow-up patches.

---


From 29fefdca148e28830e0934d4e6cceb95ed2ee36e Mon Sep 17 00:00:00 2001

From: Johannes Weiner 
Date: Fri, 24 Feb 2017 10:56:32 -0500
Subject: [PATCH] mm: vmscan: disable kswapd on unreclaimable nodes

Jia He reports a problem with kswapd spinning at 100% CPU when
requesting more hugepages than memory available in the system:

$ echo 4000 >/proc/sys/vm/nr_hugepages

top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  31371520 total, 30915136 used,   456384 free,  320 buffers
KiB Swap:  6284224 total,   115712 used,  6168512 free.48192 cached Mem

  PID USER  PR  NIVIRTRESSHR S  %CPU  %MEM TIME+ COMMAND
   76 root  20   0   0  0  0 R 100.0 0.000 217:17.29 kswapd3

At that time, there are no reclaimable pages left in the node, but as
kswapd fails to restore the high watermarks it refuses to go to sleep.

Kswapd needs to back away from nodes that fail to balance. Up until
1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
kswapd had such a mechanism. It considered zones whose theoretically
reclaimable pages it had reclaimed six times over as unreclaimable and
backed away from them. This guard was erroneously removed as the patch
changed the definition of a balanced node.

However, simply restoring this code wouldn't help in the case reported
here: there *are* no reclaimable pages that could be scanned until the
threshold is met. Kswapd would stay awake anyway.

Introduce a new and much simpler way of backing off. If kswapd runs
through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
page, make it back off from the node. This is the same number of shots
direct reclaim takes before declaring OOM. Kswapd will go to sleep on
that node until a direct reclaimer manages to reclaim some pages, thus
proving the node reclaimable again.

Reported-by: Jia He 
Signed-off-by: Johannes Weiner 
---
 include/linux/mmzone.h |  2 ++
 include/linux/swap.h   |  1 +
 mm/page_alloc.c|  6 --
 mm/vmscan.c| 20 
 4 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8e02b3750fe0..d2c50ab6ae40 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -630,6 +630,8 @@ typedef struct pglist_data {
int kswapd_order;
enum zone_type kswapd_classzone_idx;

+   int kswapd_failures;/* Number of 'reclaimed == 0' runs */
+
 #ifdef CONFIG_COMPACTION
int kcompactd_max_order;
enum zone_type kcompactd_classzone_idx;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 45e91dd6716d..5c06581a730b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -288,6 +288,7 @@ extern void lru_cache_add_active_or_unevictable(struct page 
*page,
struct vm_area_struct *vma);

 /* linux/mm/vmscan.c */
+#define MAX_RECLAIM_RETRIES 16
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 614cd0397ce3..83f0442f07fa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 }

 /*
- * Maximum number of reclaim retries without any progress before OOM killer
- * is consider as the only way to move forward.
- */
-#define MAX_RECLAIM_RETRIES 16
-
-/*
  * Checks whether it makes sense to retry the reclaim to make a forward 
progress
  * for the given allocation request.
  * The reclaim feedback represented by did_some_progress (any progress during
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 26c3b405ef34..8e9bdd172182 100644

Fwd: Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there

2017-02-22 Thread hejianet


resend it to lkml only.

 Forwarded Message 
Subject: Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there
Date: Thu, 23 Feb 2017 10:46:01 +0800
From: hejianet <hejia...@gmail.com>
To: Johannes Weiner <han...@cmpxchg.org>
CC: linux...@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton 
<a...@linux-foundation.org>, Mel Gorman <mgor...@techsingularity.net>, Vlastimil 
Babka <vba...@suse.cz>, Michal Hocko <mho...@suse.com>, Minchan Kim 
<minc...@kernel.org>, Rik van Riel <r...@redhat.com>


Sorry, resending due to a delivery failure:
"Wrong MIME labeling on 8-bit character texts"
Apologies if anybody received it twice.

Hi Johannes
On 23/02/2017 4:16 AM, Johannes Weiner wrote:

On Wed, Feb 22, 2017 at 05:04:48PM +0800, Jia He wrote:

When I try to dynamically allocate the hugepages more than system total
free memory:



Then kswapd takes 100% CPU for a long time (more than 3 hours, with no
sign of ending)



The root cause is that kswapd3 tries to reclaim again and again but
makes no progress



At that time, there are no reclaimable pages in that node:


Yes, this is a problem with the current kswapd code.

A less artificial scenario that I observed recently was machines with
two NUMA nodes, after being up for 200+ days, getting into a state
where node0 is mostly consumed by anon and some kernel allocations,
leaving less than the high watermark free. The machines don't have
swap, so the anon isn't reclaimable. But also, anon LRU is never even
*scanned*, so the "all unreclaimable" logic doesn't kick in. Kswapd is
spinning at 100% CPU calculating scan counts and checking zone states.

One specific problem with your patch, Jia, is that there might be some
cache pages that are pinned one way or another. That was the case on
our machines, and so reclaimable pages wasn't 0. Even if we check the
reclaimable pages, we need a hard cutoff after X attempts. And then it
sounds pretty much like what the allocator/direct reclaim already does.

Can we use the *exact* same cutoff conditions for direct reclaim and
kswapd, though? I don't think so. For direct reclaim, the goal is the
watermark, to make an allocation happen in the caller. While kswapd
tries to restore the watermarks too, it might never meet them but
still do useful work on behalf of concurrently allocating threads. It
should only stop when it tries and fails to free any pages at all.


Yes, this is what I thought before this patch, but it seems Michal
doesn't like this idea :)
Please see https://lkml.org/lkml/2017/1/24/543

Frankly speaking, it had some impact, but not much, on kswapd's high
CPU usage (I haven't tested your patch here, but I have tested my
first patch).

Because when the memory in nodeN is mostly consumed, any direct reclaim
path will try to wake up kswapd:
__alloc_pages_slowpath
wake_all_kswapds
wakeup_kswapd
retry for 16 times

Compared with the idea of limiting kswapd's no-progress loop, avoiding
the kswapd wakeup has a significant impact on the high CPU usage.

B.R.
Jia

The recent removal of the scanned > reclaimable * 6 exit condition
from kswapd was a mistake, we need something like it. But it's also
broken the way we perform reclaim on swapless systems, so instead of
re-instating it, we should remove the remnants and do something new.

Can we simply count the number of balance_pgdat() runs that didn't
reclaim anything and have kswapd sleep after MAX_RECLAIM_RETRIES?

And a follow-up: once it gives up, when should kswapd return to work?
We used to reset NR_PAGES_SCANNED whenever a page gets freed. But
that's a branch in a common allocator path, just to recover kswapd - a
latency tool, not a necessity for functional correctness - from a
situation that's exceedingly rare. How about we leave it
disabled until a direct reclaimer manages to free something?

Something like this (untested):

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 96c78d840916..611c5fb52d7b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -149,7 +149,6 @@ enum node_stat_item {
NR_UNEVICTABLE, /*  " " "   "   " */
NR_ISOLATED_ANON,   /* Temporary isolated pages from anon lru */
NR_ISOLATED_FILE,   /* Temporary isolated pages from file lru */
-   NR_PAGES_SCANNED,   /* pages scanned since last reclaim */
PGSCAN_ANON,
PGSCAN_FILE,
PGSTEAL_ANON,
@@ -630,6 +629,8 @@ typedef struct pglist_data {
int kswapd_order;
enum zone_type kswapd_classzone_idx;

+   int kswapd_failed_runs;
+
 #ifdef CONFIG_COMPACTION
int kcompactd_max_order;
enum zone_type kcompactd_classzone_idx;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 145005e6dc85..9de49e2710af 100644
--- a/include/linux/swap.h
+++ b/include/lin

Fwd: Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there

2017-02-22 Thread hejianet


resend it to lkml only.

 Forwarded Message 
Subject: Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there
Date: Thu, 23 Feb 2017 10:46:01 +0800
From: hejianet 
To: Johannes Weiner 
CC: linux...@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton 
, Mel Gorman , Vlastimil 
Babka , Michal Hocko , Minchan Kim 
, Rik van Riel 


sorry, resend it due to a delivery-failure:
"Wrong MIME labeling on 8-bit character texts"
I am sorry if anybody received it twice

Hi Johannes
On 23/02/2017 4:16 AM, Johannes Weiner wrote:

On Wed, Feb 22, 2017 at 05:04:48PM +0800, Jia He wrote:

When I try to dynamically allocate the hugepages more than system total
free memory:



Then the kswapd will take 100% cpu for a long time(more than 3 hours, and
will not be about to end)



The root cause is kswapd3 is trying to do relaim again and again but it
makes no progress



At that time, there are no relaimable pages in that node:


Yes, this is a problem with the current kswapd code.

A less artificial scenario that I observed recently was machines with
two NUMA nodes, after being up for 200+ days, getting into a state
where node0 is mostly consumed by anon and some kernel allocations,
leaving less than the high watermark free. The machines don't have
swap, so the anon isn't reclaimable. But also, anon LRU is never even
*scanned*, so the "all unreclaimable" logic doesn't kick in. Kswapd is
spinning at 100% CPU calculating scan counts and checking zone states.

One specific problem with your patch, Jia, is that there might be some
cache pages that are pinned one way or another. That was the case on
our machines, and so reclaimable pages wasn't 0. Even if we check the
reclaimable pages, we need a hard cutoff after X attempts. And then it
sounds pretty much like what the allocator/direct reclaim already does.

Can we use the *exact* same cutoff conditions for direct reclaim and
kswapd, though? I don't think so. For direct reclaim, the goal is the
watermark, to make an allocation happen in the caller. While kswapd
tries to restore the watermarks too, it might never meet them but
still do useful work on behalf of concurrently allocating threads. It
should only stop when it tries and fails to free any pages at all.


Yes, this is what I thought before this patch, but seems Michal
doesn't like this idea :)
Please see https://lkml.org/lkml/2017/1/24/543

Frankly speaking, it did a little impact but not so much on
kswapd high cpu usage(not tested your patch here but I've tested
my 1st patch)

Because in the case when the memory in nodeN is mostly consumed,
any direct reclaim path will try to wake up the kswapd:
__alloc_pages_slowpath
wake_all_kswapds
wakeup_kswapd
retry for 16 times

Compared with the idea which limited the no progress loop of kswapd,
avoiding wakeup kswapd can have significant impact on the high cpu
usage.

B.R.
Jia

The recent removal of the scanned > reclaimable * 6 exit condition
from kswapd was a mistake, we need something like it. But it's also
broken the way we perform reclaim on swapless systems, so instead of
re-instating it, we should remove the remnants and do something new.

Can we simply count the number of balance_pgdat() runs that didn't
reclaim anything and have kswapd sleep after MAX_RECLAIM_RETRIES?

And a follow-up: once it gives up, when should kswapd return to work?
We used to reset NR_PAGES_SCANNED whenever a page gets freed. But
that's a branch in a common allocator path, just to recover kswapd - a
latency tool, not a necessity for functional correctness - from a
situation that's exceedingly pretty rare. How about we leave it
disabled until a direct reclaimer manages to free something?

Something like this (untested):

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 96c78d840916..611c5fb52d7b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -149,7 +149,6 @@ enum node_stat_item {
NR_UNEVICTABLE, /*  " " "   "   " */
NR_ISOLATED_ANON,   /* Temporary isolated pages from anon lru */
NR_ISOLATED_FILE,   /* Temporary isolated pages from file lru */
-   NR_PAGES_SCANNED,   /* pages scanned since last reclaim */
PGSCAN_ANON,
PGSCAN_FILE,
PGSTEAL_ANON,
@@ -630,6 +629,8 @@ typedef struct pglist_data {
int kswapd_order;
enum zone_type kswapd_classzone_idx;

+   int kswapd_failed_runs;
+
 #ifdef CONFIG_COMPACTION
int kcompactd_max_order;
enum zone_type kcompactd_classzone_idx;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 145005e6dc85..9de49e2710af 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -285,8 +285,8 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
					    struct vm_area_struct *vma);

 /* linux/mm/vmscan.c */
+#def

Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there

2017-02-22 Thread hejianet

Hi Michal

On 22/02/2017 11:48 PM, Michal Hocko wrote:

On Wed 22-02-17 22:31:50, hejianet wrote:

Hi Michal

On 22/02/2017 7:41 PM, Michal Hocko wrote:

On Wed 22-02-17 17:04:48, Jia He wrote:

When I try to dynamically allocate the hugepages more than system total
free memory:
e.g. echo 4000 >/proc/sys/vm/nr_hugepages


I assume that the command has terminated with less huge pages allocated
than requested but


Yes, in the end the allocated hugepages are less than 4000:
HugePages_Total:1864
HugePages_Free: 1864
HugePages_Rsvd:0
HugePages_Surp:0
Hugepagesize:  16384 kB

In the bad case, although kswapd takes 100% cpu, the number of
HugePages_Total does not increase at all.


Node 3, zone  DMA

[...]

  pages free 2951
min  2821
low  3526
high 4231


it left the zone below high watermark with


   node_scanned  0
spanned  245760
present  245760
managed  245388
  nr_free_pages 2951
  nr_zone_inactive_anon 0
  nr_zone_active_anon 0
  nr_zone_inactive_file 0
  nr_zone_active_file 0


no pages reclaimable, so kswapd will not go to sleep. It would be quite
easy and comfortable to call it a misconfiguration but it seems that
it might be quite easy to hit with NUMA machines which have large
differences in the node sizes. I guess it makes sense to back off
the kswapd rather than burning CPU without any way to make forward
progress.


agree.


please make sure that this information is in the changelog

[...]

@@ -3502,6 +3503,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 {
pg_data_t *pgdat;
int z;
+   int node_has_relaimable_pages = 0;

if (!managed_zone(zone))
return;
@@ -3522,8 +3524,15 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)

if (zone_balanced(zone, order, classzone_idx))
return;
+
+   if (!zone_reclaimable_pages(zone))
+   node_has_relaimable_pages = 1;


What, this doesn't make any sense? Did you mean if (zone_reclaimable_pages)?


I mean, if any one zone has reclaimable pages, then this zone's *node* has
reclaimable pages. Thus, the kswapdN for this node should be woken up.
e.g. node 1 has 2 zones.
zone A has no reclaimable pages but zone B has.
Thus node 1 has reclaimable pages, and kswapd1 will be woken up.
I use node_has_relaimable_pages in the loop to check all the zones'
reclaimable page counts. So I prefer the name node_has_relaimable_pages
instead of zone_has_relaimable_pages


I still do not understand. This code starts with
node_has_relaimable_pages = 0. If you see a zone with no reclaimable
pages then you make it node_has_relaimable_pages = 1 which means that


+   /* Dont wake kswapd if no reclaimable pages */
+   if (!node_has_relaimable_pages)
+   return;


this will not hold and we will wake up the kswapd. I believe what
you want instead is to skip the wake up if _all_ zones have
!zone_reclaimable_pages(). Or am I missing your point? This means that
you want
if (zone_reclaimable_pages(zone))
node_has_relaimable_pages = 1;

You are right, will send v2 soon after testing it
B.R.
Jia



+
trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
wake_up_interruptible(&pgdat->kswapd_wait);




Re: [RFC PATCH] mm/vmscan: fix high cpu usage of kswapd if there

2017-02-22 Thread hejianet

Hi Michal

On 22/02/2017 7:41 PM, Michal Hocko wrote:

On Wed 22-02-17 17:04:48, Jia He wrote:

When I try to dynamically allocate the hugepages more than system total
free memory:
e.g. echo 4000 >/proc/sys/vm/nr_hugepages


I assume that the command has terminated with less huge pages allocated
than requested but


Yes, in the end the allocated hugepages are less than 4000:
HugePages_Total:1864
HugePages_Free: 1864
HugePages_Rsvd:0
HugePages_Surp:0
Hugepagesize:  16384 kB

In the bad case, although kswapd takes 100% cpu, the number of
HugePages_Total does not increase at all.


Node 3, zone  DMA

[...]

  pages free 2951
min  2821
low  3526
high 4231


it left the zone below high watermark with


   node_scanned  0
spanned  245760
present  245760
managed  245388
  nr_free_pages 2951
  nr_zone_inactive_anon 0
  nr_zone_active_anon 0
  nr_zone_inactive_file 0
  nr_zone_active_file 0


no pages reclaimable, so kswapd will not go to sleep. It would be quite
easy and comfortable to call it a misconfiguration but it seems that
it might be quite easy to hit with NUMA machines which have large
differences in the node sizes. I guess it makes sense to back off
the kswapd rather than burning CPU without any way to make forward
progress.

agree.


[...]


diff --git a/mm/vmscan.c b/mm/vmscan.c
index 532a2a7..a05e3ab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3139,7 +3139,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
if (!managed_zone(zone))
continue;

-   if (!zone_balanced(zone, order, classzone_idx))
+   if (!zone_balanced(zone, order, classzone_idx)
+   && zone_reclaimable_pages(zone))
return false;


OK, this makes some sense, although zone_reclaimable_pages doesn't count
SLAB reclaimable pages. So we might go to sleep with a reclaimable slab
still around. This is not really easy to address because the reclaimable
slab doesn't really imply that those pages will be reclaimed...

Yes, even in the bad case, when kswapd takes all the cpu, the reclaimable
pages are not decreased



}

@@ -3502,6 +3503,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 {
pg_data_t *pgdat;
int z;
+   int node_has_relaimable_pages = 0;

if (!managed_zone(zone))
return;
@@ -3522,8 +3524,15 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)

if (zone_balanced(zone, order, classzone_idx))
return;
+
+   if (!zone_reclaimable_pages(zone))
+   node_has_relaimable_pages = 1;


What, this doesn't make any sense? Did you mean if (zone_reclaimable_pages)?

I mean, if any one zone has reclaimable pages, then this zone's *node* has
reclaimable pages. Thus, the kswapdN for this node should be woken up.
e.g. node 1 has 2 zones.
zone A has no reclaimable pages but zone B has.
Thus node 1 has reclaimable pages, and kswapd1 will be woken up.
I use node_has_relaimable_pages in the loop to check all the zones'
reclaimable page counts. So I prefer the name node_has_relaimable_pages
instead of zone_has_relaimable_pages

Did I understand it correctly? Thanks

B.R.
Jia



}

+   /* Dont wake kswapd if no reclaimable pages */
+   if (!node_has_relaimable_pages)
+   return;
+
trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
wake_up_interruptible(&pgdat->kswapd_wait);
 }
--
1.8.5.6

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org








Re: [GIT PULL] cputime: Convert core use of cputime_t to nsecs

2017-02-20 Thread hejianet

Hi Frederic

Do we need to support CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=n on ppc64le?

If yes, I got a compiling error after applying your patch set:

arch/powerpc/kernel/time.c:712:2: error: implicit declaration of function 
‘cputime_to_nsecs’ [-Werror=implicit-function-declaration]
  return local_clock() - 
cputime_to_nsecs(kcpustat_this_cpu->cpustat[CPUTIME_STEAL]);

  ^

I thought it was due to CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=n on my ppc64le server.
My kernel config:
# grep -n CONFIG_VIRT_CPU .config
136:CONFIG_VIRT_CPU_ACCOUNTING=y
137:CONFIG_VIRT_CPU_ACCOUNTING_GEN=y

B.R.
Jia

On 30/01/2017 12:46 PM, Frederic Weisbecker wrote:

Ingo,

Please pull the cputime/nsecs-for-tip branch that can be found at:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
cputime/nsecs-for-tip

HEAD: 14d889bef71ff808c450f82bcd257b10f05bb061

The patches are the same as in my previous post: 
https://lwn.net/Articles/712213/
If you apply them by email, just please ignore the very last one
([PATCH 37/37] s390: Prevent from cputime leaks) because we need to find
a better solution with Martin. The branch doesn't have this patch.


--- Summary ---

cputime_t is a type that can map to different time units and
granularities: jiffies, nsecs or architecture clock.

This type and its accessors and mutators have been designed to deal
with all these time units that can vary depending on the kernel config
in order to support a model where the cputime is stored as-is under
the source unit.

The pro of this model is that it avoids expensive conversions from the
source-unit cputime to a more generic type during the accounting hotpath,
especially for configs that have CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y.

Now there are several cons:

* we need to maintain a whole set of cputime_t mutators and accessors
  for all implementations of cputime_t (currently 4 of them). And we
  need such function for every kind of time conversion: to/from
  jiffies, nsecs, usecs, timeval, clock_t, ...

* The core code needs to deal with different possible granularities
  of cputime_t while converting to/from another time unit.
  Especially functions like nsecs_to_cputime() can leak some nsecs
  remainder. This adds more complexity and even sometimes performance
  loss (involving reverse conversion) in order to avoid losing such
  time remainder (eg: irqtime accounting, steal time accounting, posix
  cpu timers, ...).

* Kernel developers are seldom familiar with these granularity issues:
  cputime leaks often appear in patches dealing with cputime code.

* In general cputime_t, as a varying type, is more opaque and harder
  to deal with than static nsecs. Making the core code less readable.

This patchset removes all core use of cputime_t and stores the cputime
into nsecs units. Only s390 and powerpc
(with CONFIG_VIRT_CPU_ACCOUNTING_NATIVE for the latter) now still use
cputime_t.

A lot of code gets simplified, the diffstat is appealing, and some
fastpaths should even see a small performance gain (irqtime and steal
time accounting).

Now let's admit one drawback: s390, and powerpc with
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE, get a new cputime_t-to-nsecs conversion
on the cputime accounting path. But this should be offset by the recent
changes which delay the cputime accounting to the tick and context switch.

Thanks,
Frederic
---

Frederic Weisbecker (36):
   jiffies: Reuse TICK_NSEC instead of NSEC_PER_JIFFY
   time: Introduce jiffies64_to_nsecs()
   sched: Remove unused INIT_CPUTIME macro
   cputime: Convert kcpustat to nsecs
   macintosh/rack-meter: Remove cputime_t internal use
   cputime: Convert guest time accounting to nsecs
   cputime: Special API to return old-typed cputime
   cputime: Convert task/group cputime to nsecs
   alpha: Convert obsolete cputime_t to nsecs
   x86: Convert obsolete cputime type to nsecs
   isdn: Convert obsolete cputime type to nsecs
   binfmt: Convert obsolete cputime type to nsecs
   acct: Convert obsolete cputime type to nsecs
   delaycct: Convert obsolete cputime type to nsecs
   tsacct: Convert obsolete cputime type to nsecs
   signal: Convert obsolete cputime type to nsecs
   cputime: Increment kcpustat directly on irqtime account
   posix-timers: Use TICK_NSEC instead of a dynamically ad-hoc calculated 
version
   posix-timers: Convert internals to use nsecs
   itimer: Convert internal cputime_t units to nsec
   sched: Remove temporary cputime_t accessors
   cputime: Push time to account_user_time() in nsecs
   cputime: Push time to account_steal_time() in nsecs
   cputime: Push time to account_idle_time() in nsecs
   cputime: Push time to account_system_time() in nsecs
   cputime: Complete nsec conversion of tick based accounting
   vtime: Return nsecs instead of cputime_t to account
   cputime: Remove jiffies based cputime
 


Re: [PATCH RFC 2/3] mm, vmscan: limit kswapd loop if no progress is made

2017-01-24 Thread hejianet



On 25/01/2017 12:54 AM, Michal Hocko wrote:

On Tue 24-01-17 15:49:03, Jia He wrote:

Currently there is no hard limitation for kswapd retry times if no progress
is made.


Yes, because the main objective of the kswapd is to balance all memory
zones. So having a hard limit on retries doesn't make any sense.


But do you think that even when it makes no progress at all, kswapd
still needs to run and burn cpu uselessly?


Then kswapd will take 100% cpu for a long time.


Where it is spending time?

I've watched kswapd take 100% cpu for a whole night.




In my test, I tried to allocate 4000 hugepages by:
echo 4000 > /proc/sys/vm/nr_hugepages

Then, kswapd will take 100% cpu for a long time.

The numa layout is:
available: 7 nodes (0-6)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 6611 MB
node 0 free: 1103 MB
node 1 cpus:
node 1 size: 12527 MB
node 1 free: 8477 MB
node 2 cpus:
node 2 size: 15087 MB
node 2 free: 11037 MB
node 3 cpus:
node 3 size: 16111 MB
node 3 free: 12060 MB
node 4 cpus: 8 9 10 11 12 13 14 15
node 4 size: 24815 MB
node 4 free: 20704 MB
node 5 cpus:
node 5 size: 4095 MB
node 5 free: 61 MB
node 6 cpus:
node 6 size: 22750 MB
node 6 free: 18716 MB

The cause is that kswapd will loop for a long time even if there is no
progress in balance_pgdat.


How does this solve anything? If the kswapd just backs off then the more
work has to be done in the direct reclaim context.

What if there is still no progress in the direct reclaim context?

B.R.
Jia



Signed-off-by: Jia He 
---
 mm/vmscan.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 532a2a7..7396a0a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -59,6 +59,7 @@
 #define CREATE_TRACE_POINTS
 #include 

+#define MAX_KSWAPD_RECLAIM_RETRIES 16
 struct scan_control {
/* How many pages shrink_list() should reclaim */
unsigned long nr_to_reclaim;
@@ -3202,7 +3203,8 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
  * or lower is eligible for reclaim until at least one usable zone is
  * balanced.
  */
-static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
+static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx,
+int *did_some_progress)
 {
int i;
unsigned long nr_soft_reclaimed;
@@ -3322,6 +3324,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 * entered the allocator slow path while kswapd was awake, order will
 * remain at the higher level.
 */
+   *did_some_progress = !!(sc.nr_scanned || sc.nr_reclaimed);
return sc.order;
 }

@@ -3417,6 +3420,8 @@ static int kswapd(void *p)
unsigned int alloc_order, reclaim_order, classzone_idx;
pg_data_t *pgdat = (pg_data_t*)p;
struct task_struct *tsk = current;
+   int no_progress_loops = 0;
+   int did_some_progress = 0;

struct reclaim_state reclaim_state = {
.reclaimed_slab = 0,
@@ -3480,9 +3485,23 @@ static int kswapd(void *p)
 */
trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx,
alloc_order);
-   reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
-   if (reclaim_order < alloc_order)
+   reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx,
+                                 &did_some_progress);
+
+   if (reclaim_order < alloc_order) {
+   no_progress_loops = 0;
goto kswapd_try_sleep;
+   }
+
+   if (did_some_progress)
+   no_progress_loops = 0;
+   else
+   no_progress_loops++;
+
+   if (no_progress_loops >= MAX_KSWAPD_RECLAIM_RETRIES) {
+   no_progress_loops = 0;
+   goto kswapd_try_sleep;
+   }

alloc_order = reclaim_order = pgdat->kswapd_order;
classzone_idx = pgdat->kswapd_classzone_idx;
--
2.5.5






Re: [PATCH RFC 3/3] mm, vmscan: correct prepare_kswapd_sleep return value

2017-01-24 Thread hejianet



On 25/01/2017 6:01 AM, Rik van Riel wrote:

On Tue, 2017-01-24 at 15:49 +0800, Jia He wrote:

When there are no reclaimable pages in the zone, even if the zone is
not balanced, we let kswapd go to sleep. That is, prepare_kswapd_sleep
will return true in this case.

Signed-off-by: Jia He 
---
 mm/vmscan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7396a0a..54445e2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3140,7 +3140,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
if (!managed_zone(zone))
continue;

-   if (!zone_balanced(zone, order, classzone_idx))
+   if (!zone_balanced(zone, order, classzone_idx)
+   && !zone_reclaimable_pages(zone))
return false;
}


This patch does the opposite of what your changelog
says.  The above keeps kswapd running forever if
the zone is not balanced, and there are no reclaimable
pages.

Sorry for the mistake, I will check what happened.
I tested it on my local system.

B.R.
Jia





Re: [PATCH RFC 0/3] optimize kswapd when it does reclaim for hugepage

2017-01-24 Thread hejianet

Hi Michal
Thanks for the comments, I will resend the patch as per your
comment after my 2 weeks vacation.

B.R.
Jia

On 25/01/2017 12:46 AM, Michal Hocko wrote:

On Tue 24-01-17 15:49:01, Jia He wrote:

If there is a server with uneven numa memory layout:
available: 7 nodes (0-6)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 6603 MB
node 0 free: 91 MB
node 1 cpus:
node 1 size: 12527 MB
node 1 free: 157 MB
node 2 cpus:
node 2 size: 15087 MB
node 2 free: 189 MB
node 3 cpus:
node 3 size: 16111 MB
node 3 free: 205 MB
node 4 cpus: 8 9 10 11 12 13 14 15
node 4 size: 24815 MB
node 4 free: 310 MB
node 5 cpus:
node 5 size: 4095 MB
node 5 free: 61 MB
node 6 cpus:
node 6 size: 22750 MB
node 6 free: 283 MB
node distances:
node   0   1   2   3   4   5   6
  0:  10  20  40  40  40  40  40
  1:  20  10  40  40  40  40  40
  2:  40  40  10  20  40  40  40
  3:  40  40  20  10  40  40  40
  4:  40  40  40  40  10  20  40
  5:  40  40  40  40  20  10  40
  6:  40  40  40  40  40  40  10

In this case node 5 has less memory, and we will allocate the hugepages
from these nodes one by one after we trigger
echo 4000 > /proc/sys/vm/nr_hugepages

Then the kswapd5 will take 100% cpu for a long time. This is a livelock
issue in kswapd. This patch set fixes it.


It would be really helpful to describe what is the issue and whether it
is specific to the configuration above. Also a highlevel overview of the
fix and why it is the right approach would be appreciated.


The 3rd patch improves the kswapd's bad performance significantly.


Numbers?


Jia He (3):
  mm/hugetlb: split alloc_fresh_huge_page_node into fast and slow path
  mm, vmscan: limit kswapd loop if no progress is made
  mm, vmscan: correct prepare_kswapd_sleep return value

 mm/hugetlb.c |  9 +
 mm/vmscan.c  | 28 
 2 files changed, 33 insertions(+), 4 deletions(-)

--
2.5.5





Re: [PATCH v5 1/2] sysctl: introduce new proc handler proc_dobool

2017-01-04 Thread hejianet



On 05/01/2017 5:09 AM, J. Bruce Fields wrote:

On Thu, Dec 15, 2016 at 03:24:20PM +0800, Jia He wrote:

This is to let a bool variable be correctly displayed in
big/little endian sysctl procfs. sizeof(bool) is arch dependent;
proc_dobool should work on all arches.

Did Alexey Dobriyan agree that this dealt with his objections?

Hi Alexey, any thoughts? ;-)
Thanks

B.R.
Jia

Also it'd be nice to have a sign-off from somebody that's actually
touched sysctl.c.  Then I could take it through the nfsd tree if that
doesn't cause anyone trouble.

--b.


Suggested-by: Pan Xinhui 
Signed-off-by: Jia He 
---
  include/linux/sysctl.h |  2 ++
  kernel/sysctl.c| 41 +
  2 files changed, 43 insertions(+)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index adf4e51..255a9c7 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -41,6 +41,8 @@ typedef int proc_handler (struct ctl_table *ctl, int write,
  
  extern int proc_dostring(struct ctl_table *, int,

 void __user *, size_t *, loff_t *);
+extern int proc_dobool(struct ctl_table *, int,
+   void __user *, size_t *, loff_t *);
  extern int proc_dointvec(struct ctl_table *, int,
 void __user *, size_t *, loff_t *);
  extern int proc_douintvec(struct ctl_table *, int,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 706309f..c4bec65 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2112,6 +2112,20 @@ static int proc_put_char(void __user **buf, size_t *size, char c)
return 0;
  }
  
+static int do_proc_dobool_conv(bool *negp, unsigned long *lvalp,

+   int *valp,
+   int write, void *data)
+{
+   if (write)
+   *(bool *)valp = *lvalp;
+   else {
+   int val = *(bool *)valp;
+
+   *lvalp = (unsigned long)val;
+   }
+   return 0;
+}
+
  static int do_proc_dointvec_conv(bool *negp, unsigned long *lvalp,
 int *valp,
 int write, void *data)
@@ -2258,6 +2272,26 @@ static int do_proc_dointvec(struct ctl_table *table, int write,
  }
  
  /**

+ * proc_dobool - read/write a bool
+ * @table: the sysctl table
+ * @write: %TRUE if this is a write to the sysctl file
+ * @buffer: the user buffer
+ * @lenp: the size of the user buffer
+ * @ppos: file position
+ *
+ * Reads/writes up to table->maxlen/sizeof(unsigned int) integer
+ * values from/to the user buffer, treated as an ASCII string.
+ *
+ * Returns 0 on success.
+ */
+int proc_dobool(struct ctl_table *table, int write,
+   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+   return do_proc_dointvec(table, write, buffer, lenp, ppos,
+   do_proc_dobool_conv, NULL);
+}
+
+/**
   * proc_dointvec - read a vector of integers
   * @table: the sysctl table
   * @write: %TRUE if this is a write to the sysctl file
@@ -2885,6 +2919,12 @@ int proc_dostring(struct ctl_table *table, int write,
return -ENOSYS;
  }
  
+int proc_dobool(struct ctl_table *table, int write,

+  void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+return -ENOSYS;
+}
+
  int proc_dointvec(struct ctl_table *table, int write,
  void __user *buffer, size_t *lenp, loff_t *ppos)
  {
@@ -2941,6 +2981,7 @@ int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int write,
   * No sense putting this after each symbol definition, twice,
   * exception granted :-)
   */
+EXPORT_SYMBOL(proc_dobool);
  EXPORT_SYMBOL(proc_dointvec);
  EXPORT_SYMBOL(proc_douintvec);
  EXPORT_SYMBOL(proc_dointvec_jiffies);
--
2.5.5




Re: [PATCH v5 0/2] change the proc handler for nsm_use_hostnames

2017-01-03 Thread hejianet

Ping, any comments are welcome. Thanks

B.R.

Jia


On 15/12/2016 3:24 PM, Jia He wrote:

nsm_use_hostnames is a module parameter and it will be exported to sysctl
procfs. This is to let users change it from userspace when needed. But the
minimal unit for a sysctl procfs read/write is sizeof(int).
On big endian systems, converting between bool and int will cause
errors for proc items.

This patch introduces a new proc handler proc_dobool for nsm_use_hostnames.

Changes:
v5: Fix compilation error when CONFIG_PROC_SYSCTL is not set
v4: Change (u8 *) to (bool *)
v3: Introduce a new proc handler proc_dou8vec(suggested by Xinhui Pan)
v2: Change extern type in lockd.h

The test case I used:
/***/
#include 
#include 
#include 

bool __read_mostly nsm_use_hostnames;
module_param(nsm_use_hostnames, bool, 0644);

static struct ctl_table my_sysctl[] = {
 {
 .procname   = "nsm_use_hostnames",
 .data   = &nsm_use_hostnames,
 .maxlen = sizeof(int),
 .mode   = 0644,
 .proc_handler   = &proc_dointvec,
 },
 {}
};

static struct ctl_table my_root[] = {
 {
 .procname   = "mysysctl",
 .mode   = 0555,
 .child  = my_sysctl,
 },
 {}
};

static struct ctl_table_header * my_ctl_header;

static int __init sysctl_exam_init(void)
{
 my_ctl_header = register_sysctl_table(&my_root);
 if (my_ctl_header == NULL)
 printk("error register sysctl");

 return 0;
}

static void __exit sysctl_exam_exit(void)
{
 unregister_sysctl_table(my_ctl_header);
}

module_init(sysctl_exam_init);
module_exit(sysctl_exam_exit);
MODULE_LICENSE("GPL");
//

[root@bigendian my]# insmod -f /root/my/hello.ko nsm_use_hostnames=1
[root@bigendian my]# cat /proc/sys/mysysctl/nsm_use_hostnames
16777216

After I change the proc_dointvec to new handler proc_dou8vec with the
patch:
[root@bigendian my]# insmod -f /root/my/hello.ko nsm_use_hostnames=1
[root@bigendian my]# cat /proc/sys/mysysctl/nsm_use_hostnames
1

On little endian systems, there is no such issue.
Already tested on both big and little endian (ppc64 and ppc64le).

Jia He (2):
   sysctl: introduce new proc handler proc_dobool
   lockd: change the proc_handler for nsm_use_hostnames

  fs/lockd/svc.c |  2 +-
  include/linux/sysctl.h |  2 ++
  kernel/sysctl.c| 41 +
  3 files changed, 44 insertions(+), 1 deletion(-)




Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data

2016-12-20 Thread hejianet



On 20/12/2016 8:31 PM, Mel Gorman wrote:

On Mon, Dec 12, 2016 at 01:59:07PM +0800, Jia He wrote:

In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
zone_statistics"), the code was restructured to reduce the branch miss rate.
Compared with the original logic, it assumed that if !(flag & __GFP_OTHER_NODE),
z->node would not be equal to preferred_zone->node. That seems to be
incorrect.

Fixes: commit b9f00e147f27 ("mm, page_alloc: reduce branches in
zone_statistics")

Signed-off-by: Jia He 

This is slightly curious. It appear it would only occur if a process was
running on a node that was outside the memory policy. Can you confirm
that is the case?

Yes, here is what I caught:

dumpstack() is triggered when z->node(5) == preferred_zone->node(5) and
z->node != numa_node_id(4), without the flag __GFP_OTHER_NODE.

It is not a rare case. I saw hundreds of such cases when kernel boots up
[c00cdcaef440] [c02e88cc] cache_grow_begin+0xcc/0x500
[c00cdcaef6f0] [c02ecb44] do_tune_cpucache+0x64/0x100
[c00cdcaef750] [c02ecc7c] enable_cpucache+0x9c/0x180
[c00cdcaef7d0] [c02ed01c] __kmem_cache_create+0x1ec/0x2c0
[c00cdcaef820] [c0291c98] create_cache+0xb8/0x240
[c00cdcaef890] [c0291fa8] kmem_cache_create+0x188/0x290
[c00cdcaef950] [d00011dc5c70] ext4_mb_init+0x3c0/0x5e0 [ext4]
[c00cdcaef9f0] [d00011daaedc] ext4_fill_super+0x266c/0x3390 [ext4]
[c00cdcaefb30] [c0328b8c] mount_bdev+0x22c/0x260
[c00cdcaefbd0] [d00011da1fa8] ext4_mount+0x48/0x60 [ext4]
[c00cdcaefc10] [c032a11c] mount_fs+0x8c/0x230
[c00cdcaefcb0] [c0351f98] vfs_kern_mount+0x78/0x180
[c00cdcaefd00] [c0356d68] do_mount+0x258/0xea0
[c00cdcaefde0] [c0357da0] SyS_mount+0xa0/0x110
[c00cdcaefe30] [c000bd84] system_call+0x38/0xe0



If so, your patch is a semantic curiosity because it's actually
impossible for a NUMA allocation to be local and the definition of "HIT"
is fuzzy enough to be useless.

I won't object to the patch but it makes me trust "hit" even less than I
already do for any analysis.

Note that after this mail that I'll be unavailable by mail until early
new years.





Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data

2016-12-20 Thread hejianet



On 20/12/2016 5:18 PM, Michal Hocko wrote:

On Mon 12-12-16 13:59:07, Jia He wrote:

In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
zone_statistics"), the code was restructured to reduce the branch miss rate.
Compared with the original logic, it assumed that if !(flag & __GFP_OTHER_NODE),
z->node would not be equal to preferred_zone->node. That seems to be
incorrect.

I am sorry but I have hard time following the changelog. It is clear
that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
but it is not really clear when such thing happens. You are adding
preferred_zone->node check. preferred_zone is the first zone in the
requested zonelist. So for the most allocations it is a node from the
local node. But if something request an explicit numa node (without
__GFP_OTHER_NODE which would be the majority I suspect) then we could
indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
referenced patch indeed caused an unintended change of accounting AFAIU.

If this is correct then it should be a part of the changelog. I also
cannot say I would like the fix. First of all I am not sure
__GFP_OTHER_NODE is a good idea at all. How is an explicit usage of the
flag any different from an explicit __alloc_pages_node(non_local_nid)?
In both cases we ask for an allocation on a remote node and successful
allocation is a NUMA_HIT and NUMA_OTHER.

That being said, why cannot we simply do the following? As a bonus, we
can get rid of a barely used __GFP_OTHER_NODE. Also the number of
branches will stay same.

Yes, I agree we can maybe get rid of __GFP_OTHER_NODE if there are no objections.
It seems it is currently only used for hugepages and statistics.

---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 429855be6ec9..f035d5c8b864 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2583,25 +2583,17 @@ int __isolate_free_page(struct page *page, unsigned int order)
   * Update NUMA hit/miss statistics
   *
   * Must be called with interrupts disabled.
- *
- * When __GFP_OTHER_NODE is set assume the node of the preferred
- * zone is the local node. This is useful for daemons who allocate
- * memory on behalf of other processes.
   */
  static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
					gfp_t flags)
  {
  #ifdef CONFIG_NUMA
-   int local_nid = numa_node_id();
-   enum zone_stat_item local_stat = NUMA_LOCAL;
-
-   if (unlikely(flags & __GFP_OTHER_NODE)) {
-   local_stat = NUMA_OTHER;
-   local_nid = preferred_zone->node;
-   }
+   if (z->node == preferred_zone->node) {
+   enum zone_stat_item local_stat = NUMA_LOCAL;
  
-	if (z->node == local_nid) {

__inc_zone_state(z, NUMA_HIT);
+   if (z->node != numa_node_id())
+   local_stat = NUMA_OTHER;
__inc_zone_state(z, local_stat);
} else {
__inc_zone_state(z, NUMA_MISS);

I thought the logic here is different
Here is the zone_statistics() before introducing __GFP_OTHER_NODE:

if (z->zone_pgdat == preferred_zone->zone_pgdat) {
__inc_zone_state(z, NUMA_HIT);
} else {
__inc_zone_state(z, NUMA_MISS);
__inc_zone_state(preferred_zone, NUMA_FOREIGN);
}
if (z->node == numa_node_id())
__inc_zone_state(z, NUMA_LOCAL);
else
__inc_zone_state(z, NUMA_OTHER);

B.R.
Jia


Re: [RESEND PATCH v4 1/2] sysctl: introduce new proc handler proc_dobool

2016-12-14 Thread hejianet

Thanks, this error is caused by

# CONFIG_PROC_SYSCTL is not set

Will be fixed in the next version.

Jia

B.R.


On 12/14/16 2:13 PM, kbuild test robot wrote:

Hi Jia,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.9 next-20161214]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Jia-He/sysctl-introduce-new-proc-handler-proc_dobool/20161214-112656
config: x86_64-randconfig-n0-12141159 (attached as .config)
compiler: gcc-4.8 (Debian 4.8.4-1) 4.8.4
reproduce:
 # save the attached .config to linux build tree
 make ARCH=x86_64

All errors (new ones prefixed by >>):


kernel/built-in.o:(___ksymtab+proc_dobool+0x0): undefined reference to `proc_dobool'

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH v2 1/1] lockd: Change nsm_use_hostnames from bool to u32

2016-12-11 Thread hejianet


Hi Xinhui
Thanks, it really works.
Will send out V3 soon afterwards

B.R.
Jia

On 12/12/16 1:43 AM, Pan Xinhui wrote:


hi, jia
nice catch!
However I think we should fix it totally.
This is because do_proc_dointvec_conv() tries to get an int value from a bool *.

Something like below might help. Please ignore the code style; this is tested :)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index fc4084e..7eeaee4 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -519,6 +519,8 @@ EXPORT_SYMBOL_GPL(lockd_down);
  * Sysctl parameters (same as module parameters, different interface).
  */

+int proc_dou8vec(struct ctl_table *table, int write,
+		 void __user *buffer, size_t *lenp, loff_t *ppos);

 static struct ctl_table nlm_sysctls[] = {
{
.procname   = "nlm_grace_period",
@@ -561,7 +563,7 @@ static struct ctl_table nlm_sysctls[] = {
 .data   = &nsm_use_hostnames,
.maxlen = sizeof(int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = proc_dou8vec,
},
{
.procname   = "nsm_local_state",
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 706309f..6307737 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2112,6 +2112,30 @@ static int proc_put_char(void __user **buf, size_t *size, char c)

return 0;
 }

+
+static int do_proc_dou8vec_conv(bool *negp, unsigned long *lvalp,
+u8 *valp,
+int write, void *data)
+{
+   if (write) {
+   if (*negp) {
+   *valp = -*lvalp;
+   } else {
+   *valp = *lvalp;
+   }
+   } else {
+   int val = *valp;
+   if (val < 0) {
+   *negp = true;
+   *lvalp = -(unsigned long)val;
+   } else {
+   *negp = false;
+   *lvalp = (unsigned long)val;
+   }
+   }
+   return 0;
+}
+
 static int do_proc_dointvec_conv(bool *negp, unsigned long *lvalp,
 int *valp,
 int write, void *data)
@@ -2296,6 +2320,14 @@ int proc_douintvec(struct ctl_table *table, int write,
do_proc_douintvec_conv, NULL);
 }

+int proc_dou8vec(struct ctl_table *table, int write,
+void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+   return do_proc_dointvec(table, write, buffer, lenp, ppos,
+   do_proc_dou8vec_conv, NULL);
+}
+
+

On 2016/12/11 23:36, Jia He wrote:

nsm_use_hostnames is a module parameter which is also exported to sysctl
procfs, so that users can change it from userspace at runtime. But the
minimal unit for a sysctl procfs read/write is sizeof(int).
On big-endian systems, converting between bool and int therefore corrupts
the value for proc items.

This patch changes the type definition of nsm_use_hostnames.

V2: Changes extern type in lockd.h
Signed-off-by: Jia He 
---
 fs/lockd/mon.c  | 2 +-
 fs/lockd/svc.c  | 2 +-
 include/linux/lockd/lockd.h | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/lockd/mon.c b/fs/lockd/mon.c
index 19166d4..3e7ff4d 100644
--- a/fs/lockd/mon.c
+++ b/fs/lockd/mon.c
@@ -57,7 +57,7 @@ static DEFINE_SPINLOCK(nsm_lock);
  * Local NSM state
  */
 u32  __read_mostly  nsm_local_state;
-bool __read_mostly  nsm_use_hostnames;
+u32  __read_mostly  nsm_use_hostnames;

 static inline struct sockaddr *nsm_addr(const struct nsm_handle *nsm)
 {
diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index fc4084e..308033d 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -658,7 +658,7 @@ module_param_call(nlm_udpport, param_set_port, param_get_int,
 		  &nlm_udpport, 0644);
 module_param_call(nlm_tcpport, param_set_port, param_get_int,
 		  &nlm_tcpport, 0644);
-module_param(nsm_use_hostnames, bool, 0644);
+module_param(nsm_use_hostnames, u32, 0644);
 module_param(nlm_max_connections, uint, 0644);

 static int lockd_init_net(struct net *net)
diff --git a/include/linux/lockd/lockd.h b/include/linux/lockd/lockd.h
index c153738..db52152 100644
--- a/include/linux/lockd/lockd.h
+++ b/include/linux/lockd/lockd.h
@@ -196,7 +196,7 @@ extern struct svc_procedure nlmsvc_procedures4[];
 #endif
 extern int            nlmsvc_grace_period;
 extern unsigned long  nlmsvc_timeout;
-extern bool           nsm_use_hostnames;
+extern u32            nsm_use_hostnames;
 extern u32            nsm_local_state;

 /*








Re: [PATCH v5 0/7] Reduce cache miss for snmp_fold_field

2016-09-28 Thread hejianet



On 9/28/16 5:08 PM, David Miller wrote:

From: Jia He 
Date: Wed, 28 Sep 2016 14:22:21 +0800


v5:
- order local variables from longest to shortest line

I still see many cases where this problem still exists.  Please
do not resubmit this patch series until you fix all of them.

Patch #2:

-static int snmp_seq_show(struct seq_file *seq, void *v)
+static int snmp_seq_show_ipstats(struct seq_file *seq, void *v)
  {
int i;
+   u64 buff64[IPSTATS_MIB_MAX];
struct net *net = seq->private;

The order should be "net" then "buff64" then "i".

Sorry for my bad eyesight and quick hand :(
B.R.
Jia


+static int snmp_seq_show_tcp_udp(struct seq_file *seq, void *v)
+{
+   int i;
+   struct net *net = seq->private;
+   unsigned long buff[TCPUDP_MIB_MAX];

The order should be "buff", "net", then "i".

Patch #3:

@@ -192,13 +197,19 @@ static void snmp6_seq_show_item(struct seq_file *seq, 
void __percpu *pcpumib,
const struct snmp_mib *itemlist)
  {
int i;
-   unsigned long val;
-
  ...
+   unsigned long buff[SNMP_MIB_MAX];

The order should be "buff" then "i".

@@ -206,10 +217,13 @@ static void snmp6_seq_show_item64(struct seq_file *seq, 
void __percpu *mib,
  const struct snmp_mib *itemlist, size_t 
syncpoff)
  {
int i;
+   u64 buff64[SNMP_MIB_MAX];

Likewise.

I cannot be any more explicit in my request than this.





Re: [RFC PATCH v3 2/7] proc: Reduce cache miss in {snmp,netstat}_seq_show

2016-09-21 Thread hejianet



On 9/22/16 2:24 AM, Marcelo wrote:

On Thu, Sep 22, 2016 at 12:18:46AM +0800, hejianet wrote:

Hi Marcelo

sorry for the late reply, just came back from a vacation.

Hi, no problem. Hope your batteries are recharged now :-)


On 9/14/16 7:55 PM, Marcelo wrote:

Hi Jia,

On Wed, Sep 14, 2016 at 01:58:42PM +0800, hejianet wrote:

Hi Marcelo


On 9/13/16 2:57 AM, Marcelo wrote:

On Fri, Sep 09, 2016 at 02:33:57PM +0800, Jia He wrote:

This is to use the generic interface snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.
Then snmp_seq_show and netstat_seq_show are split into 2 parts to avoid the
build warning that "the frame size" is larger than 1024 on s390.

Yeah about that, did you test it with stack overflow detection?
These arrays can be quite large.

One more below..

Do you think it is acceptable if the stack usage is a little larger than 1024?
e.g. 1120
I can't find any other way to reduce the stack usage except use "static" before
unsigned long buff[TCP_MIB_MAX]

PS. sizeof buff is about TCP_MIB_MAX(116)*8=928
B.R.

That's pretty much the question. Linux has the option on some archs to
run with 4Kb (4KSTACKS option), so this function alone would be using
25% of it in this last case. While on x86_64, it uses 16Kb (6538b8ea886e
("x86_64: expand kernel stack to 16K")).

Adding static to it is not an option as it actually makes the variable
shared amongst the CPUs (and then you have concurrency issues), plus the
fact that it's always allocated, even while not in use.

Others here certainly know better than me if it's okay to make such
usage of the stack.

What about this patch instead?
It is a trade-off. I split the aggregation process into 2 parts; it will
increase the cache misses a little, but it reduces the stack usage.
After this, stack usage is 672 bytes:
objdump -d vmlinux | ./scripts/checkstack.pl ppc64 | grep seq_show
0xc07f7cc0 netstat_seq_show_tcpext.isra.3 [vmlinux]:672

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index c6ee8a2..cc41590 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -486,22 +486,37 @@ static const struct file_operations snmp_seq_fops = {
   */
  static int netstat_seq_show_tcpext(struct seq_file *seq, void *v)
  {
-   int i;
-   unsigned long buff[LINUX_MIB_MAX];
+   int i, c;
+   unsigned long buff[LINUX_MIB_MAX/2 + 1];
 struct net *net = seq->private;

-   memset(buff, 0, sizeof(unsigned long) * LINUX_MIB_MAX);
+   memset(buff, 0, sizeof(unsigned long) * (LINUX_MIB_MAX/2 + 1));

 seq_puts(seq, "TcpExt:");
 for (i = 0; snmp4_net_list[i].name; i++)
 seq_printf(seq, " %s", snmp4_net_list[i].name);

 seq_puts(seq, "\nTcpExt:");
-   snmp_get_cpu_field_batch(buff, snmp4_net_list,
-net->mib.net_statistics);
-   for (i = 0; snmp4_net_list[i].name; i++)
+   for_each_possible_cpu(c) {
+   for (i = 0; i < LINUX_MIB_MAX/2; i++)
+   buff[i] += snmp_get_cpu_field(
+ net->mib.net_statistics,
+   c, snmp4_net_list[i].entry);
+   }
+   for (i = 0; i < LINUX_MIB_MAX/2; i++)
 seq_printf(seq, " %lu", buff[i]);

+   memset(buff, 0, sizeof(unsigned long) * (LINUX_MIB_MAX/2 + 1));
+   for_each_possible_cpu(c) {
+   for (i = LINUX_MIB_MAX/2; snmp4_net_list[i].name; i++)
+   buff[i - LINUX_MIB_MAX/2] += snmp_get_cpu_field(
+   net->mib.net_statistics,
+   c,
+   snmp4_net_list[i].entry);
+   }
+for (i = LINUX_MIB_MAX/2; snmp4_net_list[i].name; i++)
+seq_printf(seq, " %lu", buff[i - LINUX_MIB_MAX/2]);
+
 return 0;
  }

Yep, it halves the stack usage, but it doesn't look good heh

But well, you may try to post the patchset (with or without this last
change, you pick) officially and see how it goes. As you're posting as
RFC, it's not being evaluated as seriously.

Thanks for the suggestion, I will remove it in a future patch version.


FWIW, I tested your patches, using your test and the /proc/net/snmp file on
an x86_64 box, Intel(R) Xeon(R) CPU E5-2643 v3.

Before the patches:

  Performance counter stats for './test /proc/net/snmp':

  5.225  cache-misses
 12.708.673.785  L1-dcache-loads
  1.288.450.174  L1-dcache-load-misses #   10,14% of all L1-dcache hits
  1.271.857.028  LLC-loads
          4.122  LLC-load-misses       #    0,00% of all LL-cache hits

9,174936524 seconds time elapsed

After:

  Performance counter stats for './test /proc/net/snmp':

  2.865  cache-misses
 30.203.883.807  L1-dcache-loads
  1.215.774.643  L1-dcache-load-misses #    4,03% of all L1-dcache hits

Re: [RFC PATCH v3 2/7] proc: Reduce cache miss in {snmp,netstat}_seq_show

2016-09-21 Thread hejianet

Hi Marcelo

sorry for the late reply, just came back from a vacation.

On 9/14/16 7:55 PM, Marcelo wrote:

Hi Jia,

On Wed, Sep 14, 2016 at 01:58:42PM +0800, hejianet wrote:

Hi Marcelo


On 9/13/16 2:57 AM, Marcelo wrote:

On Fri, Sep 09, 2016 at 02:33:57PM +0800, Jia He wrote:

This is to use the generic interface snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.
Then snmp_seq_show and netstat_seq_show are split into 2 parts to avoid the
build warning that "the frame size" is larger than 1024 on s390.

Yeah about that, did you test it with stack overflow detection?
These arrays can be quite large.

One more below..

Do you think it is acceptable if the stack usage is a little larger than 1024?
e.g. 1120
I can't find any other way to reduce the stack usage except use "static" before
unsigned long buff[TCP_MIB_MAX]

PS. sizeof buff is about TCP_MIB_MAX(116)*8=928
B.R.

That's pretty much the question. Linux has the option on some archs to
run with 4Kb (4KSTACKS option), so this function alone would be using
25% of it in this last case. While on x86_64, it uses 16Kb (6538b8ea886e
("x86_64: expand kernel stack to 16K")).

Adding static to it is not an option as it actually makes the variable
shared amongst the CPUs (and then you have concurrency issues), plus the
fact that it's always allocated, even while not in use.

Others here certainly know better than me if it's okay to make such
usage of the stack.

What about this patch instead?
It is a trade-off. I split the aggregation process into 2 parts; it will
increase the cache misses a little, but it reduces the stack usage.
After this, stack usage is 672 bytes:
objdump -d vmlinux | ./scripts/checkstack.pl ppc64 | grep seq_show
0xc07f7cc0 netstat_seq_show_tcpext.isra.3 [vmlinux]:672

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index c6ee8a2..cc41590 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -486,22 +486,37 @@ static const struct file_operations snmp_seq_fops = {
  */
 static int netstat_seq_show_tcpext(struct seq_file *seq, void *v)
 {
-   int i;
-   unsigned long buff[LINUX_MIB_MAX];
+   int i, c;
+   unsigned long buff[LINUX_MIB_MAX/2 + 1];
struct net *net = seq->private;

-   memset(buff, 0, sizeof(unsigned long) * LINUX_MIB_MAX);
+   memset(buff, 0, sizeof(unsigned long) * (LINUX_MIB_MAX/2 + 1));

seq_puts(seq, "TcpExt:");
for (i = 0; snmp4_net_list[i].name; i++)
seq_printf(seq, " %s", snmp4_net_list[i].name);

seq_puts(seq, "\nTcpExt:");
-   snmp_get_cpu_field_batch(buff, snmp4_net_list,
-net->mib.net_statistics);
-   for (i = 0; snmp4_net_list[i].name; i++)
+   for_each_possible_cpu(c) {
+   for (i = 0; i < LINUX_MIB_MAX/2; i++)
+   buff[i] += snmp_get_cpu_field(
+ net->mib.net_statistics,
+   c, snmp4_net_list[i].entry);
+   }
+   for (i = 0; i < LINUX_MIB_MAX/2; i++)
seq_printf(seq, " %lu", buff[i]);

+   memset(buff, 0, sizeof(unsigned long) * (LINUX_MIB_MAX/2 + 1));
+   for_each_possible_cpu(c) {
+   for (i = LINUX_MIB_MAX/2; snmp4_net_list[i].name; i++)
+   buff[i - LINUX_MIB_MAX/2] += snmp_get_cpu_field(
+   net->mib.net_statistics,
+   c,
+   snmp4_net_list[i].entry);
+   }
+for (i = LINUX_MIB_MAX/2; snmp4_net_list[i].name; i++)
+seq_printf(seq, " %lu", buff[i - LINUX_MIB_MAX/2]);
+
return 0;
 }


+static int netstat_seq_show_ipext(struct seq_file *seq, void *v)
+{
+   int i;
+   u64 buff64[IPSTATS_MIB_MAX];
+   struct net *net = seq->private;
seq_puts(seq, "\nIpExt:");
for (i = 0; snmp4_ipextstats_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_ipextstats_list[i].name);
seq_puts(seq, "\nIpExt:");

You're missing a memset() call here.

Not sure if you missed this one or not..

indeed, thanks
B.R.
Jia

Thanks,
Marcelo





Re: [RFC PATCH v3 2/7] proc: Reduce cache miss in {snmp,netstat}_seq_show

2016-09-13 Thread hejianet

Hi Marcelo


On 9/13/16 2:57 AM, Marcelo wrote:

On Fri, Sep 09, 2016 at 02:33:57PM +0800, Jia He wrote:

This is to use the generic interface snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.
Then snmp_seq_show and netstat_seq_show are split into 2 parts to avoid the
build warning that "the frame size" is larger than 1024 on s390.

Yeah about that, did you test it with stack overflow detection?
These arrays can be quite large.

One more below..

Do you think it is acceptable if the stack usage is a little larger than 1024?
e.g. 1120
I can't find any other way to reduce the stack usage except use "static" before
unsigned long buff[TCP_MIB_MAX]

PS. sizeof buff is about TCP_MIB_MAX(116)*8=928
B.R.

Signed-off-by: Jia He 
---
  net/ipv4/proc.c | 106 +++-
  1 file changed, 74 insertions(+), 32 deletions(-)

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 9f665b6..c6fc80e 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -46,6 +46,8 @@
  #include 
  #include 
  
+#define TCPUDP_MIB_MAX max_t(u32, UDP_MIB_MAX, TCP_MIB_MAX)
+
  /*
   *Report socket allocation statistics [m...@utu.fi]
   */
@@ -378,13 +380,15 @@ static void icmp_put(struct seq_file *seq)
  /*
   *Called from the PROCfs module. This outputs /proc/net/snmp.
   */
-static int snmp_seq_show(struct seq_file *seq, void *v)
+static int snmp_seq_show_ipstats(struct seq_file *seq, void *v)
  {
int i;
+   u64 buff64[IPSTATS_MIB_MAX];
struct net *net = seq->private;
  
-	seq_puts(seq, "Ip: Forwarding DefaultTTL");

+   memset(buff64, 0, IPSTATS_MIB_MAX * sizeof(u64));
  
+	seq_puts(seq, "Ip: Forwarding DefaultTTL");

for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_ipstats_list[i].name);
  
@@ -393,57 +397,77 @@ static int snmp_seq_show(struct seq_file *seq, void *v)

   net->ipv4.sysctl_ip_default_ttl);
  
  	BUILD_BUG_ON(offsetof(struct ipstats_mib, mibs) != 0);

+   snmp_get_cpu_field64_batch(buff64, snmp4_ipstats_list,
+  net->mib.ip_statistics,
+  offsetof(struct ipstats_mib, syncp));
for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
-   seq_printf(seq, " %llu",
-  snmp_fold_field64(net->mib.ip_statistics,
-snmp4_ipstats_list[i].entry,
-offsetof(struct ipstats_mib, 
syncp)));
+   seq_printf(seq, " %llu", buff64[i]);
  
-	icmp_put(seq);	/* RFC 2011 compatibility */

-   icmpmsg_put(seq);
+   return 0;
+}
+
+static int snmp_seq_show_tcp_udp(struct seq_file *seq, void *v)
+{
+   int i;
+   unsigned long buff[TCPUDP_MIB_MAX];
+   struct net *net = seq->private;
+
+   memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
  
  	seq_puts(seq, "\nTcp:");

for (i = 0; snmp4_tcp_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_tcp_list[i].name);
  
  	seq_puts(seq, "\nTcp:");

+   snmp_get_cpu_field_batch(buff, snmp4_tcp_list,
+net->mib.tcp_statistics);
for (i = 0; snmp4_tcp_list[i].name != NULL; i++) {
/* MaxConn field is signed, RFC 2012 */
if (snmp4_tcp_list[i].entry == TCP_MIB_MAXCONN)
-   seq_printf(seq, " %ld",
-  snmp_fold_field(net->mib.tcp_statistics,
-  snmp4_tcp_list[i].entry));
+   seq_printf(seq, " %ld", buff[i]);
else
-   seq_printf(seq, " %lu",
-  snmp_fold_field(net->mib.tcp_statistics,
-  snmp4_tcp_list[i].entry));
+   seq_printf(seq, " %lu", buff[i]);
}
  
+	memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));

+
+   snmp_get_cpu_field_batch(buff, snmp4_udp_list,
+net->mib.udp_statistics);
seq_puts(seq, "\nUdp:");
for (i = 0; snmp4_udp_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_udp_list[i].name);
-
seq_puts(seq, "\nUdp:");
for (i = 0; snmp4_udp_list[i].name != NULL; i++)
-   seq_printf(seq, " %lu",
-  snmp_fold_field(net->mib.udp_statistics,
-  snmp4_udp_list[i].entry));
+   seq_printf(seq, " %lu", buff[i]);
+
+   memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
  
  	/* the UDP and UDP-Lite MIBs are the same */

seq_puts(seq, "\nUdpLite:");
+   snmp_get_cpu_field_batch(buff, snmp4_udp_list,
+net->mib.udplite_statistics);
for (i = 0; snmp4_udp_list[i].name != NULL; i++)
  

Re: [RFC PATCH v3 3/7] proc: Reduce cache miss in snmp6_seq_show

2016-09-13 Thread hejianet



On 9/13/16 3:05 AM, Marcelo wrote:

On Fri, Sep 09, 2016 at 02:33:58PM +0800, Jia He wrote:

This is to use the generic interface snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.

Signed-off-by: Jia He 
---
  net/ipv6/proc.c | 32 +++-
  1 file changed, 23 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
index 679253d0..50ba2c3 100644
--- a/net/ipv6/proc.c
+++ b/net/ipv6/proc.c
@@ -30,6 +30,11 @@
  #include 
  #include 
  
+#define MAX4(a, b, c, d) \

+   max_t(u32, max_t(u32, a, b), max_t(u32, c, d))
+#define SNMP_MIB_MAX MAX4(UDP_MIB_MAX, TCP_MIB_MAX, \
+   IPSTATS_MIB_MAX, ICMP_MIB_MAX)
+
  static int sockstat6_seq_show(struct seq_file *seq, void *v)
  {
struct net *net = seq->private;
@@ -192,13 +197,19 @@ static void snmp6_seq_show_item(struct seq_file *seq, 
void __percpu *pcpumib,
const struct snmp_mib *itemlist)
  {
int i;
-   unsigned long val;
-
-   for (i = 0; itemlist[i].name; i++) {
-   val = pcpumib ?
-   snmp_fold_field(pcpumib, itemlist[i].entry) :
-   atomic_long_read(smib + itemlist[i].entry);
-   seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name, val);
+   unsigned long buff[SNMP_MIB_MAX];
+
+   memset(buff, 0, sizeof(unsigned long) * SNMP_MIB_MAX);

This memset() could be moved...


+
+   if (pcpumib) {

... here, so it's not executed if it hits the else block.

Thanks for the suggestion
B.R.
Jia

+   snmp_get_cpu_field_batch(buff, itemlist, pcpumib);
+   for (i = 0; itemlist[i].name; i++)
+   seq_printf(seq, "%-32s\t%lu\n",
+  itemlist[i].name, buff[i]);
+   } else {
+   for (i = 0; itemlist[i].name; i++)
+   seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name,
+  atomic_long_read(smib + itemlist[i].entry));
}
  }
  
@@ -206,10 +217,13 @@ static void snmp6_seq_show_item64(struct seq_file *seq, void __percpu *mib,

  const struct snmp_mib *itemlist, size_t 
syncpoff)
  {
int i;
+   u64 buff64[SNMP_MIB_MAX];
+
+   memset(buff64, 0, sizeof(unsigned long) * SNMP_MIB_MAX);
  
+	snmp_get_cpu_field64_batch(buff64, itemlist, mib, syncpoff);

for (i = 0; itemlist[i].name; i++)
-   seq_printf(seq, "%-32s\t%llu\n", itemlist[i].name,
-  snmp_fold_field64(mib, itemlist[i].entry, syncpoff));
+   seq_printf(seq, "%-32s\t%llu\n", itemlist[i].name, buff64[i]);
  }
  
  static int snmp6_seq_show(struct seq_file *seq, void *v)

--
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





Re: [RFC PATCH v3 2/7] proc: Reduce cache miss in {snmp,netstat}_seq_show

2016-09-13 Thread hejianet

Hi Marcelo


On 9/13/16 2:57 AM, Marcelo wrote:

On Fri, Sep 09, 2016 at 02:33:57PM +0800, Jia He wrote:

This is to use the generic interface snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.
Then snmp_seq_show and netstat_seq_show are split into 2 parts to avoid build
warning "the frame size" larger than 1024 on s390.

Yeah about that, did you test it with stack overflow detection?
These arrays can be quite large.

One more below..

I found scripts/checkstack.pl could analyze the stack usage statically.
[root@tian-lp1 kernel]# objdump -d vmlinux | scripts/checkstack.pl ppc64 | grep seq
0xc07d4b18 netstat_seq_show_tcpext.isra.7 [vmlinux]:1120
0xc07ccbe8 fib_triestat_seq_show [vmlinux]: 496
0xc083e7a4 tcp6_seq_show [vmlinux]: 480
0xc07d4908 snmp_seq_show_ipstats.isra.6 [vmlinux]:464
0xc07d4d18 netstat_seq_show_ipext.isra.8 [vmlinux]:464
0xc06f5bd8 proto_seq_show [vmlinux]:416
0xc07f5718 xfrm_statistics_seq_show [vmlinux]:  416
0xc07405b4 dev_seq_printf_stats [vmlinux]:  400

seems the stack usage in netstat_seq_show_tcpext is too big.
Will consider how to reduce it

B.R.
Jia

Signed-off-by: Jia He 
---
  net/ipv4/proc.c | 106 +++-
  1 file changed, 74 insertions(+), 32 deletions(-)

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 9f665b6..c6fc80e 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -46,6 +46,8 @@
  #include 
  #include 
  
+#define TCPUDP_MIB_MAX max_t(u32, UDP_MIB_MAX, TCP_MIB_MAX)

+
  /*
   *Report socket allocation statistics [m...@utu.fi]
   */
@@ -378,13 +380,15 @@ static void icmp_put(struct seq_file *seq)
  /*
   *Called from the PROCfs module. This outputs /proc/net/snmp.
   */
-static int snmp_seq_show(struct seq_file *seq, void *v)
+static int snmp_seq_show_ipstats(struct seq_file *seq, void *v)
  {
int i;
+   u64 buff64[IPSTATS_MIB_MAX];
struct net *net = seq->private;
  
-	seq_puts(seq, "Ip: Forwarding DefaultTTL");

+   memset(buff64, 0, IPSTATS_MIB_MAX * sizeof(u64));
  
+	seq_puts(seq, "Ip: Forwarding DefaultTTL");

for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_ipstats_list[i].name);
  
@@ -393,57 +397,77 @@ static int snmp_seq_show(struct seq_file *seq, void *v)

   net->ipv4.sysctl_ip_default_ttl);
  
  	BUILD_BUG_ON(offsetof(struct ipstats_mib, mibs) != 0);

+   snmp_get_cpu_field64_batch(buff64, snmp4_ipstats_list,
+  net->mib.ip_statistics,
+  offsetof(struct ipstats_mib, syncp));
for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
-   seq_printf(seq, " %llu",
-  snmp_fold_field64(net->mib.ip_statistics,
-snmp4_ipstats_list[i].entry,
-offsetof(struct ipstats_mib, 
syncp)));
+   seq_printf(seq, " %llu", buff64[i]);
  
-	icmp_put(seq);	/* RFC 2011 compatibility */

-   icmpmsg_put(seq);
+   return 0;
+}
+
+static int snmp_seq_show_tcp_udp(struct seq_file *seq, void *v)
+{
+   int i;
+   unsigned long buff[TCPUDP_MIB_MAX];
+   struct net *net = seq->private;
+
+   memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
  
  	seq_puts(seq, "\nTcp:");

for (i = 0; snmp4_tcp_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_tcp_list[i].name);
  
  	seq_puts(seq, "\nTcp:");

+   snmp_get_cpu_field_batch(buff, snmp4_tcp_list,
+net->mib.tcp_statistics);
for (i = 0; snmp4_tcp_list[i].name != NULL; i++) {
/* MaxConn field is signed, RFC 2012 */
if (snmp4_tcp_list[i].entry == TCP_MIB_MAXCONN)
-   seq_printf(seq, " %ld",
-  snmp_fold_field(net->mib.tcp_statistics,
-  snmp4_tcp_list[i].entry));
+   seq_printf(seq, " %ld", buff[i]);
else
-   seq_printf(seq, " %lu",
-  snmp_fold_field(net->mib.tcp_statistics,
-  snmp4_tcp_list[i].entry));
+   seq_printf(seq, " %lu", buff[i]);
}
  
+	memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));

+
+   snmp_get_cpu_field_batch(buff, snmp4_udp_list,
+net->mib.udp_statistics);
seq_puts(seq, "\nUdp:");
for (i = 0; snmp4_udp_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_udp_list[i].name);
-
seq_puts(seq, "\nUdp:");
for (i = 0; snmp4_udp_list[i].name != NULL; i++)
-   seq_printf(seq, " %lu",
-   

Re: [RFC PATCH v2 0/6] Reduce cache miss for snmp_fold_field

2016-09-06 Thread hejianet

Hi Marcelo

Thanks for the suggestion

Will consider that

B.R.

Jia


On 9/6/16 8:44 PM, Marcelo Ricardo Leitner wrote:

On Tue, Sep 06, 2016 at 10:30:03AM +0800, Jia He wrote:
...

v2:
- 1/6 fix bug in udplite statistics.
- 1/6 snmp_seq_show is split into 2 parts

Jia He (6):
   proc: Reduce cache miss in {snmp,netstat}_seq_show
   proc: Reduce cache miss in snmp6_seq_show
   proc: Reduce cache miss in sctp_snmp_seq_show
   proc: Reduce cache miss in xfrm_statistics_seq_show
   ipv6: Remove useless parameter in __snmp6_fill_statsdev
   net: Suppress the "Comparison to NULL could be written" warning

Hi Jia,

Did you try to come up with a generic interface for this, like
snmp_fold_fields64() (note the fieldS) or snmp_fold_field64_batch() ?

Sounds like we have the same code in several places and seems they all
operate very similarly. They have a percpu table, an identified max, a
destination buffer..

If this is possible, this would reduce the possibility of hiccups in a
particular code.

   Marcelo







Re: [RFC PATCH v2 1/6] proc: Reduce cache miss in {snmp,netstat}_seq_show

2016-09-06 Thread hejianet



On 9/7/16 6:57 AM, David Miller wrote:

From: Jia He 
Date: Tue,  6 Sep 2016 10:30:04 +0800


+#define MAX(a, b) ((u32)(a) >= (u32)(b) ? (a) : (b))

Thanks
B.R.
Jia


Please do not define private min/max macros, use the existing max_t()
or similar as needed.






Re: [RFC PATCH 1/6] proc: Reduce cache miss in {snmp,netstat}_seq_show

2016-08-30 Thread hejianet

Hi Eric


On 8/30/16 12:41 AM, Eric Dumazet wrote:

On Tue, 2016-08-30 at 00:03 +0800, Jia He wrote:

This patch exchanges the two loops for collecting the percpu statistics
data. This can aggregate the data by going through all the items of each
cpu sequentially. In snmp_seq_show, just use one buff copy to display the
Udp and UdpLite data because they are the same.

This is obviously not true.

On my laptop it seems it handled no UdpLite frames, but got plenty of
Udp ones.

$ grep Udp /proc/net/snmp
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors 
InCsumErrors
Udp: 1555889 108318 0 3740780 0 0 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors 
InCsumErrors
UdpLite: 0 0 0 0 0 0 0

Thanks, you are right. I misunderstood the comments in the source code.
Will resend it

B.R.
Jia




.






Re: [RFC PATCH 1/3] net: Remove unnecessary memset in __snmp6_fill_stats64

2016-08-08 Thread hejianet

Yes, sorry about it, I was too hasty

B.R.

Jia He

On 8/8/16 7:12 PM, Florian Westphal wrote:

Jia He  wrote:

buff[] will be assigned later, so memset is not necessary.

Signed-off-by: Jia He 
Cc: "David S. Miller" 
Cc: Alexey Kuznetsov 
Cc: James Morris 
Cc: Hideaki YOSHIFUJI 
Cc: Patrick McHardy 
---
  net/ipv6/addrconf.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index ab3e796..43fa8d0 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4967,7 +4967,6 @@ static inline void __snmp6_fill_stats64(u64 *stats, void 
__percpu *mib,
  
  	BUG_ON(pad < 0);
  
-	memset(buff, 0, sizeof(buff));

buff[0] = IPSTATS_MIB_MAX;
  
  	for_each_possible_cpu(c) {

 for (i = 1; i < IPSTATS_MIB_MAX; i++)
 buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff);

Without memset result of buff[i] += ... is undefined.






Re: [RFC PATCH] mm/hugetlb: Avoid soft lockup in set_max_huge_pages()

2016-07-26 Thread hejianet

Hi Dave

On 7/26/16 11:58 PM, Dave Hansen wrote:

On 07/26/2016 08:44 AM, Jia He wrote:

This patch is to fix such soft lockup. I thought it is safe to call
cond_resched() because alloc_fresh_gigantic_page and alloc_fresh_huge_page
are out of the spin_lock/unlock section.

Yikes.  So the call site for both the things you patch is this:


 while (count > persistent_huge_pages(h)) {

...

 spin_unlock(&hugetlb_lock);
 if (hstate_is_gigantic(h))
 ret = alloc_fresh_gigantic_page(h, nodes_allowed);
 else
 ret = alloc_fresh_huge_page(h, nodes_allowed);
 spin_lock(&hugetlb_lock);

and you choose to patch both of the alloc_*() functions.  Why not just
fix it at the common call site?  Seems like that
spin_lock(&hugetlb_lock) could be a cond_resched_lock() which would fix
both cases.

I agree to move the cond_resched() to a common site in set_max_huge_pages().
But do you mean the spin_lock in this while loop can be replaced by
cond_resched_lock?
IIUC, cond_resched_lock = spin_unlock+cond_resched+spin_lock.
So could you please explain more details about it? Thanks.

B.R.
Justin

Also, putting that cond_resched() inside the for_each_node*() loop is an
odd choice.  It seems to indicate that the loops can take a long time,
which really isn't the case.  The _loop_ isn't long, right?






Re: [RFC PATCH] mm/hugetlb: Avoid soft lockup in set_max_huge_pages()

2016-07-26 Thread hejianet



On 7/26/16 11:58 PM, Dave Hansen wrote:

On 07/26/2016 08:44 AM, Jia He wrote:

This patch is to fix such soft lockup. I thouhgt it is safe to call
cond_resched() because alloc_fresh_gigantic_page and alloc_fresh_huge_page
are out of spin_lock/unlock section.

Yikes.  So the call site for both the things you patch is this:


 while (count > persistent_huge_pages(h)) {

...

 spin_unlock(&hugetlb_lock);
 if (hstate_is_gigantic(h))
 ret = alloc_fresh_gigantic_page(h, nodes_allowed);
 else
 ret = alloc_fresh_huge_page(h, nodes_allowed);
 spin_lock(&hugetlb_lock);

and you choose to patch both of the alloc_*() functions.  Why not just
fix it at the common call site?  Seems like that
spin_lock(&hugetlb_lock) could be a cond_resched_lock() which would fix
both cases.

Also, putting that cond_resched() inside the for_each_node*() loop is an
odd choice.  It seems to indicate that the loops can take a long time,
which really isn't the case.  The _loop_ isn't long, right?

Yes,thanks for the suggestions
Will send out V2 later

B.R.




Re: [PATCH v2 2/3] lib: Introduce 2 bit ops api: all_is_bit_{one,zero}

2015-11-19 Thread hejianet

Thanks, I will add it in the next version
B.R.
Justin

On 11/19/15 4:40 PM, xinhui wrote:

hi, jia
Nice patch. But I have one minor question. see inline comments.

On 2015/11/19 14:48, Jia He wrote:

This patch introduces 2 lightweight bit api.
all_bit_is_zero return 1 if the bit string is all zero.
The addr is the start address, the size is the bit size of the bit 
string.

all_bit_is_one is the opposite.

Signed-off-by: Jia He 
---
  lib/find_bit.c | 50 ++
  1 file changed, 50 insertions(+)

diff --git a/lib/find_bit.c b/lib/find_bit.c
index 18072ea..1d56d8d 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -131,6 +131,56 @@ unsigned long find_last_bit(const unsigned long 
*addr, unsigned long size)

  EXPORT_SYMBOL(find_last_bit);
  #endif

+#ifndef all_bit_is_zero
+/*
+ * return val: 1 means all bit is zero
+ */
+unsigned int all_bit_is_zero(const unsigned long *addr, unsigned size)
+{
 Seems better that size should be of type "unsigned long". Otherwise
I'm afraid when we compare idx * BITS_PER_LONG with size, there might
be an overflow issue.



+unsigned long idx;
+unsigned long mask = size;
+
+if (unlikely(size == 0))
+return 1;
+
+if (size > BITS_PER_LONG) {
+for (idx = 0; idx * BITS_PER_LONG < size; idx++)
+if (addr[idx])
+return 0;
+
+mask = size - (idx - 1) * BITS_PER_LONG;
+}
+
+return !(*addr & BITMAP_LAST_WORD_MASK(mask));
+}
+EXPORT_SYMBOL(all_bit_is_zero);
+#endif
+
+#ifndef all_bit_is_one
+/*
+ * return val: 1 means all bit is one
+ */
+unsigned int all_bit_is_one(const unsigned long *addr, unsigned size)
+{

this arg, size, should be of type "unsigned long", too.

thanks
xinhui


+unsigned long idx;
+unsigned long mask = size;
+
+if (unlikely(size == 0))
+return 1;
+
+if (size > BITS_PER_LONG) {
+for (idx = 0; idx * BITS_PER_LONG < size; idx++)
+if (~addr[idx])
+return 0;
+
+mask = size - (idx - 1) * BITS_PER_LONG;
+}
+
+return !(~(*addr) & BITMAP_LAST_WORD_MASK(mask));
+}
+EXPORT_SYMBOL(all_bit_is_one);
+#endif
+
  #ifdef __BIG_ENDIAN

  /* include/linux/byteorder does not support "unsigned long" type */






--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 2/3] lib: Introduce 2 bit ops api: all_is_bit_{one,zero}

2015-11-19 Thread hejianet

Thanks, I will add it in next verison
B.R.
Justin

在 11/19/15 4:40 PM, xinhui 写道:

hi, jia
Nice patch. But I have one minor question. see inline comments.

On 2015/11/19 14:48, Jia He wrote:

This patch introduces 2 lightweight bit api.
all_bit_is_zero return 1 if the bit string is all zero.
The addr is the start address, the size is the bit size of the bit 
string.

all_bit_is_one is the opposite.

Signed-off-by: Jia He 
---
  lib/find_bit.c | 50 ++
  1 file changed, 50 insertions(+)

diff --git a/lib/find_bit.c b/lib/find_bit.c
index 18072ea..1d56d8d 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -131,6 +131,56 @@ unsigned long find_last_bit(const unsigned long 
*addr, unsigned long size)

  EXPORT_SYMBOL(find_last_bit);
  #endif

+#ifndef all_bit_is_zero
+/*
+ * return val: 1 means all bit is zero
+ */
+unsigned int all_bit_is_zero(const unsigned long *addr, unsigned size)
+{
 Seems better that size should be type of "unsigned long". Otherwise 
I'm afraid when we compare idx * BITS_PER_LONG with size, there might 
be overflow issue.



+unsigned long idx;
+unsigned long mask = size;
+
+if (unlikely(size == 0))
+return 1;
+
+if (size > BITS_PER_LONG) {
+for (idx = 0; idx * BITS_PER_LONG < size; idx++)
+if (addr[idx])
+return 0;
+
+mask = size - (idx - 1) * BITS_PER_LONG;
+}
+
+return !(*addr & BITMAP_LAST_WORD_MASK(mask));
+}
+EXPORT_SYMBOL(all_bit_is_zero);
+#endif
+
+#ifndef all_bit_is_one
+/*
+ * return val: 1 means all bit is one
+ */
+unsigned int all_bit_is_one(const unsigned long *addr, unsigned size)
+{

this argc of size should be type of "unsigned long", too.

thanks
xinhui


+unsigned long idx;
+unsigned long mask = size;
+
+if (unlikely(size == 0))
+return 1;
+
+if (size > BITS_PER_LONG) {
+for (idx = 0; idx * BITS_PER_LONG < size; idx++)
+if (~addr[idx])
+return 0;
+
+mask = size - (idx - 1) * BITS_PER_LONG;
+}
+
+return !(~(*addr) & BITMAP_LAST_WORD_MASK(mask));
+}
+EXPORT_SYMBOL(all_bit_is_one);
+#endif
+
  #ifdef __BIG_ENDIAN

  /* include/linux/byteorder does not support "unsigned long" type */



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] linux/bitmap: Replace find_fisrt_{zero_}bit with the new lightweight api

2015-11-18 Thread hejianet
Thanks, I only compiled and tested on x86_64; I will check what's wrong
on m68k.


B.R.
Justin


On 11/19/15 10:53 AM, kbuild test robot wrote:

Hi Jia,

[auto build test ERROR on: v4.4-rc1]
[also build test ERROR on: next-20151118]

url:
https://github.com/0day-ci/linux/commits/Jia-He/Improve-bitmap_empty-and-bitmap_full/20151119-103554
config: m68k-allyesconfig (attached as .config)
reproduce:
 wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
 chmod +x ~/bin/make.cross
 # save the attached .config to linux build tree
 make.cross ARCH=m68k

All errors (new ones prefixed by >>):

kernel/time/Kconfig:155:warning: range is invalid
warning: (USB_OTG_FSM && FSL_USB2_OTG && USB_MV_OTG) selects USB_OTG which has unmet direct 
dependencies (USB_SUPPORT && USB && PM)
warning: (SINGLE_MEMORY_CHUNK) selects NEED_MULTIPLE_NODES which has unmet 
direct dependencies (DISCONTIGMEM || NUMA)
warning: (PREEMPT && DEBUG_ATOMIC_SLEEP) selects PREEMPT_COUNT which has 
unmet direct dependencies (COLDFIRE)
warning: (USB_OTG_FSM && FSL_USB2_OTG && USB_MV_OTG) selects USB_OTG which has unmet direct 
dependencies (USB_SUPPORT && USB && PM)
warning: (SINGLE_MEMORY_CHUNK) selects NEED_MULTIPLE_NODES which has unmet 
direct dependencies (DISCONTIGMEM || NUMA)
warning: (PREEMPT && DEBUG_ATOMIC_SLEEP) selects PREEMPT_COUNT which has 
unmet direct dependencies (COLDFIRE)
In file included from include/linux/cpumask.h:11:0,
 from include/linux/rcupdate.h:40,
 from include/linux/rbtree.h:34,
 from include/linux/sched.h:22,
 from arch/m68k/kernel/asm-offsets.c:14:
include/linux/bitmap.h: In function 'bitmap_empty':

include/linux/bitmap.h:284:2: error: implicit declaration of function 
'all_bit_is_zero' [-Werror=implicit-function-declaration]

  return all_bit_is_zero(src, nbits);
  ^
include/linux/bitmap.h: In function 'bitmap_full':

include/linux/bitmap.h:292:2: error: implicit declaration of function 
'all_bit_is_one' [-Werror=implicit-function-declaration]

  return all_bit_is_one(src, nbits);
  ^
cc1: some warnings being treated as errors
make[2]: *** [arch/m68k/kernel/asm-offsets.s] Error 1
make[2]: Target '__build' not remade because of errors.
make[1]: *** [prepare0] Error 2
make[1]: Target 'prepare' not remade because of errors.
make: *** [sub-make] Error 2

vim +/all_bit_is_zero +284 include/linux/bitmap.h

278 
279 static inline int bitmap_empty(const unsigned long *src, unsigned nbits)
280 {
281 if (small_const_nbits(nbits))
282 return ! (*src & BITMAP_LAST_WORD_MASK(nbits));
283 
  > 284  return all_bit_is_zero(src, nbits);
285 }
286 
287 static inline int bitmap_full(const unsigned long *src, unsigned int 
nbits)
288 {
289 if (small_const_nbits(nbits))
290 return ! (~(*src) & BITMAP_LAST_WORD_MASK(nbits));
291 
  > 292  return all_bit_is_one(src, nbits);
293 }
294 
295 static __always_inline int bitmap_weight(const unsigned long *src, 
unsigned int nbits)

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation






Re: [PATCH] staging: lustre: Fix sparse warning on address-space

2015-01-13 Thread hejianet
Hi all
Ping again, any comments are welcome

Justin

On Mon, 29 Dec 2014 20:49:40 +0800 from hejia...@gmail.com wrote:
> This patch is to fix the below warnings generated by sparse:
>  warning: incorrect type in argument 1 (different address spaces)
>  expected void const volatile [noderef] <asn:1> *
>  got int *
> Adding the macro "__user" suppresses the warnings; some lines are also
> adjusted to meet the 80-character line limit.
> Already checked by sparse "C=2"
>
> Signed-off-by: Jia He 
> Cc: Oleg Drokin  
> Cc: Andreas Dilger  
> Cc: Greg Kroah-Hartman 
> ---
>  .../staging/lustre/lustre/include/lprocfs_status.h | 18 -
>  .../lustre/lustre/obdclass/lprocfs_status.c|  6 +--
>  drivers/staging/lustre/lustre/osc/lproc_osc.c  | 44 
> +++---
>  drivers/staging/lustre/lustre/osc/osc_request.c|  6 +--
>  .../staging/lustre/lustre/ptlrpc/lproc_ptlrpc.c|  6 +--
>  5 files changed, 40 insertions(+), 40 deletions(-)
>
> diff --git a/drivers/staging/lustre/lustre/include/lprocfs_status.h 
> b/drivers/staging/lustre/lustre/include/lprocfs_status.h
> index cfe503b..ed4932f 100644
> --- a/drivers/staging/lustre/lustre/include/lprocfs_status.h
> +++ b/drivers/staging/lustre/lustre/include/lprocfs_status.h
> @@ -631,13 +631,13 @@ extern int lprocfs_wr_timeouts(struct file *file, const 
> char *buffer,
>  unsigned long count, void *data);
>  extern int lprocfs_wr_evict_client(struct file *file, const char *buffer,
>   size_t count, loff_t *off);
> -extern int lprocfs_wr_ping(struct file *file, const char *buffer,
> +extern int lprocfs_wr_ping(struct file *file, const char __user *buffer,
>  size_t count, loff_t *off);
> -extern int lprocfs_wr_import(struct file *file, const char *buffer,
> +extern int lprocfs_wr_import(struct file *file, const char __user *buffer,
> size_t count, loff_t *off);
>  extern int lprocfs_rd_pinger_recov(struct seq_file *m, void *n);
> -extern int lprocfs_wr_pinger_recov(struct file *file, const char *buffer,
> -size_t count, loff_t *off);
> +extern int lprocfs_wr_pinger_recov(struct file *file,
> + const char __user *buffer, size_t count, loff_t *off);
>  
>  /* Statfs helpers */
>  extern int lprocfs_rd_blksize(struct seq_file *m, void *data);
> @@ -650,9 +650,9 @@ extern int lprocfs_rd_filesfree(struct seq_file *m, void 
> *data);
>  extern int lprocfs_write_helper(const char __user *buffer, unsigned long 
> count,
>   int *val);
>  extern int lprocfs_seq_read_frac_helper(struct seq_file *m, long val, int 
> mult);
> -extern int lprocfs_write_u64_helper(const char *buffer, unsigned long count,
> - __u64 *val);
> -extern int lprocfs_write_frac_u64_helper(const char *buffer,
> +extern int lprocfs_write_u64_helper(const char __user *buffer,
> + unsigned long count, __u64 *val);
> +extern int lprocfs_write_frac_u64_helper(const char __user *buffer,
>unsigned long count,
>__u64 *val, int mult);
>  extern char *lprocfs_find_named_value(const char *buffer, const char *name,
> @@ -716,7 +716,7 @@ static struct file_operations name##_fops = { 
> \
>   return lprocfs_rd_##type(m, m->private);\
>   }   \
>   static ssize_t name##_##type##_seq_write(struct file *file, \
> - const char *buffer, size_t count, loff_t *off)  \
> + const char __user *buffer, size_t count, loff_t *off)   \
>   {   \
>   struct seq_file *seq = file->private_data;  \
>   return lprocfs_wr_##type(file, buffer,  \
> @@ -726,7 +726,7 @@ static struct file_operations name##_fops = { 
> \
>  
>  #define LPROC_SEQ_FOPS_WR_ONLY(name, type)   \
>   static ssize_t name##_##type##_write(struct file *file, \
> - const char *buffer, size_t count, loff_t *off)  \
> + const char __user *buffer, size_t count, loff_t *off)   \
>   {   \
>   return lprocfs_wr_##type(file, buffer, count, off); \
>   }   \
> diff --git a/drivers/staging/lustre/lustre/obdclass/lprocfs_status.c 
> b/drivers/staging/lustre/lustre/obdclass/lprocfs_status.c
> index 3b7dfc3..9a84b52 100644
> --- a/drivers/staging/lustre/lustre/obdclass/lprocfs_status.c
> +++ b/drivers/staging/lustre/lustre/obdclass/lprocfs_status.c
> @@ -1849,15 +1849,15 @@ int lprocfs_seq_read_frac_helper(struct seq_file *m, 
> long val, int mult)

Re: [PATCH] staging: lustre: libcfs: fix sparse warnings about static declaration

2015-01-13 Thread hejianet
Hi all
Ping again, any comments are welcome

Justin

On Tue, 16 Dec 2014 22:39:38 +0800 from hejia...@gmail.com wrote:
> Make sparse happy, since these two functions are only used in module.c.
> Tested by successful compilation.
>
> Signed-off-by: Jia He 
> Cc: Oleg Drokin 
> Cc: Andreas Dilger 
> Cc: Greg Kroah-Hartman 
> ---
>  drivers/staging/lustre/lustre/libcfs/module.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/staging/lustre/lustre/libcfs/module.c 
> b/drivers/staging/lustre/lustre/libcfs/module.c
> index 2c4fc74..b9ee37c 100644
> --- a/drivers/staging/lustre/lustre/libcfs/module.c
> +++ b/drivers/staging/lustre/lustre/libcfs/module.c
> @@ -42,8 +42,7 @@
>  #include "../../include/linux/lnet/lnet.h"
>  #include "tracefile.h"
>  
> -void
> -kportal_memhog_free (struct libcfs_device_userstate *ldu)
> +static void kportal_memhog_free (struct libcfs_device_userstate *ldu)
>  {
>   struct page **level0p = >ldu_memhog_root_page;
>   struct page **level1p;
> @@ -86,8 +85,7 @@ kportal_memhog_free (struct libcfs_device_userstate *ldu)
>   LASSERT (ldu->ldu_memhog_pages == 0);
>  }
>  
> -int
> -kportal_memhog_alloc(struct libcfs_device_userstate *ldu, int npages,
> +static int kportal_memhog_alloc(struct libcfs_device_userstate *ldu, int 
> npages,
>gfp_t flags)
>  {
>   struct page **level0p;




Re: [PATCH] staging: lustre: libcfs: fix sparse warnings about static declaration

2014-12-28 Thread hejianet
Ping.
Thanks for any comments

On Tue, 16 Dec 2014 22:39:38 +0800 from hejia...@gmail.com wrote:
> Make sparse happy, since these two functions are only used in module.c.
> Tested by successful compilation.
>
> Signed-off-by: Jia He 
> Cc: Oleg Drokin 
> Cc: Andreas Dilger 
> Cc: Greg Kroah-Hartman 
> ---
>  drivers/staging/lustre/lustre/libcfs/module.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/staging/lustre/lustre/libcfs/module.c 
> b/drivers/staging/lustre/lustre/libcfs/module.c
> index 2c4fc74..b9ee37c 100644
> --- a/drivers/staging/lustre/lustre/libcfs/module.c
> +++ b/drivers/staging/lustre/lustre/libcfs/module.c
> @@ -42,8 +42,7 @@
>  #include "../../include/linux/lnet/lnet.h"
>  #include "tracefile.h"
>  
> -void
> -kportal_memhog_free (struct libcfs_device_userstate *ldu)
> +static void kportal_memhog_free (struct libcfs_device_userstate *ldu)
>  {
>   struct page **level0p = >ldu_memhog_root_page;
>   struct page **level1p;
> @@ -86,8 +85,7 @@ kportal_memhog_free (struct libcfs_device_userstate *ldu)
>   LASSERT (ldu->ldu_memhog_pages == 0);
>  }
>  
> -int
> -kportal_memhog_alloc(struct libcfs_device_userstate *ldu, int npages,
> +static int kportal_memhog_alloc(struct libcfs_device_userstate *ldu, int 
> npages,
>gfp_t flags)
>  {
>   struct page **level0p;





Re: [PATCH] staging: rtl8188eu: Fix coding style space related ERROR problems

2014-11-03 Thread hejianet
Hi Greg
Sorry for the inconvenience.
It seems I generated the patch against your master branch instead of
staging-test.
I will rebase it later.

B.R.
Justin

On Mon, 3 Nov 2014 16:25:57 -0800 from gre...@linuxfoundation.org wrote:
> On Fri, Oct 31, 2014 at 10:07:15AM +0800, Jia He wrote:
>> This fixes space related ERROR reports by checkpatch.pl
>> Generated by $ git ls-files "drivers/staging/rtl8188eu/*.[ch]" | \
>>   xargs ./scripts/checkpatch.pl -f --fix-inplace --strict --types=SPACING
>> Already checked by text comparison
>> $git diff -w
>> and binary comparison of r8188eu.ko
>> $objdiff diff  
>>
>> Signed-off-by: Jia He 
> Doesn't apply to my tree at all:
>
> checking file drivers/staging/rtl8188eu/core/rtw_cmd.c
> Hunk #1 succeeded at 637 (offset -1 lines).
> checking file drivers/staging/rtl8188eu/core/rtw_ieee80211.c
> checking file drivers/staging/rtl8188eu/core/rtw_mlme.c
> Hunk #1 succeeded at 2115 (offset -2 lines).
> checking file drivers/staging/rtl8188eu/core/rtw_mlme_ext.c
> checking file drivers/staging/rtl8188eu/core/rtw_pwrctrl.c
> Reversed (or previously applied) patch detected!  Assume -R? [n] 
> Apply anyway? [n] 
> Skipping patch.
> 1 out of 1 hunk ignored
> checking file drivers/staging/rtl8188eu/core/rtw_recv.c
> checking file drivers/staging/rtl8188eu/core/rtw_security.c
> checking file drivers/staging/rtl8188eu/core/rtw_sta_mgt.c
> checking file drivers/staging/rtl8188eu/core/rtw_xmit.c
> checking file drivers/staging/rtl8188eu/hal/bb_cfg.c
> Reversed (or previously applied) patch detected!  Assume -R? [n] 
> Apply anyway? [n] 
> Skipping patch.
> 2 out of 2 hunks ignored
> checking file drivers/staging/rtl8188eu/hal/fw.c
> checking file drivers/staging/rtl8188eu/hal/mac_cfg.c
> checking file drivers/staging/rtl8188eu/hal/odm.c
> checking file drivers/staging/rtl8188eu/hal/odm_HWConfig.c
> checking file drivers/staging/rtl8188eu/hal/odm_RTL8188E.c
> checking file drivers/staging/rtl8188eu/hal/phy.c
> checking file drivers/staging/rtl8188eu/hal/rf.c
> checking file drivers/staging/rtl8188eu/hal/rf_cfg.c
> checking file drivers/staging/rtl8188eu/hal/rtl8188e_cmd.c
> Hunk #1 succeeded at 252 (offset -2 lines).
> checking file drivers/staging/rtl8188eu/hal/rtl8188eu_xmit.c
> checking file drivers/staging/rtl8188eu/hal/usb_halinit.c
> checking file drivers/staging/rtl8188eu/include/ieee80211_ext.h
> checking file drivers/staging/rtl8188eu/include/odm_debug.h
> Hunk #1 FAILED at 125.
> 1 out of 1 hunk FAILED
> checking file drivers/staging/rtl8188eu/include/osdep_service.h
> checking file drivers/staging/rtl8188eu/include/rtw_debug.h
> checking file drivers/staging/rtl8188eu/include/rtw_led.h
> checking file drivers/staging/rtl8188eu/include/rtw_mlme_ext.h
> checking file drivers/staging/rtl8188eu/include/wifi.h
> checking file drivers/staging/rtl8188eu/os_dep/ioctl_linux.c
> checking file drivers/staging/rtl8188eu/os_dep/os_intfs.c
> checking file drivers/staging/rtl8188eu/os_dep/usb_ops_linux.c
> checking file drivers/staging/rtl8188eu/os_dep/xmit_linux.c
>
>
>





Re: [PATCHv3] staging: rtl8188eu: Fix coding style space related ERROR problems

2014-11-02 Thread hejianet
Hi Greg

On Thu, 30 Oct 2014 08:29:14 -0700 from gre...@linuxfoundation.org wrote:
> On Thu, Oct 30, 2014 at 11:21:41PM +0800, Jia He wrote:
>> From v2: rebased upon gregkh/staging.git
>> From v1: suggested by j...@perches.com, fixed some points not checked
>> or not correctly checked by checkpatch.pl
> What is all of this in here for?
This is my changelog for patches v1 -> v3.
I have removed it in my latest mail, in case it was boring.
Thanks for any further comments
B.R.
Justin(Jia He)
>
> And you have odd spacing problems in your Subject :(
>
> greg k-h
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> .
>





Re: [PATCHv2] staging: rtl8188eu: Fix coding style space related ERROR problems

2014-10-16 Thread hejianet
Hi Greg,
Just pinging, as this is a practice patch for Eudyptula Challenge task 10.
Any comment is welcome, thank you :)

On Tue, 14 Oct 2014 11:28:32 +0800 from hejia...@gmail.com wrote:
> This fixes space related ERROR reports by checkpatch.pl
> Generated by $ git ls-files "drivers/staging/rtl8188eu/*.[ch]" | \
>   xargs ./scripts/checkpatch.pl -f --fix-inplace --strict --types=SPACING
> Already checked by text comparison
> $ git diff -w
> and binary comparison of r8188eu.ko
> $ objdiff diff <old_commit> <new_commit>
>
> Signed-off-by: Jia He <hejia...@gmail.com>
> Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org>
> ---
>  drivers/staging/rtl8188eu/core/rtw_cmd.c  |  2 +-
>  drivers/staging/rtl8188eu/core/rtw_ieee80211.c| 16 +-
>  drivers/staging/rtl8188eu/core/rtw_mlme.c |  2 +-
>  drivers/staging/rtl8188eu/core/rtw_mlme_ext.c | 22 +++---
>  drivers/staging/rtl8188eu/core/rtw_pwrctrl.c  |  2 +-
>  drivers/staging/rtl8188eu/core/rtw_recv.c | 12 
>  drivers/staging/rtl8188eu/core/rtw_security.c | 20 ++---
>  drivers/staging/rtl8188eu/core/rtw_sta_mgt.c  |  2 +-
>  drivers/staging/rtl8188eu/core/rtw_xmit.c | 10 +++
>  drivers/staging/rtl8188eu/hal/bb_cfg.c|  4 +--
>  drivers/staging/rtl8188eu/hal/fw.c|  8 ++---
>  drivers/staging/rtl8188eu/hal/mac_cfg.c   |  2 +-
>  drivers/staging/rtl8188eu/hal/odm.c   |  8 ++---
>  drivers/staging/rtl8188eu/hal/odm_HWConfig.c  |  2 +-
>  drivers/staging/rtl8188eu/hal/odm_RTL8188E.c  |  2 +-
>  drivers/staging/rtl8188eu/hal/phy.c   |  2 +-
>  drivers/staging/rtl8188eu/hal/rf.c|  4 +--
>  drivers/staging/rtl8188eu/hal/rf_cfg.c|  4 +--
>  drivers/staging/rtl8188eu/hal/rtl8188e_cmd.c  |  2 +-
>  drivers/staging/rtl8188eu/hal/rtl8188eu_xmit.c|  2 +-
>  drivers/staging/rtl8188eu/hal/usb_halinit.c   |  2 +-
>  drivers/staging/rtl8188eu/include/ieee80211_ext.h | 20 ++---
>  drivers/staging/rtl8188eu/include/odm_debug.h |  2 +-
>  drivers/staging/rtl8188eu/include/osdep_service.h |  4 +--
>  drivers/staging/rtl8188eu/include/rtw_debug.h |  2 +-
>  drivers/staging/rtl8188eu/include/rtw_led.h   |  2 +-
>  drivers/staging/rtl8188eu/include/rtw_mlme_ext.h  | 26 
>  drivers/staging/rtl8188eu/include/wifi.h  | 36 +++
>  drivers/staging/rtl8188eu/os_dep/ioctl_linux.c|  2 +-
>  drivers/staging/rtl8188eu/os_dep/os_intfs.c   |  2 +-
>  drivers/staging/rtl8188eu/os_dep/usb_ops_linux.c  |  4 +--
>  drivers/staging/rtl8188eu/os_dep/xmit_linux.c |  2 +-
>  32 files changed, 116 insertions(+), 116 deletions(-)
>
> diff --git a/drivers/staging/rtl8188eu/core/rtw_cmd.c 
> b/drivers/staging/rtl8188eu/core/rtw_cmd.c
> index 9935e66..7b59a10 100644
> --- a/drivers/staging/rtl8188eu/core/rtw_cmd.c
> +++ b/drivers/staging/rtl8188eu/core/rtw_cmd.c
> @@ -638,7 +638,7 @@ u8 rtw_setstakey_cmd(struct adapter *padapter, u8 *psta, 
> u8 unicast_key)
>   ether_addr_copy(psetstakey_para->addr, sta->hwaddr);
>  
>   if (check_fwstate(pmlmepriv, WIFI_STATION_STATE))
> - psetstakey_para->algorithm = (unsigned char) 
> psecuritypriv->dot11PrivacyAlgrthm;
> + psetstakey_para->algorithm = (unsigned 
> char)psecuritypriv->dot11PrivacyAlgrthm;
>   else
>   GET_ENCRY_ALGO(psecuritypriv, sta, psetstakey_para->algorithm, 
> false);
>  
> diff --git a/drivers/staging/rtl8188eu/core/rtw_ieee80211.c 
> b/drivers/staging/rtl8188eu/core/rtw_ieee80211.c
> index 755d3ef..f2c3ca7 100644
> --- a/drivers/staging/rtl8188eu/core/rtw_ieee80211.c
> +++ b/drivers/staging/rtl8188eu/core/rtw_ieee80211.c
> @@ -159,7 +159,7 @@ u8 *rtw_set_ie
>   return pbuf + len + 2;
>  }
>  
> -inline u8 *rtw_set_ie_ch_switch (u8 *buf, u32 *buf_len, u8 ch_switch_mode,
> +inline u8 *rtw_set_ie_ch_switch(u8 *buf, u32 *buf_len, u8 ch_switch_mode,
>   u8 new_ch, u8 ch_switch_cnt)
>  {
>   u8 ie_data[3];
> @@ -870,7 +870,7 @@ static int rtw_ieee802_11_parse_vendor_specific(u8 *pos, 
> uint elen,
>   if (elen < 4) {
>   if (show_errors) {
>   DBG_88E("short vendor specific information element 
> ignored (len=%lu)\n",
> - (unsigned long) elen);
> + (unsigned long)elen);
>   }
>   return -1;
>   }
> @@ -890,7 +890,7 @@ static int rtw_ieee802_11_parse_vendor_specific(u8 *pos, 
> uint elen,
>   case WME_OUI_TYPE: /* this is a Wi-Fi WME info. element */
>   if (elen < 5) {
>   DBG_88E("short WME information element ignored 
> (len=%lu)\n",
> - (unsigned long) elen);
> + (unsigned long)elen);
>   return -1;
>   }
>   switch (pos[4]) {
> @@ -905,7 +905,7 @@ static int 

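For readers following the verification steps described in the patch above: `git diff -w` ignores whitespace-only changes, so an empty `git diff -w` together with a non-empty plain `git diff` confirms that a cleanup touched nothing but spacing. A minimal, self-contained sketch of that check in a throwaway repository (file name, identity, and content are made up for illustration):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
# Original code with the kind of spacing checkpatch --types=SPACING flags:
printf 'int f(int x) { return (unsigned char) x; }\n' > a.c
git add a.c
git -c user.name=demo -c user.email=demo@example.com commit -qm init
# Whitespace-only edit, like the fixes in this patch:
printf 'int f(int x) { return (unsigned char)x; }\n' > a.c
git diff -w -- a.c    # prints nothing: the change is whitespace-only
git diff -- a.c       # non-empty: the change itself is still visible
```

The binary comparison with objdiff serves the same purpose one level down: a whitespace-only source change must leave the compiled r8188eu.ko object code identical.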
Re: [PATCH 1/2] staging: rtl8188eu: Fix coding style space missing problems

2014-10-12 Thread hejianet
 Thanks for the suggestion, I will update it later

On Thu, 09 Oct 2014 07:35:59 -0700 from j...@perches.com wrote:
> git ls-files "drivers/staging/rtl8188eu/*.[ch]" | \
>   xargs ./scripts/checkpatch.pl -f --fix-inplace --strict --types=SPACING


