Re: WARNING in kvmalloc_node

2018-02-14 Thread Jesper Dangaard Brouer
On Wed, 14 Feb 2018 13:17:18 +0100
Daniel Borkmann <dan...@iogearbox.net> wrote:

> On 02/14/2018 01:02 PM, Jason Wang wrote:
> > On 2018年02月14日 19:51, Michal Hocko wrote:  
> >> On Wed 14-02-18 19:47:30, Jason Wang wrote:  
> >>> On 2018年02月14日 17:28, Daniel Borkmann wrote:  
> >>>> [ +Jason, +Jesper ]
> >>>>
> >>>> On 02/14/2018 09:43 AM, Michal Hocko wrote:  
> >>>>> On Tue 13-02-18 18:55:33, Matthew Wilcox wrote:  
> >>>>>> On Tue, Feb 13, 2018 at 03:59:01PM -0800, syzbot wrote:  
> >>>>> [...]  
> >>>>>>>    kvmalloc include/linux/mm.h:541 [inline]
> >>>>>>>    kvmalloc_array include/linux/mm.h:557 [inline]
> >>>>>>>    __ptr_ring_init_queue_alloc include/linux/ptr_ring.h:474 [inline]
> >>>>>>>    ptr_ring_init include/linux/ptr_ring.h:492 [inline]
> >>>>>>>    __cpu_map_entry_alloc kernel/bpf/cpumap.c:359 [inline]
> >>>>>>>    cpu_map_update_elem+0x3c3/0x8e0 kernel/bpf/cpumap.c:490
> >>>>>>>    map_update_elem kernel/bpf/syscall.c:698 [inline]  
> >>>>>> Blame the BPF people, not the MM people ;-)  
> >>>> Heh, not really. ;-)
> >>>>  
> >>>>> Yes. kvmalloc (the vmalloc part) doesn't support GFP_ATOMIC semantic.  
> >>>> Agree, that doesn't work.
> >>>>
> >>>> Bug was added in commit 0bf7800f1799 ("ptr_ring: try vmalloc() when 
> >>>> kmalloc() fails").
> >>>>
> >>>> Jason, please take a look at fixing this, thanks!  
> >>> It looks to me the only solution is to revert that commit.  
> >> Do you really need this to be GFP_ATOMIC? I can see some callers are
> >> under RCU read lock but can we perhaps do the allocation outside of this
> >> section?  
> > 
> > If I understand the code correctly, the code would be called by XDP program 
> > (usually run inside a bh) which makes it hard to do this.
> > 
> > Rethink of this, we can probably test gfp and not call kvmalloc if 
> > GFP_ATOMIC is set in __ptr_ring_init_queue_alloc().  
> 
> That would be one option indeed (probably useful in any case to make the API
> more robust). Another one is to just not use GFP_ATOMIC in cpumap. Looking at
> it, update can neither be called out of a BPF prog since prevented by verifier
> nor under RCU reader side when updating this type of map from syscall path.
> Jesper, any concrete reason we still need GFP_ATOMIC here?

Allocations in cpumap (related to ptr_ring) can only be initiated from
userspace via the bpf syscall. Thus, there isn't any reason for
GFP_ATOMIC here.
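
A minimal sketch of Jason's suggestion above (hypothetical, not the
actual fix that was merged): only let __ptr_ring_init_queue_alloc() try
kvmalloc_array() when the gfp mask allows direct reclaim (i.e. the
caller may sleep), and fall back to a plain kcalloc() otherwise, since
kvmalloc()'s vmalloc path cannot honor GFP_ATOMIC:

static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp)
{
	/* kvmalloc() may only take the vmalloc path when the caller can sleep */
	if (gfp & __GFP_DIRECT_RECLAIM)
		return kvmalloc_array(size, sizeof(void *), gfp | __GFP_ZERO);
	/* atomic context: stay on the plain (zeroed) kmalloc path */
	return kcalloc(size, sizeof(void *), gfp);
}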

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH 0/2] rcu: Transform kfree_rcu() into kvfree_rcu()

2018-02-07 Thread Jesper Dangaard Brouer
On Wed, 7 Feb 2018 08:57:00 -0500
Steven Rostedt <rost...@goodmis.org> wrote:

> On Wed, 7 Feb 2018 00:31:04 -0800
> "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> 
> > I see problems.  We would then have two different names for exactly the
> > same thing.
> > 
> > Seems like it would be a lot easier to simply document the existing
> > kfree_rcu() behavior, especially given that it apparently already works.
> > The really doesn't seem to me to be worth a name change.  
> 
> Honestly, I don't believe this is an RCU sub-system decision. This is a
> memory management decision.
> 
> If we have kmalloc(), vmalloc(), kfree(), vfree() and kvfree(), and we
> want kmalloc() to be freed with kfree(), and vmalloc() to be freed with
> vfree(), and for strange reasons, we don't know how the data was
> allocated we have kvfree(). That's an mm decision not an rcu one. We
> should have kfree_rcu(), vfree_rcu() and kvfree_rcu(), and honestly,
> they should not depend on kvfree() doing the same thing for everything.
> Each should call the corresponding member that they represent. Which
> would change this patch set.
> 
> Why? Too much coupling between RCU and MM. What if in the future
> something changes and kvfree() goes away or changes drastically. We
> would then have to go through all the users of RCU to change them too.
> 
> To me kvfree() is a special case and should not be used by RCU as a
> generic function. That would make RCU and MM much more coupled than
> necessary.

For the record, I fully agree with Steve here. 

And being a performance "fanatic" I don't like to have the extra branch
(and compares) in the free code path... but it's a MM-decision (and
sometimes you should not listen to "fanatics" ;-))

void kvfree(const void *addr)
{
	if (is_vmalloc_addr(addr))
		vfree(addr);
	else
		kfree(addr);
}
EXPORT_SYMBOL(kvfree);

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


static inline bool is_vmalloc_addr(const void *x)
{
#ifdef CONFIG_MMU
	unsigned long addr = (unsigned long)x;

	return addr >= VMALLOC_START && addr < VMALLOC_END;
#else
	return false;
#endif
}
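
As a hypothetical illustration of Steve's point (not code from this
patchset): without a kvfree_rcu()-style helper, a caller that allocated
with kvmalloc() has to open-code the RCU-deferred free around kvfree(),
roughly like this:

struct foo {
	struct rcu_head rcu;
	/* ... payload ... */
};

static void foo_free_rcu(struct rcu_head *head)
{
	/* deferred free; kvfree() picks vfree() or kfree() as shown above */
	kvfree(container_of(head, struct foo, rcu));
}

static void foo_release(struct foo *f)
{
	call_rcu(&f->rcu, foo_free_rcu);
}

A kvfree_rcu()/vfree_rcu() helper would essentially hide this
boilerplate for the caller.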


Re: [PATCH] samples/bpf: Add program for CPU state statistics

2018-01-31 Thread Jesper Dangaard Brouer
On Wed, 31 Jan 2018 02:29:59 +0800
Leo Yan <leo@linaro.org> wrote:

> CPU 0
> State: Duration(ms)  Distribution
> cstate 0 : 47555|*   |
> cstate 1 : 0||
> cstate 2 : 0||
> pstate 0 : 15239|*   |
> pstate 1 : 1521 ||
> pstate 2 : 3188 |*   |
> pstate 3 : 1836 ||
> pstate 4 : 94   ||
> 
> CPU 1
> State: Duration(ms)  Distribution
> cstate 0 : 87   ||
> cstate 1 : 16264|**  |
> cstate 2 : 50458|*** |
> pstate 0 : 832  ||
> pstate 1 : 131  ||
> pstate 2 : 825  ||
> pstate 3 : 787  ||
> pstate 4 : 4||
> 
> CPU 2
> State: Duration(ms)  Distribution
> cstate 0 : 177  ||
> cstate 1 : 9363 |*   |
> cstate 2 : 55835|*** |
> pstate 0 : 1468 ||
> pstate 1 : 350  ||
> pstate 2 : 1062 ||
> pstate 3 : 1164 ||
> pstate 4 : 7||

The output gets very long as the number of CPUs grows...
What about using the following output format instead:

state(ms)   cstate-0  cstate-1  cstate-2  pstate-0  pstate-1  pstate-2  pstate-3  pstate-4
CPU-0         47,555         0         0    15,239     1,521     3,188     1,836        94
CPU-1             87    16,264    50,458       832       131       825       787         4
CPU-2            177     9,363    55,835     1,468       350     1,062     1,164         7

Look at the code in samples/bpf/xdp_redirect_cpu_user.c for an example of
how to align the columns; the trick to get printf to pretty-print with
thousands separators is to use %' and setlocale(LC_NUMERIC, "en_US").
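
A small, self-contained illustration of that printf trick (a
hypothetical example program, not part of the sample code):

#include <locale.h>
#include <stdio.h>

int main(void)
{
	/* enable the %' (thousands-grouping) printf flag */
	setlocale(LC_NUMERIC, "en_US");

	/* left-align the label, right-align the grouped numbers */
	printf("%-10s %'10llu %'10llu %'10llu\n",
	       "CPU-0", 47555ULL, 0ULL, 0ULL);
	return 0;
}

This prints roughly "CPU-0          47,555          0          0" when
the en_US locale is installed; without it, the numbers simply come out
ungrouped.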


P.S. net-next and bpf-next are closed at the moment.
 Next time you submit, read [1] and [2], especially how to indicate which
tree (bpf vs. bpf-next) the patch is against, as this helps the
workflow of the maintainers.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/Documentation/bpf/bpf_devel_QA.txt
[2] Documentation/networking/netdev-FAQ.txt 
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


[bpf-next PATCH 2/3] libbpf: cleanup Makefile, remove unused elements

2018-01-16 Thread Jesper Dangaard Brouer
The plugin_dir_SQ variable is not used, remove it.
The function update_dir is also unused, remove it.
The variable $VERSION_FILES is empty, remove it.

These all originate from the introduction of the Makefile, and are likely
a copy-paste from tools/lib/traceevent/Makefile.

Fixes: 1b76c13e4b36 ("bpf tools: Introduce 'bpf' library and add bpf feature check")
Signed-off-by: Jesper Dangaard Brouer <bro...@redhat.com>
---
 tools/lib/bpf/Makefile |   15 ++-
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
index 54370654c708..8e15e48cb8f8 100644
--- a/tools/lib/bpf/Makefile
+++ b/tools/lib/bpf/Makefile
@@ -93,7 +93,6 @@ export prefix libdir src obj
 # Shell quotes
 libdir_SQ = $(subst ','\'',$(libdir))
 libdir_relative_SQ = $(subst ','\'',$(libdir_relative))
-plugin_dir_SQ = $(subst ','\'',$(plugin_dir))
 
 LIB_FILE = libbpf.a libbpf.so
 
@@ -150,7 +149,7 @@ CMD_TARGETS = $(LIB_FILE)
 
 TARGETS = $(CMD_TARGETS)
 
-all: fixdep $(VERSION_FILES) all_cmd
+all: fixdep all_cmd
 
 all_cmd: $(CMD_TARGETS)
 
@@ -169,16 +168,6 @@ $(OUTPUT)libbpf.so: $(BPF_IN)
 $(OUTPUT)libbpf.a: $(BPF_IN)
$(QUIET_LINK)$(RM) $@; $(AR) rcs $@ $^
 
-define update_dir
-  (echo $1 > $@.tmp;   \
-   if [ -r $@ ] && cmp -s $@ $@.tmp; then  \
- rm -f $@.tmp; \
-   else\
- echo '  UPDATE $@';   \
- mv -f $@.tmp $@;  \
-   fi);
-endef
-
 define do_install
if [ ! -d '$(DESTDIR_SQ)$2' ]; then \
$(INSTALL) -d -m 755 '$(DESTDIR_SQ)$2'; \
@@ -204,7 +193,7 @@ config-clean:
$(Q)$(MAKE) -C $(srctree)/tools/build/feature/ clean >/dev/null
 
 clean:
-   $(call QUIET_CLEAN, libbpf) $(RM) *.o *~ $(TARGETS) *.a *.so $(VERSION_FILES) .*.d .*.cmd \
+   $(call QUIET_CLEAN, libbpf) $(RM) *.o *~ $(TARGETS) *.a *.so .*.d .*.cmd \
$(RM) LIBBPF-CFLAGS
$(call QUIET_CLEAN, core-gen) $(RM) $(OUTPUT)FEATURE-DUMP.libbpf
 



[bpf-next PATCH 0/3] libbpf: cleanups to Makefile

2018-01-16 Thread Jesper Dangaard Brouer
This patchset contains some small improvements and cleanup for
the Makefile in tools/lib/bpf/.

It worries me that the libbpf.so shared library is not versioned, but
that is not addressed in this patchset.

---

Jesper Dangaard Brouer (3):
  libbpf: install the header file libbpf.h
  libbpf: cleanup Makefile, remove unused elements
  libbpf: Makefile set specified permission mode


 tools/lib/bpf/Makefile |   20 +---
 1 file changed, 5 insertions(+), 15 deletions(-)

--


[bpf-next PATCH 3/3] libbpf: Makefile set specified permission mode

2018-01-16 Thread Jesper Dangaard Brouer
The third parameter to do_install was not passed to the $(INSTALL) command.
Fix this by only setting the -m option when the third parameter is supplied.

The use of a third parameter was introduced in commit eb54e522a000 ("bpf:
install libbpf headers on 'make install'").

Without this change, the header files are installed as executable files (755).

Fixes: eb54e522a000 ("bpf: install libbpf headers on 'make install'")
Signed-off-by: Jesper Dangaard Brouer <bro...@redhat.com>
---
 tools/lib/bpf/Makefile |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
index 8e15e48cb8f8..83714ca1f22b 100644
--- a/tools/lib/bpf/Makefile
+++ b/tools/lib/bpf/Makefile
@@ -172,7 +172,7 @@ define do_install
if [ ! -d '$(DESTDIR_SQ)$2' ]; then \
$(INSTALL) -d -m 755 '$(DESTDIR_SQ)$2'; \
fi; \
-   $(INSTALL) $1 '$(DESTDIR_SQ)$2'
+   $(INSTALL) $1 $(if $3,-m $3,) '$(DESTDIR_SQ)$2'
 endef
 
 install_lib: all_cmd



[bpf-next PATCH 1/3] libbpf: install the header file libbpf.h

2018-01-16 Thread Jesper Dangaard Brouer
It seems like an oversight not to install the header file for libbpf,
given the libbpf.so + libbpf.a files are installed.

Signed-off-by: Jesper Dangaard Brouer <bro...@redhat.com>
---
 tools/lib/bpf/Makefile |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
index 8ed43ae9db9b..54370654c708 100644
--- a/tools/lib/bpf/Makefile
+++ b/tools/lib/bpf/Makefile
@@ -192,7 +192,8 @@ install_lib: all_cmd
 
 install_headers:
$(call QUIET_INSTALL, headers) \
-   $(call do_install,bpf.h,$(prefix)/include/bpf,644)
+   $(call do_install,bpf.h,$(prefix)/include/bpf,644); \
+   $(call do_install,libbpf.h,$(prefix)/include/bpf,644);
 
 install: install_lib
 



Re: dvb usb issues since kernel 4.9

2018-01-10 Thread Jesper Dangaard Brouer

On Tue, 9 Jan 2018 10:58:30 -0800 Linus Torvalds 
<torva...@linux-foundation.org> wrote:

> So I really think "you can use up 90% of CPU time with a UDP packet
> flood from the same network" is very very very different - and
> honestly not at all as important - as "you want to be able to use a
> USB DVB receiver and watch/record TV".
> 
> Because that whole "UDP packet flood from the same network" really is
> something you _fundamentally_ have other mitigations for.
> 
> I bet that whole commit was introduced because of a benchmark test,
> rather than real life. No?

I believe this has happened in real life, in the form of DNS servers
not being able to recover after a long outage, where DNS TTLs had
expired, causing legitimate traffic to overload the DNS servers. The
goodput (answers/sec) from the DNS servers was too low when bringing
them online again. (Based on a talk over beer at NetDevConf with a guy
claiming they ran DNS for AWS.)


The commit 4cd13c21b207 ("softirq: Let ksoftirqd do its job") tries to
address a fundamental problem that the network stack has when
interacting with softirq in overload situations.
(Maybe we can come up with a better solution?)

Before this commit, when the application ran on the same CPU as the
softirq processing, the kernel had a bad "drop off a cliff" behavior
when pushed above the saturation point.

This is confirmed in CloudFlare blogpost[1], which used a kernel that
predates this commit. From[1] section: "A note on NUMA performance"
Quote:"
  1. Run receiver on another CPU, but on the same NUMA node as the RX
 queue. The performance as we saw above is around 360kpps.

  2. With receiver on exactly same CPU as the RX queue we can get up to
 ~430kpps. But it creates high variability. The performance drops down
 to zero if the NIC is overwhelmed with packets."

The behavior problem here is "performance drops down to zero if the NIC
is overwhelmed with packets".  That is a bad way to handle overload.
Not only when attacked, but also when bringing a service online after
an outage.

What essentially happens is that:
 1. softirq NAPI enqueues 64 packets into the socket.
 2. the application dequeues 1 packet and invokes local_bh_enable()
 3. which causes softirq to run in the app's timeslice, enqueueing another 64 packets
 4. the app only sees a goodput of 1/128 of the packets

That is essentially what Eric solved with his commit: avoiding that (3)
local_bh_enable() invokes softirq processing when ksoftirqd is already
running.
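
A rough sketch of the idea in that commit (simplified, not the exact
upstream diff): the softirq entry points check whether ksoftirqd is
already running on this CPU and, if so, leave the pending work to it
instead of processing softirqs inline:

/* kernel/softirq.c, sketch of the logic added by 4cd13c21b207 */
static bool ksoftirqd_running(void)
{
	struct task_struct *tsk = __this_cpu_read(ksoftirqd);

	return tsk && (tsk->state == TASK_RUNNING);
}

static inline void invoke_softirq(void)
{
	if (ksoftirqd_running())
		return;		/* leave the pending softirqs to ksoftirqd */
	do_softirq();		/* ... otherwise process them inline */
}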

Maybe we can come up with a better solution?
(as I do agree this was too big a hammer, affecting other use-cases)


[1] https://blog.cloudflare.com/how-to-receive-a-million-packets/

p.s. Regarding quote[1] point "1.", after Paolo Abeni optimized the UDP
code, that statement is no longer true.  It is now (significantly) faster
to run/pin your UDP application to another CPU than the RX-CPU.
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


[net-next PATCH] net: fix xdp_rxq_info build issue when CONFIG_SYSFS is not set

2018-01-09 Thread Jesper Dangaard Brouer
The commit e817f85652c1 ("xdp: generic XDP handling of xdp_rxq_info")
removed some ifdef CONFIG_SYSFS in net/core/dev.c, but forgot to
remove the corresponding ifdef's in include/linux/netdevice.h.

Fixes: e817f85652c1 ("xdp: generic XDP handling of xdp_rxq_info")
Reported-by: Guenter Roeck <li...@roeck-us.net>
Signed-off-by: Jesper Dangaard Brouer <bro...@redhat.com>
---
 include/linux/netdevice.h |3 ---
 1 file changed, 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 440b000f07f4..b308793c64ce 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1805,12 +1805,9 @@ struct net_device {
/* Interface address info used in eth_type_trans() */
unsigned char   *dev_addr;
 
-#ifdef CONFIG_SYSFS
struct netdev_rx_queue  *_rx;
-
unsigned intnum_rx_queues;
unsigned intreal_num_rx_queues;
-#endif
 
struct bpf_prog __rcu   *xdp_prog;
unsigned long   gro_flush_timeout;



Re: Build failure in -next due to 'xdp: generic XDP handling of xdp_rxq_info'

2018-01-09 Thread Jesper Dangaard Brouer
On Tue, 9 Jan 2018 11:01:47 -0800
Guenter Roeck <li...@roeck-us.net> wrote:

> Hi,
> 
> commit e817f85652c ("xdp: generic XDP handling of xdp_rxq_info") results in
> the following error when building m68k:m5208evb_defconfig in -next.
> 
> net/core/dev.c: In function 'netif_get_rxqueue':
> net/core/dev.c:3926:15: error: 'struct net_device' has no member named '_rx'
> net/core/dev.c:3931:28: error:
>   'struct net_device' has no member named 'real_num_rx_queues'
> 
> net/core/dev.c: In function 'netif_alloc_rx_queues':
> net/core/dev.c:7633:29: error:
>   'struct net_device' has no member named 'num_rx_queues'
> 
> [ and so on ]

Hi Guenter,

This is caused by CONFIG_SYSFS not being set in your config.  I'm
preparing a patch... can I ask you to test it, as I have a hard time
disabling CONFIG_SYSFS on my x86_64 system.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: dvb usb issues since kernel 4.9

2018-01-09 Thread Jesper Dangaard Brouer

On Tue, 9 Jan 2018 15:42:35 -0200 Mauro Carvalho Chehab 
<mche...@s-opensource.com> wrote:
> Em Mon, 8 Jan 2018 11:51:04 -0800 Linus Torvalds 
> <torva...@linux-foundation.org> escreveu:
> 
[...]
> Patch makes sense to me, although I was not able to test it myself.
 
The patch also makes sense to me.  I've done some basic testing with it
on my high-end Broadwell system (that I use for 100Gbit/s testing). As
expected, the network overload case still works, as NET_RX_SOFTIRQ is
not matched.

> I set a RPi3 machine here with vanilla Kernel 4.14.11 running a
> standard raspbian distribution (with elevator=deadline).

I found a Raspberry Pi Model B+ (I think, BCM2835), that I loaded the
LibreELEC distro on.  One of the guys even created an image for me with
a specific kernel[1] (that I just upgraded the system with).

[1] 
https://forum.libreelec.tv/thread/4235-dvb-issue-since-le-switched-to-kernel-4-9-x/?postID=77031#post77031
 
> My plan is to do more tests along this week, and try to tweak a little
> bit both userspace and kernelspace, in order to see if I can get
> better results.
 
I've previously experienced that you can be affected by the scheduler
granularity, which is adjustable (with CONFIG_SCHED_DEBUG=y):

 $ grep -H . /proc/sys/kernel/sched_*_granularity_ns
 /proc/sys/kernel/sched_min_granularity_ns:2250000
 /proc/sys/kernel/sched_wakeup_granularity_ns:3000000

The above numbers were confirmed on the RPi2 (see[2]). With commit
4cd13c21b207 ("softirq: Let ksoftirqd do its job"), I expect/assume that
softirq processing latency is bounded by the sched_wakeup_granularity_ns,
which with 3 ms is not good enough for their use-case.

Thus, if you manage to reproduce the case, try to see if adjusting this
can mitigate the issue...


Their system runs a non-preempt kernel; should they use PREEMPT?

 LibreELEC:~ # uname -a
 Linux LibreELEC 4.14.10 #1 SMP Tue Jan 9 17:35:03 GMT 2018 armv7l GNU/Linux

[2] 
https://forum.libreelec.tv/thread/4235-dvb-issue-since-le-switched-to-kernel-4-9-x/?postID=76999#post76999
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



Re: dvb usb issues since kernel 4.9

2018-01-08 Thread Jesper Dangaard Brouer
On Mon, 8 Jan 2018 22:44:27 +0100
Peter Zijlstra <pet...@infradead.org> wrote:

> On Mon, Jan 08, 2018 at 10:31:09PM +0100, Jesper Dangaard Brouer wrote:
> > I did expected the issue to get worse, when you load the Pi with
> > network traffic, as now the softirq time-budget have to be shared
> > between networking and USB/DVB. Thus, I guess you are running TCP and
> > USB/mpeg2ts on the same CPU (why when you have 4 CPUs?...)  
> 
> Isn't networking also over USB on the Pi ?

Darn, that is true. Looking at the dmesg output in http://ix.io/DOg:

[0.405942] usbcore: registered new interface driver smsc95xx
[5.821104] smsc95xx 1-1.1:1.0 eth0: link up, 100Mbps, full-duplex, lpa 0x45E1

I don't know enough about USB... is it possible to control which CPU
handles the individual USB ports, or on some other level (than ports)?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: dvb usb issues since kernel 4.9

2018-01-08 Thread Jesper Dangaard Brouer

On Mon, 8 Jan 2018 17:26:10 +0100
"Josef Griebichler" <griebichler.jo...@gmx.at> wrote:

> I tried your mentioned patch but unfortunately no real improvement for me.
> dmesg http://ix.io/DOg
> tvheadend service log http://ix.io/DOi
>
> Errors during recording are still there.

Are you _also_ recording the stream on the Raspberry Pi?

It seems to me that you are expecting too much from this small device.

> Errors increase if there is additional tcp load on raspberry.

I did expect the issue to get worse when you load the Pi with network
traffic, as now the softirq time-budget has to be shared between
networking and USB/DVB. Thus, I guess you are running TCP and
USB/mpeg2ts on the same CPU (why, when you have 4 CPUs?...)

If you expect/want to get stable performance out of such a small box,
then you (or LibreELEC) need to tune the box for this usage.  And it
does not have to be that complicated.  The first step is to move IRQ
handling for the NIC to another CPU, and then the USB port handling the
DVB signal (/proc/irq/*/smp_affinity_list).  Then pin the userspace
process (taskset) to another CPU than the one handling USB-softirq.

> Unfortunately there's no usbmon or tshark on libreelec so I can't
> provide further logs.

Do you have perf or trace-cmd on the box?  Maybe we could come up with
some kernel functions to trace, to measure/show the latency spikes?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: Re: dvb usb issues since kernel 4.9

2018-01-08 Thread Jesper Dangaard Brouer


On Mon, 8 Jan 2018 12:35:08 -0500 (EST) Alan Stern <st...@rowland.harvard.edu> 
wrote:

> On Mon, 8 Jan 2018, Josef Griebichler wrote:
> 
> > No I can't sorry. There's no sat connection near to my workstation.  
> 
> Can we ask the person who made this post:
> https://forum.libreelec.tv/thread/4235-dvb-issue-since-le-switched-to-kernel-4-9-x/?postID=75965#post75965
> 
> to run the test?  The post says that the testing was done on an x86_64 
> machine.

More than 5 years ago I used to play a lot with IPTV multicast MPEG2-TS
streams (I implemented the wireshark mp2ts drop detection, and an
out-of-tree netfilter kernel module to detect drops[1]). The web-site
is dead, but archive.org has a copy[2].

Let me quote my own Lab-setup documentation[3].

You don't need a live IPTV MPEG2TS signal, you can simply generate your
own using VLC:

 $ vlc ~/Videos/test_video.mkv -I rc --sout 
'#duplicate{dst=std{access=udp,mux=ts,dst=239.254.1.1:5500}}'

Viewing your own signal: You can view your own generated signal, again,
by using VLC.

 $ vlc udp/ts://@239.254.1.1:5500

I hope the vlc syntax is still valid.  And remember to join the
multicast channels, if you don't have an application requesting the
stream, as described in [4].


[1] https://github.com/netoptimizer/IPTV-Analyzer
[2] 
http://web.archive.org/web/20150328200122/http://www.iptv-analyzer.org:80/wiki/index.php/Main_Page
[3] 
http://web.archive.org/web/20150329095538/http://www.iptv-analyzer.org:80/wiki/index.php/Lab_Setup
[4] 
http://web.archive.org/web/20150328234459/http://www.iptv-analyzer.org:80/wiki/index.php/Multicast_Signal_on_Linux
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: dvb usb issues since kernel 4.9

2018-01-08 Thread Jesper Dangaard Brouer
On Mon, 8 Jan 2018 08:02:00 -0200
Mauro Carvalho Chehab <mche...@s-opensource.com> wrote:

> Hi Linus,
> 
> Em Sun, 7 Jan 2018 13:23:39 -0800
> Linus Torvalds <torva...@linux-foundation.org> escreveu:
> 
> > On Sat, Jan 6, 2018 at 11:54 AM, Mauro Carvalho Chehab
> > <mche...@s-opensource.com> wrote:  
> > >
> > > Em Sat, 6 Jan 2018 16:04:16 +0100
> > > "Josef Griebichler" <griebichler.jo...@gmx.at> escreveu:
> > >>
> > >> the causing commit has been identified.
> > >> After reverting commit 
> > >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4cd13c21b207e80ddb1144c576500098f2d5f882
> > >> its working again.
> > >
> > > Just replying to me won't magically fix this. The ones that were involved 
> > > on
> > > this patch should also be c/c, plus USB people. Just added them.
> > 
> > Actually, you seem to have added an odd subset of the people involved.
> > 
> > For example, Ingo - who actually committed that patch - wasn't on the cc.  
> 
> Sorry, my fault. I forgot to add him to it.
> 
> > I do think we need to simply revert that patch. It's very simple: it
> > has been reported to lead to actual problems for people, and we don't
> > fix one problem and then say "well, it fixed something else" when
> > something breaks.
> > 
> > When something breaks, we either unbreak it, or we revert the change
> > that caused the breakage.
> > 
> > It's really that simple. That's what "no regressions" means.  We don't
> > accept changes that cause regressions. This one did.  
> 
> Yeah, we should either unbreak or revert it. In the specific case of
> media devices, Alan came with a proposal of increasing the number of
> buffers. This is an one line change, and increase a capture delay from
> 0.63 ms to 5 ms on this specific case (Digital TV) shouldn't make much
> harm. So, I guess it would worth trying it before reverting the patch.

Let's find the root cause of this before reverting, as reverting will
hurt the networking use-case.

I want to see if the increased buffer will solve the issue (the current
buffer of 0.63 ms seems too small).

I would also like to see experiments with adjusting the sched priority
of the kthreads and/or the userspace prog (e.g. use a command like
'sudo chrt --fifo -p 10 $(pgrep udp_sink)').


Are we really sure that the regression is caused by 4cd13c21b207
("softirq: Let ksoftirqd do its job")? The forum thread also reports
that the problem is almost gone after commit 34f41c0316ed ("timers: Fix
overflow in get_next_timer_interrupt")
 https://git.kernel.org/torvalds/c/34f41c0316ed

It makes me suspicious that this fix changes things...
After this fix, I suspect that changing the sched priorities will fix
the remaining glitches.


> It is hard to foresee the consequences of the softirq changes for other
> devices, though.

Yes, it is hard to foresee; I can only cover networking.

For networking, if we revert this, we will (again) open the kernel to
an easy DDoS vector with UDP packets.  As mentioned in the commit desc,
before this you could easily cause softirq to take all the CPU time from
the application, resulting in very low "good-put" in the UDP app. (That's
why it was so easy to DDoS DNS servers before...)

With the softirqd patch in place, ksoftirqd is scheduled fairly with
other applications running on the same CPU.  But in some cases this is
not what you want, so as the commit also mentions, the admin can now
more easily tune process scheduling parameters if needed, to adjust for
such use-cases (it was not really an admin choice before).


> For example, we didn't have any reports about this issue affecting cameras,
> Most cameras use ISOC nowadays, but some only provide bulk transfers.
> We usually try to use the minimum number of buffers possible, as
> increasing latency on cameras can be very annoying, specially on
> videoconference applications.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net-next 2/2] tuntap: XDP transmission

2017-12-29 Thread Jesper Dangaard Brouer
On Fri, 29 Dec 2017 18:00:04 +0800
Jason Wang <jasow...@redhat.com> wrote:

> This patch implements XDP transmission for TAP. Since we can't create
> new queues for TAP during XDP set, exist ptr_ring was reused for
> queuing XDP buffers. To differ xdp_buff from sk_buff, TUN_XDP_FLAG
> (0x1ULL) was encoded into lowest bit of xpd_buff pointer during
> ptr_ring_produce, and was decoded during consuming. XDP metadata was
> stored in the headroom of the packet which should work in most of
> cases since driver usually reserve enough headroom. Very minor changes
> were done for vhost_net: it just need to peek the length depends on
> the type of pointer.
> 
> Tests was done on two Intel E5-2630 2.40GHz machines connected back to
> back through two 82599ES. Traffic were generated through MoonGen and
> testpmd(rxonly) in guest reports 2.97Mpps when xdp_redirect_map is
> doing redirection from ixgbe to TAP.

IMHO a performance measurement without something to compare against is
useless.  What was the performance before?


> Cc: Jesper Dangaard Brouer <bro...@redhat.com>
> Signed-off-by: Jason Wang <jasow...@redhat.com>
> ---
>  drivers/net/tun.c  | 205 
> -
>  drivers/vhost/net.c|  13 +++-
>  include/linux/if_tun.h |  17 
>  3 files changed, 197 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 2c89efe..be6d993 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -240,6 +240,24 @@ struct tun_struct {
>   struct tun_steering_prog __rcu *steering_prog;
>  };
>  
> +bool tun_is_xdp_buff(void *ptr)
> +{
> + return (unsigned long)ptr & TUN_XDP_FLAG;
> +}
> +EXPORT_SYMBOL(tun_is_xdp_buff);
> +
> +void *tun_xdp_to_ptr(void *ptr)
> +{
> + return (void *)((unsigned long)ptr | TUN_XDP_FLAG);
> +}
> +EXPORT_SYMBOL(tun_xdp_to_ptr);
> +
> +void *tun_ptr_to_xdp(void *ptr)
> +{
> + return (void *)((unsigned long)ptr & ~TUN_XDP_FLAG);
> +}
> +EXPORT_SYMBOL(tun_ptr_to_xdp);
> +
>  static int tun_napi_receive(struct napi_struct *napi, int budget)
>  {
>   struct tun_file *tfile = container_of(napi, struct tun_file, napi);
> @@ -630,12 +648,25 @@ static struct tun_struct *tun_enable_queue(struct 
> tun_file *tfile)
>   return tun;
>  }
>  
> +static void tun_ptr_free(void *ptr)
> +{
> + if (!ptr)
> + return;
> + if (tun_is_xdp_buff(ptr)) {
> + struct xdp_buff *xdp = tun_ptr_to_xdp(ptr);
> +
> + put_page(virt_to_head_page(xdp->data));

(Yet another XDP-free call point, I need to convert to use an accessor
later to transition driver into an xdp_buff return API)

> + } else {
> + __skb_array_destroy_skb(ptr);
> + }
> +}
> +
>  static void tun_queue_purge(struct tun_file *tfile)
>  {
> - struct sk_buff *skb;
> + void *ptr;
>  
> - while ((skb = ptr_ring_consume(>tx_ring)) != NULL)
> - kfree_skb(skb);
> + while ((ptr = ptr_ring_consume(>tx_ring)) != NULL)
> + tun_ptr_free(ptr);
>  
>   skb_queue_purge(>sk.sk_write_queue);
>   skb_queue_purge(>sk.sk_error_queue);
> @@ -688,8 +719,7 @@ static void __tun_detach(struct tun_file *tfile, bool 
> clean)
>   unregister_netdevice(tun->dev);
>   }
>   if (tun)
> - ptr_ring_cleanup(>tx_ring,
> -  __skb_array_destroy_skb);
> + ptr_ring_cleanup(>tx_ring, tun_ptr_free);
>   sock_put(>sk);
>   }
>  }
> @@ -1201,6 +1231,54 @@ static const struct net_device_ops tun_netdev_ops = {
>   .ndo_get_stats64= tun_net_get_stats64,
>  };
>  
> +static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
> +{
> + struct tun_struct *tun = netdev_priv(dev);
> + struct xdp_buff *buff = xdp->data_hard_start;
> + int headroom = xdp->data - xdp->data_hard_start;
> + struct tun_file *tfile;
> + u32 numqueues;
> + int ret = 0;
> +
> + /* Assure headroom is available and buff is properly aligned */
> + if (unlikely(headroom < sizeof(*xdp) || tun_is_xdp_buff(xdp)))
> + return -ENOSPC;
> +
> + *buff = *xdp;
> +
> + rcu_read_lock();
> +
> + numqueues = READ_ONCE(tun->numqueues);
> + if (!numqueues) {
> + ret = -ENOSPC;
> + goto out;
> + }
> + tfile = rcu_dereference(tun->tfiles[smp_processor_id() %
> + numqueues]);

Several concurrent CPUs can get the same 'tfile'.


Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-20 Thread Jesper Dangaard Brouer
On Tue, 19 Dec 2017 18:56:51 -0600 (CST)
Christopher Lameter <c...@linux.com> wrote:

> On Tue, 19 Dec 2017, Rao Shoaib wrote:
> 
> > > > mm/slab_common.c  
> > > It would be great to have separate patches so that we can review it
> > > properly:
> > >
> > > 1. Move the code into slab_common.c
> > > 2. The actual code changes to the kfree rcu mechanism
> > > 3. The whitespace changes  
> 
> > I can certainly break down the patch and submit smaller patches as you have
> > suggested.
> >
> > BTW -- This is my first ever patch to Linux, so I am still learning the
> > etiquette.  
> 
> You are doing great. Keep at improving the patches and we will get your
> changes into the kernel source. If you want to sent your first patchset
> then a tool like "quilt" or "git" might be helpful.

When working with patchsets (multiple separate patches, as requested
here), I personally prefer using the tool called Stacked Git[1] (StGit)
command line 'stg', as it allows me to easily adjust patches in the
middle of the patchset "stack".  It is similar to quilt, just based on
git itself.

I guess most people on this list use 'git rebase --interactive' when
updating their patchsets (?)

[1] http://procode.org/stgit/doc/tutorial.html
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Jesper Dangaard Brouer

On Tue, 19 Dec 2017 13:20:43 -0800 Rao Shoaib <rao.sho...@oracle.com> wrote:

> On 12/19/2017 12:41 PM, Jesper Dangaard Brouer wrote:
> > On Tue, 19 Dec 2017 09:52:27 -0800 rao.sho...@oracle.com wrote:
> >  
> >> +/* Main RCU function that is called to free RCU structures */
> >> +static void
> >> +__rcu_bulk_free(struct rcu_head *head, rcu_callback_t func, int cpu, bool 
> >> lazy)
> >> +{
> >> +  unsigned long offset;
> >> +  void *ptr;
> >> +  struct rcu_bulk_free *rbf;
> >> +  struct rcu_bulk_free_container *rbfc = NULL;
> >> +
> >> +  rbf = this_cpu_ptr(_rbf);
> >> +
> >> +  if (unlikely(!rbf->rbf_init)) {
> >> +  spin_lock_init(>rbf_lock);
> >> +  rbf->rbf_cpu = smp_processor_id();
> >> +  rbf->rbf_init = true;
> >> +  }
> >> +
> >> +  /* hold lock to protect against other cpu's */
> >> +  spin_lock_bh(>rbf_lock);  
> >
> > I'm not sure this will be faster.  Having to take a cross CPU lock here
> > (+ BH-disable) could cause scaling issues.   Hopefully this lock will
> > not be used intensively by other CPUs, right?
> >
[...]
> 
> As Paul has pointed out the lock is a per-cpu lock, the only reason for 
> another CPU to access this lock is if the rcu callbacks run on a 
> different CPU and there is nothing the code can do to avoid that but 
> that should be rare anyways.

(loop in Paul's comment)
On Tue, 19 Dec 2017 12:56:29 -0800
"Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:

> Isn't this lock in a per-CPU object?  It -might- go cross-CPU in response
> to CPU-hotplug operations, but that should be rare.

Point taken.  If this lock is very unlikely to be taken on another CPU
then I withdraw my performance concerns (the cacheline can hopefully
stay in Modified(M) state on this CPU, and all other CPUs will have it
in Invalid(I) state, based on the MESI cache coherence protocol view[1]).

The lock's atomic operation does have some overhead; _later_ we could
get fancy and use a seqlock (include/linux/seqlock.h) to remove that.

[1] https://en.wikipedia.org/wiki/MESI_protocol
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Jesper Dangaard Brouer
On Tue, 19 Dec 2017 16:20:51 -0800
"Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:

> On Tue, Dec 19, 2017 at 02:12:06PM -0800, Matthew Wilcox wrote:
> > On Tue, Dec 19, 2017 at 09:41:58PM +0100, Jesper Dangaard Brouer wrote:  
> > > If I had to implement this: I would choose to do the optimization in
> > > __rcu_process_callbacks() create small on-call-stack ptr-array for
> > > kfree_bulk().  I would only optimize the case that call kfree()
> > > directly.  In the while(list) loop I would defer calling
> > > __rcu_reclaim() for __is_kfree_rcu_offset(head->func), and instead add
> > > them to the ptr-array (and flush if the array is full in loop, and
> > > kfree_bulk flush after loop).
> > > 
> > > The real advantage of kfree_bulk() comes from amortizing the per kfree
> > > (behind-the-scenes) sync cost.  There is an additional benefit, because
> > > objects comes from RCU and will hit a slower path in SLUB.   The SLUB
> > > allocator is very fast for objects that gets recycled quickly (short
> > > lifetime), non-locked (cpu-local) double-cmpxchg.  But slower for
> > > longer-lived/more-outstanding objects, as this hits a slower code-path,
> > > fully locked (cross-cpu) double-cmpxchg.
> > 
> > Something like this ...  (compile tested only)

Yes, exactly.

> > Considerably less code; Rao, what do you think?  
> 
> I am sorry, but I am not at all fan of this approach.
> 
> If we are going to make this sort of change, we should do so in a way
> that allows the slab code to actually do the optimizations that might
> make this sort of thing worthwhile.  After all, if the main goal was small
> code size, the best approach is to drop kfree_bulk() and get on with life
> in the usual fashion.
> 
> I would prefer to believe that something like kfree_bulk() can help,
> and if that is the case, we should give it a chance to do things like
> group kfree_rcu() requests by destination slab and soforth, allowing
> batching optimizations that might provide more significant increases
> in performance.  Furthermore, having this in slab opens the door to
> slab taking emergency action when memory is low.

I agree with your argument. Although in the (slub) code I do handle
different destination slabs, I only do a limited look-ahead to find
objects with the same destination slab, which gives the speedup (see
build_detached_freelist).

We do have a larger and more consistent speedup potential if we add
infrastructure that allows us to pre-sort by destination slab before
invoking kfree_bulk().  In that respect, Rao's patch is a better
approach.

> But for the patch below, NAK.
> 
>   Thanx, Paul
> 
> > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > index 59c471de342a..5ac4ed077233 100644
> > --- a/kernel/rcu/rcu.h
> > +++ b/kernel/rcu/rcu.h
> > @@ -174,20 +174,19 @@ static inline void debug_rcu_head_unqueue(struct 
> > rcu_head *head)
> >  }
> >  #endif /* #else !CONFIG_DEBUG_OBJECTS_RCU_HEAD */
> > 
> > -void kfree(const void *);
> > -
> >  /*
> >   * Reclaim the specified callback, either by invoking it (non-lazy case)
> >   * or freeing it directly (lazy case).  Return true if lazy, false 
> > otherwise.
> >   */
> > -static inline bool __rcu_reclaim(const char *rn, struct rcu_head *head)
> > +static inline bool __rcu_reclaim(const char *rn, struct rcu_head *head, 
> > void **kfree,
> > +   unsigned int *idx)
> >  {
> > unsigned long offset = (unsigned long)head->func;
> > 
> > rcu_lock_acquire(_callback_map);
> > if (__is_kfree_rcu_offset(offset)) {
> > RCU_TRACE(trace_rcu_invoke_kfree_callback(rn, head, offset);)
> > -   kfree((void *)head - offset);
> > +   kfree[*idx++] = (void *)head - offset;
> > rcu_lock_release(_callback_map);
> > return true;
> > } else {
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index f9c0ca2ccf0c..7e13979b4697 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -2725,6 +2725,8 @@ static void rcu_do_batch(struct rcu_state *rsp, 
> > struct rcu_data *rdp)
> > struct rcu_head *rhp;
> > struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
> > long bl, count;
> > +   void *to_free[16];
> > +   unsigned int to_free_idx = 0;
> > 
> > /* If no callbacks are ready, just return. */
> > if (!rcu_segcblist_ready_cbs(>cblist)) {
> > @@ -2755,8 +2757,10 @@ static void rcu_do_batch(struct rcu_state *rsp, 
> > struct rcu_data *rdp)


Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Jesper Dangaard Brouer

On Tue, 19 Dec 2017 09:52:27 -0800 rao.sho...@oracle.com wrote:

> +/* Main RCU function that is called to free RCU structures */
> +static void
> +__rcu_bulk_free(struct rcu_head *head, rcu_callback_t func, int cpu, bool 
> lazy)
> +{
> + unsigned long offset;
> + void *ptr;
> + struct rcu_bulk_free *rbf;
> + struct rcu_bulk_free_container *rbfc = NULL;
> +
> + rbf = this_cpu_ptr(_rbf);
> +
> + if (unlikely(!rbf->rbf_init)) {
> + spin_lock_init(>rbf_lock);
> + rbf->rbf_cpu = smp_processor_id();
> + rbf->rbf_init = true;
> + }
> +
> + /* hold lock to protect against other cpu's */
> + spin_lock_bh(>rbf_lock);

I'm not sure this will be faster.  Having to take a cross CPU lock here
(+ BH-disable) could cause scaling issues.   Hopefully this lock will
not be used intensively by other CPUs, right?


The current cost of __call_rcu() is a local_irq_save/restore (which is
quite expensive, but doesn't cause cross CPU chatter).

Later in __rcu_process_callbacks() we have a local_irq_save/restore for
the entire list, plus a per-object cost of doing local_bh_disable/enable.
And for each object we call __rcu_reclaim(), which in some cases
directly calls kfree().


If I had to implement this: I would choose to do the optimization in
__rcu_process_callbacks() and create a small on-call-stack ptr-array for
kfree_bulk().  I would only optimize the case that calls kfree()
directly.  In the while(list) loop I would defer calling
__rcu_reclaim() for __is_kfree_rcu_offset(head->func), and instead add
those objects to the ptr-array (flushing if the array becomes full inside
the loop, plus a final kfree_bulk() flush after the loop).

The real advantage of kfree_bulk() comes from amortizing the per-kfree
(behind-the-scenes) sync cost.  There is an additional benefit, because
objects coming from RCU will hit a slower path in SLUB.   The SLUB
allocator is very fast for objects that get recycled quickly (short
lifetime), using the non-locked (cpu-local) double-cmpxchg.  It is slower
for longer-lived/more-outstanding objects, as these hit a slower
code-path, the fully locked (cross-cpu) double-cmpxchg.
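
To make the idea concrete, here is a minimal sketch (not the in-tree
rcu_do_batch()/__rcu_process_callbacks() code; the batch size and the
function name are made up for illustration) of batching the kfree_rcu()
objects into an on-stack array and flushing them via kfree_bulk():

#define KFREE_BULK_BATCH 16

static void process_callbacks_sketch(struct rcu_head *list)
{
	void *to_free[KFREE_BULK_BATCH];
	unsigned int idx = 0;
	struct rcu_head *head;

	while (list) {
		head = list;
		list = list->next;

		if (__is_kfree_rcu_offset((unsigned long)head->func)) {
			/* Defer the kfree(): remember the object start */
			to_free[idx++] = (void *)head - (unsigned long)head->func;
			if (idx == KFREE_BULK_BATCH) {
				kfree_bulk(idx, to_free); /* flush inside loop */
				idx = 0;
			}
		} else {
			head->func(head); /* ordinary RCU callback */
		}
	}
	if (idx)
		kfree_bulk(idx, to_free); /* final flush after loop */
}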

> +
> + rbfc = rbf->rbf_container;
> +
> + if (rbfc == NULL) {
> + if (rbf->rbf_cached_container == NULL) {
> + rbf->rbf_container =
> + kmalloc(sizeof(struct rcu_bulk_free_container),
> + GFP_ATOMIC);
> + rbf->rbf_container->rbfc_rbf = rbf;
> + } else {
> + rbf->rbf_container = rbf->rbf_cached_container;
> + rbf->rbf_container->rbfc_rbf = rbf;
> + cmpxchg(>rbf_cached_container,
> + rbf->rbf_cached_container, NULL);
> + }
> +
> + if (unlikely(rbf->rbf_container == NULL)) {
> +
> + /* Memory allocation failed maintain a list */
> +
> + head->func = (void *)func;
> + head->next = rbf->rbf_list_head;
> + rbf->rbf_list_head = head;
> + rbf->rbf_list_size++;
> + if (rbf->rbf_list_size == RCU_MAX_ACCUMULATE_SIZE)
> + __rcu_bulk_schedule_list(rbf);
> +
> + goto done;
> + }
> +
> + rbfc = rbf->rbf_container;
> + rbfc->rbfc_entries = 0;
> +
> + if (rbf->rbf_list_head != NULL)
> + __rcu_bulk_schedule_list(rbf);
> + }
> +
> + offset = (unsigned long)func;
> + ptr = (void *)head - offset;
> +
> + rbfc->rbfc_data[rbfc->rbfc_entries++] = ptr;
> + if (rbfc->rbfc_entries == RCU_MAX_ACCUMULATE_SIZE) {
> +
> + WRITE_ONCE(rbf->rbf_container, NULL);
> + spin_unlock_bh(>rbf_lock);
> + call_rcu(>rbfc_rcu, __rcu_bulk_free_impl);
> + return;
> + }
> +
> +done:
> + if (!rbf->rbf_monitor) {
> +
> + call_rcu(>rbf_rcu, __rcu_bulk_free_monitor);
> + rbf->rbf_monitor = true;
> + }
> +
> + spin_unlock_bh(>rbf_lock);
> +}


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH] kfree_rcu() should use the new kfree_bulk() interface for freeing rcu structures

2017-12-19 Thread Jesper Dangaard Brouer

On Tue, 19 Dec 2017 09:52:27 -0800 rao.sho...@oracle.com wrote:

> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index c8cb367..06fd12c 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define CREATE_TRACE_POINTS
>  #include 
> @@ -129,6 +130,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t 
> flags, size_t nr,
>  
>   for (i = 0; i < nr; i++) {
>   void *x = p[i] = kmem_cache_alloc(s, flags);
> +
>   if (!x) {
>   __kmem_cache_free_bulk(s, i, p);
>   return 0;
> @@ -353,6 +355,7 @@ unsigned long calculate_alignment(slab_flags_t flags,
>*/
>   if (flags & SLAB_HWCACHE_ALIGN) {
>   unsigned long ralign = cache_line_size();
> +
>   while (size <= ralign / 2)
>   ralign /= 2;
>   align = max(align, ralign);
> @@ -444,9 +447,8 @@ kmem_cache_create(const char *name, size_t size, size_t 
> align,
>   mutex_lock(_mutex);
>  
>   err = kmem_cache_sanity_check(name, size);
> - if (err) {
> + if (err)
>   goto out_unlock;
> - }
>  
>   /* Refuse requests with allocator specific flags */
>   if (flags & ~SLAB_FLAGS_PERMITTED) {
> @@ -1131,6 +1133,7 @@ EXPORT_SYMBOL(kmalloc_order);
>  void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
>  {
>   void *ret = kmalloc_order(size, flags, order);
> +
>   trace_kmalloc(_RET_IP_, ret, size, PAGE_SIZE << order, flags);
>   return ret;
>  }

Looks like you are mixing in cleanups (which should be avoided, and
instead moved to another patch).

> @@ -1483,6 +1486,197 @@ void kzfree(const void *p)
[...]
> +
> +/* processes list of rcu structures
> + * used when conatiner can not be allocated
> + */

Spelling.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH RT] mm/slub: close possible memory-leak in kmem_cache_alloc_bulk()

2017-12-13 Thread Jesper Dangaard Brouer
On Wed, 13 Dec 2017 15:05:55 +0100
Sebastian Andrzej Siewior <bige...@linutronix.de> wrote:

> Under certain circumstances we could leak elements which were moved to
> the local "to_free" list. The damage is limited since I can't find any
> users here.
> 
> Cc: stable...@vger.kernel.org
> Signed-off-by: Sebastian Andrzej Siewior <bige...@linutronix.de>
> ---
> Jesper: There are no users of kmem_cache_alloc_bulk() and kfree_bulk().
> Only kmem_cache_free_bulk() is used since it was introduced. Do you
> think that it would make sense to remove those?

I would like to keep them.

Rao Shoaib (Cc'ed) is/was working on a patchset for RCU-bulk-free that
used the kfree_bulk() API.

I plan to use kmem_cache_alloc_bulk() in the bpf-map "cpumap", for bulk
allocating SKBs during dequeue of XDP frames.  (My original bulk alloc
SKBs use-case during NAPI/softirq was never merged).
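
As a rough sketch of that cpumap idea (not actual kernel code; the batch
size, GFP flags, error handling and the visibility of skbuff_head_cache
are simplified assumptions), the dequeue side could grab a burst of
sk_buff objects in one call:

#define CPUMAP_BULK 8

static int cpumap_bulk_alloc_skbs(void **skbs)
{
	int n;

	/* skbuff_head_cache is the slab cache backing struct sk_buff */
	n = kmem_cache_alloc_bulk(skbuff_head_cache, GFP_KERNEL,
				  CPUMAP_BULK, skbs);
	if (n != CPUMAP_BULK)
		return -ENOMEM; /* bulk alloc is all-or-nothing (returns 0) */

	/* Each object still needs the normal sk_buff setup (clearing the
	 * struct and attaching the data buffer) before it can be used. */
	return 0;
}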


>  mm/slub.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index ffd2fa0f415e..9053e929ce9d 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3240,6 +3240,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t 
> flags, size_t size,
>   return i;
>  error:
>   local_irq_enable();
> + free_delayed(_free);
>   slab_post_alloc_hook(s, flags, i, p);
>   __kmem_cache_free_bulk(s, i, p);
>   return 0;

I've not seen free_delayed() before... and my cscope cannot find it...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net-next] net: thunderx: Add support for xdp redirect

2017-12-11 Thread Jesper Dangaard Brouer
xdp->data,
> +   xdp_tx->dma_addr,
> +   xdp->data_end - xdp->data);
> + if (err)
> + return -ENOMEM;
> +
> + nicvf_xdp_sq_doorbell(snic, sq, qidx);
> + return 0;
> +}
> +
> +static void nicvf_xdp_flush(struct net_device *dev)
> +{
> + return;
> +}
> +
>  static const struct net_device_ops nicvf_netdev_ops = {
>   .ndo_open   = nicvf_open,
>   .ndo_stop   = nicvf_stop,
> @@ -1775,6 +1831,8 @@ static const struct net_device_ops nicvf_netdev_ops = {
>   .ndo_fix_features   = nicvf_fix_features,
>   .ndo_set_features   = nicvf_set_features,
>   .ndo_bpf= nicvf_xdp,
> + .ndo_xdp_xmit   = nicvf_xdp_xmit,
> + .ndo_xdp_flush  = nicvf_xdp_flush,
>  };
[...]


> diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h 
> b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
> index 67d1a3230773..178ab6e8e3c5 100644
> --- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
> +++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
> @@ -11,6 +11,7 @@
>  
>  #include 
>  #include 
> +#include 
>  #include "q_struct.h"
>  
>  #define MAX_QUEUE_SET128
> @@ -92,6 +93,9 @@
>  #define RCV_FRAG_LEN  (SKB_DATA_ALIGN(DMA_BUFFER_LEN + NET_SKB_PAD) + \
>SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
>  
> +#define RCV_BUF_HEADROOM 128 /* To store dma address for XDP redirect */
> +#define XDP_HEADROOM (XDP_PACKET_HEADROOM + RCV_BUF_HEADROOM)
> +
>  #define MAX_CQES_FOR_TX  ((SND_QUEUE_LEN / 
> MIN_SQ_DESC_PER_PKT_XMIT) * \
>MAX_CQE_PER_PKT_XMIT)
>  



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH] trace/xdp: fix compile warning: ‘struct bpf_map’ declared inside parameter list

2017-11-29 Thread Jesper Dangaard Brouer
On Wed, 29 Nov 2017 13:49:01 +
David Laight <david.lai...@aculab.com> wrote:

> From: Xie XiuQi
> > Sent: 29 November 2017 08:35
> >
> > We meet this compile warning, which caused by missing bpf.h in xdp.h.
> > 
> > In file included from ./include/trace/events/xdp.h:10:0,
> >  from ./include/linux/bpf_trace.h:6,
> >  from drivers/net/ethernet/intel/i40e/i40e_txrx.c:29:
> > ./include/trace/events/xdp.h:93:17: warning: struct bpf_map declared inside 
> > parameter list will not be
> > visible outside of this definition or declaration
> > const struct bpf_map *map, u32 map_index),
> >  ^  
> ...
> > diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
> > index 4cd0f05..8989a92 100644
> > --- a/include/trace/events/xdp.h
> > +++ b/include/trace/events/xdp.h
> > @@ -8,6 +8,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include   
> 
> Isn't it just enough to add:
> struct bpf_map;
> before the first prototype instead of pulling in the entire header?

Nope, because we deref map->id.
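
For illustration (a generic sketch, not the actual tracepoint code): a
forward declaration is enough while only the pointer is passed around,
but dereferencing a member needs the full definition from <linux/bpf.h>:

struct bpf_map;				/* forward declaration */

static void takes_pointer(const struct bpf_map *map)
{
	(void)map;	/* fine: only the pointer value is used */
}

static unsigned int reads_member(const struct bpf_map *map)
{
	return map->id;	/* needs the full struct definition (linux/bpf.h) */
}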

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH] trace/xdp: fix compile warning: ‘struct bpf_map’ declared inside parameter list

2017-11-29 Thread Jesper Dangaard Brouer
On Wed, 29 Nov 2017 16:35:01 +0800
Xie XiuQi <xiexi...@huawei.com> wrote:

> We meet this compile warning, which caused by missing bpf.h in xdp.h.
> 
> In file included from ./include/trace/events/xdp.h:10:0,
>  from ./include/linux/bpf_trace.h:6,
>  from drivers/net/ethernet/intel/i40e/i40e_txrx.c:29:
> ./include/trace/events/xdp.h:93:17: warning: ‘struct bpf_map’ declared inside 
> parameter list will not be visible outside of this definition or declaration
> const struct bpf_map *map, u32 map_index),
>  ^
> ./include/linux/tracepoint.h:187:34: note: in definition of macro 
> ‘__DECLARE_TRACE’
>   static inline void trace_##name(proto)\
>   ^
> ./include/linux/tracepoint.h:352:24: note: in expansion of macro ‘PARAMS’
>   __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args),  \
> ^~
> ./include/linux/tracepoint.h:477:2: note: in expansion of macro 
> ‘DECLARE_TRACE’
>   DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
>   ^
> ./include/linux/tracepoint.h:477:22: note: in expansion of macro ‘PARAMS’
>   DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
>   ^~
> ./include/trace/events/xdp.h:89:1: note: in expansion of macro ‘DEFINE_EVENT’
>  DEFINE_EVENT(xdp_redirect_template, xdp_redirect,
>  ^~~~
> ./include/trace/events/xdp.h:90:2: note: in expansion of macro ‘TP_PROTO’
>   TP_PROTO(const struct net_device *dev,
>   ^~~~
> ./include/trace/events/xdp.h:93:17: warning: ‘struct bpf_map’ declared inside 
> parameter list will not be visible outside of this definition or declaration
> const struct bpf_map *map, u32 map_index),
>  ^
> ./include/linux/tracepoint.h:203:38: note: in definition of macro 
> ‘__DECLARE_TRACE’
>   register_trace_##name(void (*probe)(data_proto), void *data) \
>   ^~
> ./include/linux/tracepoint.h:354:4: note: in expansion of macro ‘PARAMS’
> PARAMS(void *__data, proto),   \
> ^~
> 
> Reported-by: Huang Daode <huangda...@hisilicon.com>
> Cc: Hanjun Guo <guohan...@huawei.com>
> Signed-off-by: Xie XiuQi <xiexi...@huawei.com>
> ---
>  include/trace/events/xdp.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
> index 4cd0f05..8989a92 100644
> --- a/include/trace/events/xdp.h
> +++ b/include/trace/events/xdp.h
> @@ -8,6 +8,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define __XDP_ACT_MAP(FN)\
>   FN(ABORTED) \

Strange that I'm not seeing this compile issue, and the kbuild bot also
didn't report it, but the patch looks okay to me... I guess I introduced
the issue in the "Fixes" commit below.  Can the person applying this
include the Fixes line?

Fixes: 8d3b778ff544 ("xdp: tracepoint xdp_redirect also need a map argument")

Acked-by: Jesper Dangaard Brouer <bro...@redhat.com>

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH v4 1/1] xdp: Sample xdp program implementing ip forward

2017-11-08 Thread Jesper Dangaard Brouer
On Wed, 08 Nov 2017 10:40:24 +0900 (KST)
David Miller <da...@davemloft.net> wrote:

> From: Christina Jacob <christina.jacob.koik...@gmail.com>
> Date: Sun,  5 Nov 2017 08:52:30 +0530
> 
> > From: Christina Jacob <christina.ja...@cavium.com>
> > 
> > Implements port to port forwarding with route table and arp table
> > lookup for ipv4 packets using bpf_redirect helper function and
> > lpm_trie  map.
> > 
> > Signed-off-by: Christina Jacob <christina.ja...@cavium.com>  
> 
> Applied to net-next, thank you.

I've not had time to properly test (and review) this V4 patch, but I
guess I'll have to do so when I get home from Seoul...

I especially want to measure the effect of using bpf_redirect_map().
To Christina: what performance improvement did you see on your
board/arch when switching from bpf_redirect() to bpf_redirect_map()?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH v3 1/1] xdp: Sample xdp program implementing ip forward

2017-11-02 Thread Jesper Dangaard Brouer
On Wed,  1 Nov 2017 13:18:04 +0530 Christina Jacob 
<christina.jacob.koik...@gmail.com> wrote:

> From: Christina Jacob <christina.ja...@cavium.com>
> 
> Implements port to port forwarding with route table and arp table
> lookup for ipv4 packets using bpf_redirect helper function and
> lpm_trie  map.
> Signed-off-by: Christina Jacob <christina.ja...@cavium.com>

There is usually a blank line between the description and the Signed-off-by.

> ---
>  samples/bpf/Makefile   |   4 +
>  samples/bpf/xdp_router_ipv4_kern.c | 181 ++
>  samples/bpf/xdp_router_ipv4_user.c | 657 
> +
>  3 files changed, 842 insertions(+)
> 
[...]
> diff --git a/samples/bpf/xdp_router_ipv4_kern.c 
> b/samples/bpf/xdp_router_ipv4_kern.c
> new file mode 100644
> index 000..70a5907
> --- /dev/null
> +++ b/samples/bpf/xdp_router_ipv4_kern.c
> @@ -0,0 +1,181 @@
> +/* Copyright (C) 2017 Cavium, Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of version 2 of the GNU General Public License
> + * as published by the Free Software Foundation.
> + */
[...]
> +SEC("xdp3")
> +int xdp_prog3(struct xdp_md *ctx)

You changed the filename from xdp3 to xdp_router_ipv4, but you didn't
change the name in the code.

> +{
> + void *data_end = (void *)(long)ctx->data_end;
> + __be64 *dest_mac = NULL, *src_mac = NULL;
> + void *data = (void *)(long)ctx->data;
> + struct trie_value *prefix_value;
> + int rc = XDP_DROP, forward_to;
> + struct ethhdr *eth = data;
> + union key_4 key4;
> + long *value;
> + u16 h_proto;
> + u32 ipproto;
> + u64 nh_off;
> +
[..]
> + if (h_proto == htons(ETH_P_ARP)) {
> + return XDP_PASS;
> + } else if (h_proto == htons(ETH_P_IP)) {
> + struct direct_map *direct_entry;
> + __be32 src_ip = 0, dest_ip = 0;
> +
> + ipproto = parse_ipv4(data, nh_off, data_end, _ip, _ip);
> + direct_entry = (struct direct_map *)bpf_map_lookup_elem
> + (_match, _ip);

I don't think you need this type-casting.


> + /* Check for exact match, this would give a faster lookup*/
> + if (direct_entry && direct_entry->mac && direct_entry->arp.mac) 
> {
> + src_mac = _entry->mac;
> + dest_mac = _entry->arp.mac;
> + forward_to = direct_entry->ifindex;
> + } else {
> + /* Look up in the trie for lpm*/
> + key4.b32[0] = 32;
> + key4.b8[4] = dest_ip & 0xff;
> + key4.b8[5] = (dest_ip >> 8) & 0xff;
> + key4.b8[6] = (dest_ip >> 16) & 0xff;
> + key4.b8[7] = (dest_ip >> 24) & 0xff;
> + prefix_value = ((struct trie_value *)bpf_map_lookup_elem
> + (_map, ));
> + if (!prefix_value)
> + return XDP_DROP;
> + src_mac = _value->value;
> + if (!src_mac)
> + return XDP_DROP;
> + dest_mac = (__be64 *)bpf_map_lookup_elem(_table, 
> _ip);
> + if (!dest_mac) {
> + if (!prefix_value->gw)
> + return XDP_DROP;
> + dest_ip = *(__be32 *)_value->gw;
> + dest_mac = (__be64 
> *)bpf_map_lookup_elem(_table, _ip);
> + }
> + forward_to = prefix_value->ifindex;
> + }
> + } else {
> + ipproto = 0;
> + }
> + if (src_mac && dest_mac) {
> + set_src_dst_mac(data, src_mac, dest_mac);
> + value = bpf_map_lookup_elem(, );
> + if (value)
> + *value += 1;
> + return  bpf_redirect(forward_to, 0);

Notice that using bpf_redirect() is slow, while using bpf_redirect_map()
is fast.  Using bpf_redirect_map() requires a little more bookkeeping,
but the performance gain is worth it.

Raw benchmarks on my system show:
 * bpf_redirect() max at  7Mpps
 * bpf_redirect_map() at 13Mpps

Trying out your program on my systems showed it jumping between 5.6Mpps
and 7Mpps, and this seems to be correlated with matching direct_entry.
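
For reference, the extra bookkeeping is essentially a devmap holding the
egress devices, which user space populates; a sketch in the style of the
samples/bpf programs of this era (map name, size and section name are
assumptions, not taken from your patch):

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") tx_port = {
	.type        = BPF_MAP_TYPE_DEVMAP,
	.key_size    = sizeof(int),
	.value_size  = sizeof(int),
	.max_entries = 64,
};

SEC("xdp_redirect_map_sketch")
int xdp_redirect_map_sketch(struct xdp_md *ctx)
{
	int egress_idx = 0;	/* slot in tx_port, filled in by user space */

	/* The devmap lets the kernel do bulking/flushing per egress device,
	 * which is where the gain over plain bpf_redirect() comes from. */
	return bpf_redirect_map(&tx_port, egress_idx, 0);
}

char _license[] SEC("license") = "GPL";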

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH v2] xdp: Sample xdp program implementing ip forward

2017-10-10 Thread Jesper Dangaard Brouer
On Tue, 10 Oct 2017 12:58:52 +0530
Christina Jacob <christina.jacob.koik...@gmail.com> wrote:

> +SEC("xdp3")
> +int xdp_prog3(struct xdp_md *ctx)
> +{
> + void *data_end = (void *)(long)ctx->data_end;
> + void *data = (void *)(long)ctx->data;
> + struct ethhdr *eth = data;
> + int rc = XDP_DROP, forward_to;
> + long *value;
> + struct trie_value *prefix_value;
> + long *dest_mac = NULL, *src_mac = NULL;
> + u16 h_proto;
> + u64 nh_off;
> + u32 ipproto;
> + union key_4 key4;

Reverse-xmas tree rule: Prefer ordering declarations longest to shortest.
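
Purely as an illustration (hand-reordered, not compile-tested), the
declaration block above would then read:

	void *data_end = (void *)(long)ctx->data_end;
	long *dest_mac = NULL, *src_mac = NULL;
	void *data = (void *)(long)ctx->data;
	struct trie_value *prefix_value;
	int rc = XDP_DROP, forward_to;
	struct ethhdr *eth = data;
	union key_4 key4;
	long *value;
	u32 ipproto;
	u16 h_proto;
	u64 nh_off;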


[...]
> + if (h_proto == htons(ETH_P_ARP)) {
> + return XDP_PASS;
> + } else if (h_proto == htons(ETH_P_IP)) {
> + int src_ip = 0, dest_ip = 0;
> + struct direct_map *direct_entry;
> +
> + ipproto = parse_ipv4(data, nh_off, data_end, _ip, _ip);
> + direct_entry = (struct direct_map *)bpf_map_lookup_elem
> + (_match, _ip);
> + /*check for exact match, this would give a faster lookup*/
> + if (direct_entry && direct_entry->mac && direct_entry->arp.mac) 
> {
> + src_mac = _entry->mac;
> + dest_mac = _entry->arp.mac;
> + forward_to = direct_entry->ifindex;
> + } else {
> + /*Look up in the trie for lpm*/
> + key4.b32[0] = 32;
> + key4.b8[4] = dest_ip % 0x100;
> + key4.b8[5] = (dest_ip >> 8) % 0x100;
> + key4.b8[6] = (dest_ip >> 16) % 0x100;
> + key4.b8[7] = (dest_ip >> 24) % 0x100;
> + prefix_value = ((struct trie_value *)bpf_map_lookup_elem
> + (_map, ));
> + if (!prefix_value) {
> + return XDP_DROP;
> + } else {
> + src_mac = _value->value;
> + if (src_mac) {
> + dest_mac = (long *)bpf_map_lookup_elem
> + (_table, _ip);
> + if (!dest_mac) {
> + if (prefix_value->gw) {
> + dest_ip = *(__be32 
> *)_value->gw;
> + dest_mac = (long 
> *)bpf_map_lookup_elem(_table, _ip);
> + } else {
> + return XDP_DROP;
> + }
> + }
> + forward_to = prefix_value->ifindex;
> + } else {
> + return XDP_DROP;
> + }
> + }
> + }
> + } else {
> + ipproto = 0;
> + }

The nesting in this function is getting annoying to read.
Kernel code often uses "early return" style coding to solve this.

A quick search turns up this guide, look at section "Early return"
 
https://en.wikibooks.org/wiki/Computer_Programming/Coding_Style/Minimize_nesting
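
As a rough, untested sketch (reusing the names from your patch), the
else-branch quoted above flattens out like this with early returns:

	/* Illustrative only: same LPM + ARP lookup, early-return style */
	key4.b32[0] = 32;
	key4.b8[4] = dest_ip % 0x100;
	key4.b8[5] = (dest_ip >> 8) % 0x100;
	key4.b8[6] = (dest_ip >> 16) % 0x100;
	key4.b8[7] = (dest_ip >> 24) % 0x100;

	prefix_value = bpf_map_lookup_elem(&lpm_map, &key4);
	if (!prefix_value)
		return XDP_DROP;

	src_mac = &prefix_value->value;

	dest_mac = bpf_map_lookup_elem(&arp_table, &dest_ip);
	if (!dest_mac) {
		if (!prefix_value->gw)
			return XDP_DROP;
		dest_ip = *(__be32 *)&prefix_value->gw;
		dest_mac = bpf_map_lookup_elem(&arp_table, &dest_ip);
	}
	forward_to = prefix_value->ifindex;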

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH v2] XDP Program for Ip forward

2017-10-10 Thread Jesper Dangaard Brouer
On Tue, 10 Oct 2017 15:12:31 +0200
Jesper Dangaard Brouer <bro...@redhat.com> wrote:

> I'll try to test/benchmark your program...

In my initial testing, I cannot get this to work...

You do seem to XDP_REDIRECT out the right interface, but you have an
error with setting the correct MAC address.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH v2] XDP Program for Ip forward

2017-10-10 Thread Jesper Dangaard Brouer

On Tue, 10 Oct 2017 12:58:51 +0530 Christina Jacob 
<christina.jacob.koik...@gmail.com> wrote:

> The patch below implements port to port forwarding through route table and arp
> table lookup for ipv4 packets using bpf_redirect helper function and lpm_trie
> map.  This has an improved performance over the normal kernel stack ip 
> forward.
> 
> Implementation details.
> ---
[...]
> 
> In the xdp3_user.c,
> 
[...]
> 
> In the xdp3_kern.c,

You forgot to update the program name in the cover letter.

> The array map for the 32 bit mask entries checked to see if there is a key 
> that
> exactly matches with the destination ip. If it has a non zero destination mac
> entry then the xdp data is updated accordingly Otherwise a proper route and 
> arp table lookup is done using the lpm_trie and the arp table array map.
>   
>   Usage: as ./xdp3 -S  (-S for
^
The executable name also changed.

>   generic xdp implementation ifindex- the index of the interface to which
>   the xdp program has to be attached.) in 4.14-rc3 kernel.
> 
> Changes from v1 to v2
> -
>  
> * As suggested by Jesper Dangaard Brouer
>   1. Changed the program name to  list xdp_router_ipv4

Thanks

>   2. Changed the commandline arguments from ifindex list to interface name
>   Usage : ./xdp_router_ipv4 [-S] 
>   -S for generic xdp implementation
>   -interface name list is the list of interfaces to which
>   the xdp program should attach to

Okay, you chose a slightly different way of implementing this, but it
shouldn't matter.

I'll try to test/benchmark your program...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH 0/1] XDP Program for Ip forward

2017-10-04 Thread Jesper Dangaard Brouer

First of all thank you for working on this.

On Tue,  3 Oct 2017 13:07:04 +0530 cjacob <christina.ja...@cavium.com> wrote:

>   Usage: ./xdp3 [-S]  
> 
>   -S to choose generic xdp implementation 
> [Default is driver xdp implementation]
>   ifindex - the index of the interface to which 
>   the xdp program has to be attached.
>   in 4.14-rc3 kernel.

I would prefer if we can name the program something more descriptive
than "xdp3".  What about "xdp_redirect_router" or "xdp_router_ipv4" ?

I would also appreciate it if we could stop using ifindexes, and instead
use normal device ifnames, and simply look up the ifindex in the
program via if_nametoindex(ifname); see how in [1] and [2].

When adding more ifnames you can just use the same trick as with
multiple --cpu options like [1] and [2].

[1] 
http://lkml.kernel.org/r/150711864538.9499.11712573036995600273.stgit@firesoul
[2] 
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c
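
E.g. something along these lines (illustrative only, the helper name is
made up) in the sample's user space loader:

	#include <net/if.h>	/* if_nametoindex() */
	#include <stdio.h>

	/* Resolve a user-supplied interface name to an ifindex (0 on error) */
	static unsigned int lookup_ifindex(const char *ifname)
	{
		unsigned int ifindex = if_nametoindex(ifname);

		if (!ifindex)
			fprintf(stderr, "ERR: unknown interface \"%s\"\n", ifname);
		return ifindex;
	}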

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH v2 0/3] Separate NUMA statistics from zone statistics

2017-08-25 Thread Jesper Dangaard Brouer
On Fri, 25 Aug 2017 09:04:37 +0100
Mel Gorman <mgor...@techsingularity.net> wrote:

> On Thu, Aug 24, 2017 at 05:59:58PM +0800, Kemi Wang wrote:
> > Each page allocation updates a set of per-zone statistics with a call to
> > zone_statistics(). As discussed in 2017 MM summit, these are a substantial
> > source of overhead in the page allocator and are very rarely consumed. This
> > significant overhead in cache bouncing caused by zone counters (NUMA
> > associated counters) update in parallel in multi-threaded page allocation
> > (pointed out by Dave Hansen).
> >   
> 
> For the series;
> 
> Acked-by: Mel Gorman <mgor...@techsingularity.net>
> 

I'm very happy to see these issues from our MM-summit interactions
being worked on. I would like to provide/have a:

Reported-by: Jesper Dangaard Brouer <bro...@redhat.com>

As I'm not sure an acked-by from me has any value/merit here ;-)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


[tip:perf/core] tracing, perf: Adjust code layout in get_recursion_context()

2017-08-25 Thread tip-bot for Jesper Dangaard Brouer
Commit-ID:  d0618410eced4eb092295fad10312a4545fcdfaf
Gitweb: http://git.kernel.org/tip/d0618410eced4eb092295fad10312a4545fcdfaf
Author: Jesper Dangaard Brouer <bro...@redhat.com>
AuthorDate: Tue, 22 Aug 2017 19:22:43 +0200
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Fri, 25 Aug 2017 11:04:18 +0200

tracing, perf: Adjust code layout in get_recursion_context()

In an XDP redirect applications using tracepoint xdp:xdp_redirect to
diagnose TX overrun, I noticed perf_swevent_get_recursion_context()
was consuming 2% CPU. This was reduced to 1.85% with this simple
change.

Looking at the annotated asm code, it was clear that the unlikely case
in_nmi() test was chosen (by the compiler) as the most likely
event/branch.  This small adjustment makes the compiler (GCC version
7.1.1 20170622 (Red Hat 7.1.1-3)) put in_nmi() as an unlikely branch.

Signed-off-by: Jesper Dangaard Brouer <bro...@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@kernel.org>
Cc: Jiri Olsa <jo...@kernel.org>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Link: 
http://lkml.kernel.org/r/150342256382.16595.986861478681783732.stgit@firesoul
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 kernel/events/internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 5377c59..843e970 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -208,7 +208,7 @@ static inline int get_recursion_context(int *recursion)
 {
int rctx;
 
-   if (in_nmi())
+   if (unlikely(in_nmi()))
rctx = 3;
else if (in_irq())
rctx = 2;


[PATCH V2] trace: adjust code layout in get_recursion_context

2017-08-22 Thread Jesper Dangaard Brouer
In an XDP redirect application using the tracepoint xdp:xdp_redirect to
diagnose TX overrun, I noticed perf_swevent_get_recursion_context()
was consuming 2% CPU. This was reduced to 1.85% with this simple
change.

Looking at the annotated asm code, it was clear that the unlikely case
in_nmi() test was chosen (by the compiler) as the most likely
event/branch.  This small adjustment makes the compiler (gcc version
7.1.1 20170622 (Red Hat 7.1.1-3)) put in_nmi() as an unlikely branch.

Signed-off-by: Jesper Dangaard Brouer <bro...@redhat.com>
---
 kernel/events/internal.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 486fd78eb8d5..e1a7ac7bd686 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -208,7 +208,7 @@ static inline int get_recursion_context(int *recursion)
 {
int rctx;
 
-   if (in_nmi())
+   if (unlikely(in_nmi()))
rctx = 3;
else if (in_irq())
rctx = 2;



Re: [PATCH] trace: adjust code layout in get_recursion_context

2017-08-22 Thread Jesper Dangaard Brouer
On Tue, 22 Aug 2017 19:00:39 +0200
Jesper Dangaard Brouer <bro...@redhat.com> wrote:

> On Tue, 22 Aug 2017 17:20:25 +0200
> Peter Zijlstra <pet...@infradead.org> wrote:
> 
> > On Tue, Aug 22, 2017 at 05:14:10PM +0200, Peter Zijlstra wrote:  
> > > On Tue, Aug 22, 2017 at 04:40:24PM +0200, Jesper Dangaard Brouer wrote:   
> > >  
> > > > In an XDP redirect applications using tracepoint xdp:xdp_redirect to
> > > > diagnose TX overrun, I noticed perf_swevent_get_recursion_context()
> > > > was consuming 2% CPU. This was reduced to 1.6% with this simple
> > > > change.
> > > 
> > > It is also incorrect. What do you suppose it now returns when the NMI
> > > hits a hard IRQ which hit during a Soft IRQ?
> > 
> > Does this help any? I can imagine the compiler could struggle to CSE
> > preempt_count() seeing how its an asm thing.  
> 
> Nope, it does not help (see assembly below, with perf percentages).
> 
> But I think I can achieve that I want by a simple unlikely(in_nmi()) 
> annotation.

Like:

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 486fd78eb8d5..e1a7ac7bd686 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -208,7 +208,7 @@ static inline int get_recursion_context(int *recursion)
 {
int rctx;
 
-   if (in_nmi())
+   if (unlikely(in_nmi()))
rctx = 3;
else if (in_irq())
rctx = 2;

Testing this shows I get the expected result, although the 2% is
reduced to 1.85% (and not 1.6% as before).


perf_swevent_get_recursion_context  /proc/kcore
   │
   │Disassembly of section load0:
   │
   │811465c0 :
  4.94 │  push   %rbp
  2.56 │  mov$0x14d20,%rax
 14.81 │  mov%rsp,%rbp
  3.47 │  add%gs:0x7eec3b5d(%rip),%rax
  0.91 │  lea0x34(%rax),%rdx
  1.46 │  mov%gs:0x7eec5db2(%rip),%eax
  8.04 │  test   $0x10,%eax
   │↓ jne59
  3.11 │  test   $0xf,%eax
   │↓ jne4d
  0.37 │  test   $0xff,%ah
  1.83 │  setne  %cl
  9.87 │  movzbl %cl,%eax
  2.01 │  movzbl %cl,%ecx
  1.65 │  shl$0x2,%rcx
  4.39 │3c:   add%rcx,%rdx
 29.62 │  mov(%rdx),%ecx
  2.93 │  test   %ecx,%ecx
   │↓ jne65
  0.55 │  movl   $0x1,(%rdx)
  2.56 │  pop%rbp
  4.94 │← retq
   │4d:   mov$0x8,%ecx
   │  mov$0x2,%eax
   │↑ jmp3c
   │59:   mov$0xc,%ecx
   │  mov$0x3,%eax
   │↑ jmp3c
   │65:   mov$0x,%eax
   │  pop%rbp
   │← retq



> > ---
> >  kernel/events/internal.h | 7 ---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/events/internal.h b/kernel/events/internal.h
> > index 486fd78eb8d5..e0b5b8fa83a2 100644
> > --- a/kernel/events/internal.h
> > +++ b/kernel/events/internal.h
> > @@ -206,13 +206,14 @@ perf_callchain(struct perf_event *event, struct 
> > pt_regs *regs);
> >  
> >  static inline int get_recursion_context(int *recursion)
> >  {
> > +   unsigned int pc = preempt_count();
> > int rctx;
> >  
> > -   if (in_nmi())
> > +   if (pc & NMI_MASK)
> > rctx = 3;
> > -   else if (in_irq())
> > +   else if (pc & HARDIRQ_MASK)
> > rctx = 2;
> > -   else if (in_softirq())
> > +   else if (pc & SOFTIRQ_OFFSET)  
> 
> Hmmm... shouldn't this be SOFTIRQ_MASK?
> 
> > rctx = 1;
> > else
> > rctx = 0;  
> 
> perf_swevent_get_recursion_context  /proc/kcore
>│
>│
>│Disassembly of section load0:
>│
>│811465c0 :
>  13.32 │  push   %rbp
>   1.43 │  mov$0x14d20,%rax
>   5.12 │  mov%rsp,%rbp
>   6.56 │  add%gs:0x7eec3b5d(%rip),%rax
>   0.72 │  lea0x34(%rax),%rdx
>   0.31 │  mov%gs:0x7eec5db2(%rip),%eax
>   2.46 │  mov%eax,%ecx
>   6.86 │  and$0x7fff,%ecx
>   0.72 │  test   $0x10,%eax
>│↓ jne40
>│  test   $0xf,%eax
>   0.41 │↓ je 5b
>│  mov$0x8,%ecx
>│  mov$0x2,%eax
>│↓ jmp4a
>│40:   mov$0xc,%ecx
>│  mov$0x3,%eax
>   2.05 │4a:   add%rcx,%rdx
>  16.60 │  mov(%rdx),%ecx
>   2.66 │  test   %ecx,%ecx
>│↓ jne6d
>   1.33 │  movl   $0x1,(%rdx)
>   1.54 │  pop%rbp
>   4.51 │← retq
>   3.89 │5b:   shr$0x8,%ecx
>   9.53 │  and$0x1,%ecx
>   0.61 │  movzbl %cl,%eax
>   0.92 │  movzbl %cl,%ecx
>   4.30 │  shl$0x2,%rcx
>  14.14 │↑ jmp4a
>│6d:   mov$0x,%eax
>│  pop%rbp
>│← retq
>│  xchg   %ax,%ax
> 
> 
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH] trace: adjust code layout in get_recursion_context

2017-08-22 Thread Jesper Dangaard Brouer
On Tue, 22 Aug 2017 17:20:25 +0200
Peter Zijlstra <pet...@infradead.org> wrote:

> On Tue, Aug 22, 2017 at 05:14:10PM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 22, 2017 at 04:40:24PM +0200, Jesper Dangaard Brouer wrote:  
> > > In an XDP redirect applications using tracepoint xdp:xdp_redirect to
> > > diagnose TX overrun, I noticed perf_swevent_get_recursion_context()
> > > was consuming 2% CPU. This was reduced to 1.6% with this simple
> > > change.  
> > 
> > It is also incorrect. What do you suppose it now returns when the NMI
> > hits a hard IRQ which hit during a Soft IRQ?  
> 
> Does this help any? I can imagine the compiler could struggle to CSE
> preempt_count() seeing how its an asm thing.

Nope, it does not help (see assembly below, with perf percentages).

But I think I can achieve what I want with a simple unlikely(in_nmi()) annotation.

> ---
>  kernel/events/internal.h | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/events/internal.h b/kernel/events/internal.h
> index 486fd78eb8d5..e0b5b8fa83a2 100644
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -206,13 +206,14 @@ perf_callchain(struct perf_event *event, struct pt_regs 
> *regs);
>  
>  static inline int get_recursion_context(int *recursion)
>  {
> + unsigned int pc = preempt_count();
>   int rctx;
>  
> - if (in_nmi())
> + if (pc & NMI_MASK)
>   rctx = 3;
> - else if (in_irq())
> + else if (pc & HARDIRQ_MASK)
>   rctx = 2;
> - else if (in_softirq())
> + else if (pc & SOFTIRQ_OFFSET)

Hmmm... shouldn't this be SOFTIRQ_MASK?

>   rctx = 1;
>   else
>   rctx = 0;

perf_swevent_get_recursion_context  /proc/kcore
   │
   │
   │Disassembly of section load0:
   │
   │811465c0 :
 13.32 │  push   %rbp
  1.43 │  mov$0x14d20,%rax
  5.12 │  mov%rsp,%rbp
  6.56 │  add%gs:0x7eec3b5d(%rip),%rax
  0.72 │  lea0x34(%rax),%rdx
  0.31 │  mov%gs:0x7eec5db2(%rip),%eax
  2.46 │  mov%eax,%ecx
  6.86 │  and$0x7fff,%ecx
  0.72 │  test   $0x10,%eax
   │↓ jne40
   │  test   $0xf,%eax
  0.41 │↓ je 5b
   │  mov$0x8,%ecx
   │  mov$0x2,%eax
   │↓ jmp4a
   │40:   mov$0xc,%ecx
   │  mov$0x3,%eax
  2.05 │4a:   add%rcx,%rdx
 16.60 │  mov(%rdx),%ecx
  2.66 │  test   %ecx,%ecx
   │↓ jne6d
  1.33 │  movl   $0x1,(%rdx)
  1.54 │  pop%rbp
  4.51 │← retq
  3.89 │5b:   shr$0x8,%ecx
  9.53 │  and$0x1,%ecx
  0.61 │  movzbl %cl,%eax
  0.92 │  movzbl %cl,%ecx
  4.30 │  shl    $0x2,%rcx
 14.14 │    ↑ jmp4a
   │6d:   mov$0x,%eax
   │  pop%rbp
   │← retq
   │  xchg   %ax,%ax



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


[PATCH] trace: adjust code layout in get_recursion_context

2017-08-22 Thread Jesper Dangaard Brouer
In an XDP redirect application using the tracepoint xdp:xdp_redirect to
diagnose TX overrun, I noticed perf_swevent_get_recursion_context()
was consuming 2% CPU. This was reduced to 1.6% with this simple
change.

Looking at the annotated asm code, it was clear that the unlikely case
in_nmi() test was chosen (by the compiler) as the most likely
event/branch.  This small adjustment makes the compiler (gcc version
7.1.1 20170622 (Red Hat 7.1.1-3)) put in_nmi() as an unlikely branch.

Signed-off-by: Jesper Dangaard Brouer <bro...@redhat.com>
---
 kernel/events/internal.h |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 486fd78eb8d5..56aa462760fa 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -208,12 +208,12 @@ static inline int get_recursion_context(int *recursion)
 {
int rctx;
 
-   if (in_nmi())
-   rctx = 3;
+   if (in_softirq())
+   rctx = 1;
else if (in_irq())
rctx = 2;
-   else if (in_softirq())
-   rctx = 1;
+   else if (in_nmi())
+   rctx = 3;
else
rctx = 0;
 



Re: [PATCH 0/2] Separate NUMA statistics from zone statistics

2017-08-15 Thread Jesper Dangaard Brouer
On Tue, 15 Aug 2017 16:45:34 +0800
Kemi Wang <kemi.w...@intel.com> wrote:

> Each page allocation updates a set of per-zone statistics with a call to
> zone_statistics(). As discussed in 2017 MM submit, these are a substantial
 ^^ should be "summit"
> source of overhead in the page allocator and are very rarely consumed. This
> significant overhead in cache bouncing caused by zone counters (NUMA
> associated counters) update in parallel in multi-threaded page allocation
> (pointed out by Dave Hansen).

Hi Kemi

Thanks a lot for following up on this work. A link to the MM summit slides:
 
http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf

> To mitigate this overhead, this patchset separates NUMA statistics from
> zone statistics framework, and update NUMA counter threshold to a fixed
> size of 32765, as a small threshold greatly increases the update frequency
> of the global counter from local per cpu counter (suggested by Ying Huang).
> The rationality is that these statistics counters don't need to be read
> often, unlike other VM counters, so it's not a problem to use a large
> threshold and make readers more expensive.
> 
> With this patchset, we see 26.6% drop of CPU cycles(537-->394, see below)
> for per single page allocation and reclaim on Jesper's page_bench03
> benchmark. Meanwhile, this patchset keeps the same style of virtual memory
> statistics with little end-user-visible effects (see the first patch for
> details), except that the number of NUMA items in each cpu
> (vm_numa_stat_diff[]) is added to zone->vm_numa_stat[] when a user *reads*
> the value of NUMA counter to eliminate deviation.

I'm very happy to see that you found my kernel module for benchmarking useful 
:-)

> I did an experiment of single page allocation and reclaim concurrently
> using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based server
> (88 processors with 126G memory) with different size of threshold of pcp
> counter.
> 
> Benchmark provided by Jesper D Broucer(increase loop times to 1000):
 ^^^
You mis-spelled my last name, it is "Brouer".

> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
> 
>Threshold   CPU cyclesThroughput(88 threads)
>   32799 241760478
>   64640 301628829
>   125   537 358906028 <==> system by default
>   256   468 412397590
>   512   428 450550704
>   4096  399 482520943
>   2 394 489009617
>   3 395 488017817
>   32765 394(-26.6%) 488932078(+36.2%) <==> with this patchset
>   N/A   342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
> 
> Kemi Wang (2):
>   mm: Change the call sites of numa statistics items
>   mm: Update NUMA counter threshold size
> 
>  drivers/base/node.c|  22 ---
>  include/linux/mmzone.h |  25 +---
>  include/linux/vmstat.h |  33 ++
>  mm/page_alloc.c    |  10 +--
>  mm/vmstat.c| 162 
> +++--
>  5 files changed, 227 insertions(+), 25 deletions(-)
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


[PATCH] trivial: tracing: spelling fixes for CONFIG_HWLAT_TRACER

2017-06-13 Thread Jesper Dangaard Brouer
Trivial spelling fixes for Kconfig help text of config HWLAT_TRACER.

Fixes: e7c15cd8a113 ("tracing: Added hardware latency tracer")
Signed-off-by: Jesper Dangaard Brouer <bro...@redhat.com>
Acked-by: Steven Rostedt <rost...@goodmis.org>
---
Resend of: https://patchwork.kernel.org/patch/9405223/

 kernel/trace/Kconfig |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 7e06f04e98fe..f4215132cb91 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -224,7 +224,7 @@ config HWLAT_TRACER
select GENERIC_TRACER
help
 This tracer, when enabled will create one or more kernel threads,
-depening on what the cpumask file is set to, which each thread
+depending on what the cpumask file is set to, which each thread
 spinning in a loop looking for interruptions caused by
 something other than the kernel. For example, if a
 System Management Interrupt (SMI) takes a noticeable amount of
@@ -239,7 +239,7 @@ config HWLAT_TRACER
 iteration
 
 A kernel thread is created that will spin with interrupts disabled
-for "width" microseconds in every "widow" cycle. It will not spin
+for "width" microseconds in every "window" cycle. It will not spin
 for "window - width" microseconds, where the system can
 continue to operate.
 



Re: [PATCH 1/3] ptr_ring: batch ring zeroing

2017-05-10 Thread Jesper Dangaard Brouer
On Tue, 9 May 2017 16:33:14 +0300
"Michael S. Tsirkin" <m...@redhat.com> wrote:

> On Sat, Apr 08, 2017 at 02:14:08PM +0200, Jesper Dangaard Brouer wrote:
> > On Fri, 7 Apr 2017 08:49:57 +0300
> > "Michael S. Tsirkin" <m...@redhat.com> wrote:
> >   
> > > A known weakness in ptr_ring design is that it does not handle well the
> > > situation when ring is almost full: as entries are consumed they are
> > > immediately used again by the producer, so consumer and producer are
> > > writing to a shared cache line.
> > > 
> > > To fix this, add batching to consume calls: as entries are
> > > consumed do not write NULL into the ring until we get
> > > a multiple (in current implementation 2x) of cache lines
> > > away from the producer. At that point, write them all out.
> > > 
> > > We do the write out in the reverse order to keep
> > > producer from sharing cache with consumer for as long
> > > as possible.
> > > 
> > > Writeout also triggers when ring wraps around - there's
> > > no special reason to do this but it helps keep the code
> > > a bit simpler.
> > > 
> > > What should we do if getting away from producer by 2 cache lines
> > > would mean we are keeping the ring moe than half empty?
> > > Maybe we should reduce the batching in this case,
> > > current patch simply reduces the batching.
> > > 
> > > Notes:
> > > - it is no longer true that a call to consume guarantees
> > >   that the following call to produce will succeed.
> > >   No users seem to assume that.
> > > - batching can also in theory reduce the signalling rate:
> > >   users that would previously send interrups to the producer
> > >   to wake it up after consuming each entry would now only
> > >   need to do this once in a batch.
> > >   Doing this would be easy by returning a flag to the caller.
> > >   No users seem to do signalling on consume yet so this was not
> > >   implemented yet.
> > > 
> > > Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> > > ---
> > > 
> > > Jason, I am curious whether the following gives you some of
> > > the performance boost that you see with vhost batching
> > > patches. Is vhost batching on top still helpful?
> > > 
> > >  include/linux/ptr_ring.h | 63 
> > > +---
> > >  1 file changed, 54 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > > index 6c70444..6b2e0dd 100644
> > > --- a/include/linux/ptr_ring.h
> > > +++ b/include/linux/ptr_ring.h
> > > @@ -34,11 +34,13 @@
> > >  struct ptr_ring {
> > >   int producer cacheline_aligned_in_smp;
> > >   spinlock_t producer_lock;
> > > - int consumer cacheline_aligned_in_smp;
> > > + int consumer_head cacheline_aligned_in_smp; /* next valid entry */
> > > + int consumer_tail; /* next entry to invalidate */
> > >   spinlock_t consumer_lock;
> > >   /* Shared consumer/producer data */
> > >   /* Read-only by both the producer and the consumer */
> > >   int size cacheline_aligned_in_smp; /* max entries in queue */
> > > + int batch; /* number of entries to consume in a batch */
> > >   void **queue;
> > >  };
> > >  
> > > @@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring 
> > > *r, void *ptr)
> > >  static inline void *__ptr_ring_peek(struct ptr_ring *r)
> > >  {
> > >   if (likely(r->size))
> > > - return r->queue[r->consumer];
> > > + return r->queue[r->consumer_head];
> > >   return NULL;
> > >  }
> > >  
> > > @@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring 
> > > *r)
> > >  /* Must only be called after __ptr_ring_peek returned !NULL */
> > >  static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> > >  {
> > > - r->queue[r->consumer++] = NULL;
> > > - if (unlikely(r->consumer >= r->size))
> > > - r->consumer = 0;
> > > + /* Fundamentally, what we want to do is update consumer
> > > +  * index and zero out the entry so producer can reuse it.
> > > +  * Doing it naively at each consume would be as simple as:
> > > +  *   r->queue[r->consumer++] = NULL;
> > > +  *   if (

Re: [PATCH] sched/cputime: Fix ksoftirqd cputime accounting regression

2017-04-26 Thread Jesper Dangaard Brouer
On Tue, 25 Apr 2017 16:10:48 +0200
Frederic Weisbecker  wrote:

> irq_time_read() returns the irqtime minus the ksoftirqd time. This
> is necessary because irq_time_read() is used to substract the IRQ time
> from the sum_exec_runtime of a task. If we were to include the softirq
> time of ksoftirqd, this task would substract its own CPU time everytime
> it updates ksoftirqd->sum_exec_runtime which would therefore never
> progress.
> 
> But this behaviour got broken by commit a499a5a14db:
>   ("sched/cputime: Increment kcpustat directly on irqtime account")
> which now includes ksoftirqd softirq time in the time returned by
> irq_time_read().
> 
> This has resulted in wrong ksotfirqd cputime reported to userspace
 ^
Typo in the commit message; I guess the maintainer can fix/amend that.

> through /proc/stat and thus "top" not showing ksoftirqd when it should
> after intense networking load.
> 
> ksoftirqd->stime happens to be correct but it gets scaled down by
> sum_exec_runtime through task_cputime_adjusted().
> 
> To fix this, just account the strict IRQ time in a separate counter and
> use it to report the IRQ time.
> 
> Reported-and-tested-by: Jesper Dangaard Brouer 

Acked-by: Jesper Dangaard Brouer 

Confirming I have tested the patch before Frederic posted it, and I have
been running with it applied on my net-next testlab kernels since Monday.

And it does fix the bisected regression I reported here:
  http://lkml.kernel.org/r/20170328101403.34a82...@redhat.com


> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Stanislaw Gruszka 
> Cc: Wanpeng Li 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Linus Torvalds 

Please add:

Fixes: a499a5a14dbd ("sched/cputime: Increment kcpustat directly on irqtime account")

--Jesper


> Signed-off-by: Frederic Weisbecker 

> ---
>  kernel/sched/cputime.c | 27 ---
>  kernel/sched/sched.h   |  9 +++--
>  2 files changed, 23 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index f3778e2b..aea3135 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -34,6 +34,18 @@ void disable_sched_clock_irqtime(void)
>   sched_clock_irqtime = 0;
>  }
>  
> +static void irqtime_account_delta(struct irqtime *irqtime, u64 delta,
> +   enum cpu_usage_stat idx)
> +{
> + u64 *cpustat = kcpustat_this_cpu->cpustat;
> +
> + u64_stats_update_begin(&irqtime->sync);
> + cpustat[idx] += delta;
> + irqtime->total += delta;
> + irqtime->tick_delta += delta;
> + u64_stats_update_end(&irqtime->sync);
> +}
> +
>  /*
>   * Called before incrementing preempt_count on {soft,}irq_enter
>   * and before decrementing preempt_count on {soft,}irq_exit.
> @@ -41,7 +53,6 @@ void disable_sched_clock_irqtime(void)
>  void irqtime_account_irq(struct task_struct *curr)
>  {
>   struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime);
> - u64 *cpustat = kcpustat_this_cpu->cpustat;
>   s64 delta;
>   int cpu;
>  
> @@ -52,22 +63,16 @@ void irqtime_account_irq(struct task_struct *curr)
>   delta = sched_clock_cpu(cpu) - irqtime->irq_start_time;
>   irqtime->irq_start_time += delta;
>  
> - u64_stats_update_begin(&irqtime->sync);
>   /*
>* We do not account for softirq time from ksoftirqd here.
>* We want to continue accounting softirq time to ksoftirqd thread
>* in that case, so as not to confuse scheduler with a special task
>* that do not consume any time, but still wants to run.
>*/
> - if (hardirq_count()) {
> - cpustat[CPUTIME_IRQ] += delta;
> - irqtime->tick_delta += delta;
> - } else if (in_serving_softirq() && curr != this_cpu_ksoftirqd()) {
> - cpustat[CPUTIME_SOFTIRQ] += delta;
> - irqtime->tick_delta += delta;
> - }
> -
> - u64_stats_update_end(&irqtime->sync);
> + if (hardirq_count())
> + irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
> + else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
> + irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
>  }
>  EXPORT_SYMBOL_GPL(irqtime_account_irq);
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 5cbf922..767aab3 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1869,6 +1869,7 @@ static inline void nohz_balance_exit_idle(unsigned int 
> cpu) { }
>  
>  #ifdef CONFIG_IRQ_TIME_ACCOUNTING
>  struct irqtime {
> + u64 total;
>   u64 tick_delta;
>   u64  
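
The sched.h hunk is cut off here by the archive, right after the new
"total" field.  For context, a hedged sketch of the read side the commit
message implies ("use it to report the IRQ time"); the body of
irq_time_read() below is an assumed reconstruction around the existing
u64_stats seqcount and the per-cpu cpu_irqtime variable, not a quote of
the patch:

static inline u64 irq_time_read(int cpu)
{
	struct irqtime *irqtime = &per_cpu(cpu_irqtime, cpu);
	unsigned int seq;
	u64 total;

	do {
		/* pairs with u64_stats_update_begin/end() in
		 * irqtime_account_delta() shown above
		 */
		seq = u64_stats_fetch_begin(&irqtime->sync);
		total = irqtime->total;
	} while (u64_stats_fetch_retry(&irqtime->sync, seq));

	return total;
}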

Heads-up: two regressions in v4.11-rc series

2017-04-20 Thread Jesper Dangaard Brouer
Hi Linus,

Just wanted to give a heads-up on two regressions in 4.11-rc series.

(1) page allocator optimization revert

Mel Gorman and I have been playing with optimizing the page allocator,
but Tariq spotted that we caused a regression for (NIC) drivers that
refill DMA RX rings in softirq context.

The end result was a revert, and this is waiting in AKPM's quilt queue:
 
http://ozlabs.org/~akpm/mmots/broken-out/revert-mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests.patch


(2) Busy softirq can cause userspace not to be scheduled

I bisected the problem to a499a5a14dbd ("sched/cputime: Increment
kcpustat directly on irqtime account"). See email thread with
 Subject: Bisected softirq accounting issue in v4.11-rc1~170^2~28
 http://lkml.kernel.org/r/20170328101403.34a82...@redhat.com

I don't know the scheduler code well enough to fix this, and will have
to rely on others to figure out this scheduler regression.

To make it clear: I'm only seeing this scheduler regression when a
remote host is sending many, many network packets towards the kernel,
which keeps NAPI/softirq busy all the time.  A possible hint: the tool
"top" only shows this in the "si" column, while on v4.10 "top" also blames
"ksoftirqd/N"; plus the cputime (0:00) reported by "ps" seems wrong for ksoftirqd.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH] Revert "mm, page_alloc: only use per-cpu allocator for irq-safe requests"

2017-04-15 Thread Jesper Dangaard Brouer
On Sat, 15 Apr 2017 15:53:50 +0100
Mel Gorman <mgor...@techsingularity.net> wrote:

> This reverts commit 374ad05ab64d696303cec5cc8ec3a65d457b7b1c. While the
> patch worked great for userspace allocations, the fact that softirq loses
> the per-cpu allocator caused problems. It needs to be redone taking into
> account that a separate list is needed for hard/soft IRQs or alternatively
> find a cheap way of detecting reentry due to an interrupt. Both are possible
> but sufficiently tricky that it shouldn't be rushed. Jesper had one method
> for allowing softirqs but reported that the cost was high enough that it
> performed similarly to a plain revert. His figures for netperf TCP_STREAM
> were as follows
> 
> Baseline v4.10.0  : 60316 Mbit/s
> Current 4.11.0-rc6: 47491 Mbit/s
> This patch: 60662 Mbit/s
(should instead state "Jesper's patch" or "His patch")

Ran the same test (8 parallel netperf TCP_STREAMs) with this patch applied:

 This patch 60106 Mbit/s (average of 7 iterations of 60 sec runs)

With these speeds I'm starting to hit the memory bandwidth of my machines.
Thus, the 60 Gbit/s measurement cannot be used to validate the
performance impact of reverting this compared to my softirq patch; it
only shows we fixed the regression.  (I'm suspicious, as I see higher
contention on the page allocator lock (4% vs 1.3%) with this patch and
still the same performance... but let's worry about that outside the rc-series.)

I would be interested in Tariq re-running these benchmarks on hardware
with enough memory bandwidth for 100Gbit/s throughput.


> As this is a regression, I wish to revert to noirq allocator for now and
> go back to the drawing board.
> 
> Signed-off-by: Mel Gorman <mgor...@techsingularity.net>
> Reported-by: Tariq Toukan <ttoukan.li...@gmail.com>

Acked-by: Jesper Dangaard Brouer <bro...@redhat.com>

> ---
>  mm/page_alloc.c | 43 ---
>  1 file changed, 20 insertions(+), 23 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6cbde310abed..3bba4f46214c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1090,10 +1090,10 @@ static void free_pcppages_bulk(struct zone *zone, int 
> count,
>  {
>   int migratetype = 0;
>   int batch_free = 0;
> - unsigned long nr_scanned, flags;
> + unsigned long nr_scanned;
>   bool isolated_pageblocks;
>  
> - spin_lock_irqsave(&zone->lock, flags);
> + spin_lock(&zone->lock);
>   isolated_pageblocks = has_isolate_pageblock(zone);
>   nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
>   if (nr_scanned)
> @@ -1142,7 +1142,7 @@ static void free_pcppages_bulk(struct zone *zone, int 
> count,
>   trace_mm_page_pcpu_drain(page, 0, mt);
>   } while (--count && --batch_free && !list_empty(list));
>   }
> - spin_unlock_irqrestore(&zone->lock, flags);
> + spin_unlock(&zone->lock);
>  }
>  
>  static void free_one_page(struct zone *zone,
> @@ -1150,9 +1150,8 @@ static void free_one_page(struct zone *zone,
>   unsigned int order,
>   int migratetype)
>  {
> - unsigned long nr_scanned, flags;
> - spin_lock_irqsave(&zone->lock, flags);
> - __count_vm_events(PGFREE, 1 << order);
> + unsigned long nr_scanned;
> + spin_lock(&zone->lock);
>   nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
>   if (nr_scanned)
>   __mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, 
> -nr_scanned);
> @@ -1162,7 +1161,7 @@ static void free_one_page(struct zone *zone,
>   migratetype = get_pfnblock_migratetype(page, pfn);
>   }
>   __free_one_page(page, pfn, zone, order, migratetype);
> - spin_unlock_irqrestore(&zone->lock, flags);
> + spin_unlock(&zone->lock);
>  }
>  
>  static void __meminit __init_single_page(struct page *page, unsigned long 
> pfn,
> @@ -1240,6 +1239,7 @@ void __meminit reserve_bootmem_region(phys_addr_t 
> start, phys_addr_t end)
>  
>  static void __free_pages_ok(struct page *page, unsigned int order)
>  {
> + unsigned long flags;
>   int migratetype;
>   unsigned long pfn = page_to_pfn(page);
>  
> @@ -1247,7 +1247,10 @@ static void __free_pages_ok(struct page *page, 
> unsigned int order)
>   return;
>  
>   migratetype = get_pfnblock_migratetype(page, pfn);
> + local_irq_save(flags);
> + __count_vm_events(PGFREE, 1 << order);
>   free_one_page(page_zone(page), page, pfn, order, migratetype);
> + local_irq_restore(flags);
>  }
>  
>  static void __init __free_pages_boot_core(s

Re: [PATCH] mm, page_alloc: re-enable softirq use of per-cpu page allocator

2017-04-14 Thread Jesper Dangaard Brouer
On Mon, 10 Apr 2017 14:26:16 -0700
Andrew Morton <a...@linux-foundation.org> wrote:

> On Mon, 10 Apr 2017 16:08:21 +0100 Mel Gorman <mgor...@techsingularity.net> 
> wrote:
> 
> > IRQ context were excluded from using the Per-Cpu-Pages (PCP) lists caching
> > of order-0 pages in commit 374ad05ab64d ("mm, page_alloc: only use per-cpu
> > allocator for irq-safe requests").
> > 
> > This unfortunately also excluded SoftIRQ.  This hurt the
> > performance
> > for the use-case of refilling DMA RX rings in softirq context.  
> 
> Out of curiosity: by how much did it "hurt"?
>
> 
> 
> Tariq found:
> 
> : I disabled the page-cache (recycle) mechanism to stress the page
> : allocator, and see a drastic degradation in BW, from 47.5 G in v4.10 to
> : 31.4 G in v4.11-rc1 (34% drop).

I've tried to reproduce this in my home testlab, using a ConnectX-4 dual
100Gbit/s NIC. Hardware limits mean that I cannot reach 100Gbit/s once a
memory copy is performed.  (Word of warning: you need PCIe Gen3 x16
(which I do have) to handle 100Gbit/s, and the memory bandwidth of the
system also needs something like 2x 12500 MBytes/s, which is where my
system failed.)
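
(As a back-of-envelope check of that 2x figure, a small editorial sketch,
not from the original mail: 100 Gbit/s on the wire is 12.5 GBytes/s, and
with the packet data crossing the memory bus at least twice, the NIC DMA
write plus the later CPU copy, the memory subsystem needs roughly twice
the wire rate.)

#include <stdio.h>

int main(void)
{
	double wire_gbit   = 100.0;
	double wire_mbytes = wire_gbit * 1000.0 / 8.0;  /* 12500 MBytes/s   */
	double mem_mbytes  = 2.0 * wire_mbytes;         /* DMA write + copy */

	printf("wire: %.0f MBytes/s, memory: ~%.0f MBytes/s\n",
	       wire_mbytes, mem_mbytes);
	return 0;
}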

The mlx5 driver has a driver-local page recycler, which I can see fail
between 29% and 38% of the time with 8 parallel netperf TCP_STREAMs.  I
speculate that adding more streams will make it fail more.  To factor out
the driver recycler, I simply disabled it (as I believe Tariq also did).

With disabled-mlx5-recycler, 8 parallel netperf TCP_STREAMs:

Baseline v4.10.0  : 60316 Mbit/s
Current 4.11.0-rc6: 47491 Mbit/s
This patch: 60662 Mbit/s

While this patch does "fix" the performance regression, it does not
bring any noticeable improvement (as my micro-benchmarks also indicated);
thus I feel our previous optimization is almost nullified.  (P.S. it
does feel wrong to argue against my own patch ;-).)

The reason for the current 4.11.0-rc6 regression is lock congestion on
the (per-NUMA) page allocator lock; perf report shows we spend 34.92% in
queued_spin_lock_slowpath (compared to the #2 entry, a 13.81% copy cost
in copy_user_enhanced_fast_string).


> then with this patch he found
> 
> : It looks very good!  I get line-rate (94Gbits/sec) with 8 streams, in
> : comparison to less than 55Gbits/sec before.
> 
> Can I take this to mean that the page allocator's per-cpu-pages feature
> ended up doubling the performance of this driver?  Better than the
> driver's private page recycling?  I'd like to believe that, but am
> having trouble doing so ;)

I would not conclude that. I'm also very suspicious about such big
performance "jumps".  Tariq should also benchmark with v4.10 and a
disabled mlx5-recycler, as I believe the results should be the same as
after this patch.

That said, it is possible to see a regression this large when all the
CPUs are contending on the page allocator lock.  AFAIK Tariq also
mentioned seeing 60% spent on the lock, which would confirm this theory.

 
> > This patch re-allow softirq context, which should be safe by disabling
> > BH/softirq, while accessing the list.  PCP-lists access from both hard-IRQ
> > and NMI context must not be allowed.  Peter Zijlstra says in_nmi() code
> > never access the page allocator, thus it should be sufficient to only test
> > for !in_irq().
> > 
> > One concern with this change is adding a BH (enable) scheduling point at
> > both PCP alloc and free. If further concerns are highlighted by this patch,
> > the result will be to revert 374ad05ab64d and try again at a later date
> > to offset the irq enable/disable overhead.  

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH] mm, page_alloc: re-enable softirq use of per-cpu page allocator

2017-04-10 Thread Jesper Dangaard Brouer

I would appreciate review of this patch.  My micro-benchmarking shows we
basically return to the same page alloc+free cost as before 374ad05ab64d
("mm, page_alloc: only use per-cpu allocator for irq-safe requests"),
which sort of invalidates this attempt at optimizing the page allocator.
But Mel's micro-benchmarks still show an improvement.

Notice the slowdown comes from me checking irqs_disabled()... if
someone can spot a way to get rid of that, then this patch will be
a win.

I'm traveling out of Montreal today, and will rerun my benchmarks when
I get home.  Tariq will also do some more testing with a 100G NIC (he
also participated in the Montreal conf, so he is likely in transit too).
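
As a side note for readers: the context rule this patch implements boils
down to the small test below (an editorial sketch; the helper name is
made up, and the diff further down open-codes the same test in
free_hot_cold_page() and __rmqueue_pcplist()):

/* softirq may use the pcplists; hard-IRQ may not; NMI never reaches the
 * page allocator; and irqs_disabled() must be excluded too, because the
 * fast path now ends with local_bh_enable().
 */
static inline bool pcp_context_allowed(void)
{
	return !in_irq() && !irqs_disabled();
}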


On Mon, 10 Apr 2017 16:08:21 +0100
Mel Gorman <mgor...@techsingularity.net> wrote:

> From: Jesper Dangaard Brouer <bro...@redhat.com>
> 
> IRQ context were excluded from using the Per-Cpu-Pages (PCP) lists caching
> of order-0 pages in commit 374ad05ab64d ("mm, page_alloc: only use per-cpu
> allocator for irq-safe requests").
> 
> This unfortunately also excluded SoftIRQ.  This hurt the performance
> for the use-case of refilling DMA RX rings in softirq context.
> 
> This patch re-allow softirq context, which should be safe by disabling
> BH/softirq, while accessing the list.  PCP-lists access from both hard-IRQ
> and NMI context must not be allowed.  Peter Zijlstra says in_nmi() code
> never access the page allocator, thus it should be sufficient to only test
> for !in_irq().
> 
> One concern with this change is adding a BH (enable) scheduling point at
> both PCP alloc and free. If further concerns are highlighted by this patch,
> the result will be to revert 374ad05ab64d and try again at a later date
> to offset the irq enable/disable overhead.
> 
> Fixes: 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe 
> requests")
> Signed-off-by: Jesper Dangaard Brouer <bro...@redhat.com>
> Signed-off-by: Mel Gorman <mgor...@techsingularity.net>

Missing:

Reported-by: Tariq [...]

> ---
>  mm/page_alloc.c | 26 +-
>  1 file changed, 17 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6cbde310abed..d7e986967910 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2351,9 +2351,9 @@ static void drain_local_pages_wq(struct work_struct 
> *work)
>* cpu which is allright but we also have to make sure to not move to
>* a different one.
>*/
> - preempt_disable();
> + local_bh_disable();
>   drain_local_pages(NULL);
> - preempt_enable();
> + local_bh_enable();
>  }
>  
>  /*
> @@ -2481,7 +2481,11 @@ void free_hot_cold_page(struct page *page, bool cold)
>   unsigned long pfn = page_to_pfn(page);
>   int migratetype;
>  
> - if (in_interrupt()) {
> + /*
> +  * Exclude (hard) IRQ and NMI context from using the pcplists.
> +  * But allow softirq context, via disabling BH.
> +  */
> + if (in_irq() || irqs_disabled()) {
>   __free_pages_ok(page, 0);
>   return;
>   }
> @@ -2491,7 +2495,7 @@ void free_hot_cold_page(struct page *page, bool cold)
>  
>   migratetype = get_pfnblock_migratetype(page, pfn);
>   set_pcppage_migratetype(page, migratetype);
> - preempt_disable();
> + local_bh_disable();
>  
>   /*
>* We only track unmovable, reclaimable and movable on pcp lists.
> @@ -2522,7 +2526,7 @@ void free_hot_cold_page(struct page *page, bool cold)
>   }
>  
>  out:
> - preempt_enable();
> + local_bh_enable();
>  }
>  
>  /*
> @@ -2647,7 +2651,7 @@ static struct page *__rmqueue_pcplist(struct zone 
> *zone, int migratetype,
>  {
>   struct page *page;
>  
> - VM_BUG_ON(in_interrupt());
> + VM_BUG_ON(in_irq() || irqs_disabled());
>  
>   do {
>   if (list_empty(list)) {
> @@ -2680,7 +2684,7 @@ static struct page *rmqueue_pcplist(struct zone 
> *preferred_zone,
>   bool cold = ((gfp_flags & __GFP_COLD) != 0);
>   struct page *page;
>  
> - preempt_disable();
> + local_bh_disable();
>   pcp = &this_cpu_ptr(zone->pageset)->pcp;
>   list = &pcp->lists[migratetype];
>   page = __rmqueue_pcplist(zone,  migratetype, cold, pcp, list);
> @@ -2688,7 +2692,7 @@ static struct page *rmqueue_pcplist(struct zone 
> *preferred_zone,
>   __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
>   zone_statistics(preferred_zone, zone);
>   }
> - preempt_enable();
> + local_bh_enable();
>   return page;
>  }
>  
> @@ 

Re: [PATCH 1/3] ptr_ring: batch ring zeroing

2017-04-08 Thread Jesper Dangaard Brouer
> +  * producer won't make progress and touch other cache lines
> +  * besides the first one until we write out all entries.
> +  */
> + while (likely(head >= r->consumer_tail))
> + r->queue[head--] = NULL;
> + r->consumer_tail = r->consumer_head;
> + }
> + if (unlikely(r->consumer_head >= r->size)) {
> + r->consumer_head = 0;
> + r->consumer_tail = 0;
> +     }
>  }

I love this idea.  I reviewed and discussed it in person with MST
during netdevconf[1] on this laptop.  I promised I will also run it
through my micro-benchmarking[2] once I return home (hint: ptr_ring gets
used in the network stack as skb_array).

Reviewed-by: Jesper Dangaard Brouer <bro...@redhat.com>

[1] http://netdevconf.org/2.1/
[2] 
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: in_irq_or_nmi() and RFC patch

2017-03-30 Thread Jesper Dangaard Brouer
On Thu, 30 Mar 2017 14:04:36 +0100
Mel Gorman <mgor...@techsingularity.net> wrote:

> On Wed, Mar 29, 2017 at 09:44:41PM +0200, Jesper Dangaard Brouer wrote:
> > > Regardless or using in_irq() (or in combi with in_nmi()) I get the
> > > following warning below:
> > > 
> > > [0.00] Kernel command line: 
> > > BOOT_IMAGE=/vmlinuz-4.11.0-rc3-net-next-page-alloc-softirq+ 
> > > root=UUID=2e8451ff-6797-49b5-8d3a-eed5a42d7dc9 ro rhgb quiet 
> > > LANG=en_DK.UTF
> > > -8
> > > [0.00] PID hash table entries: 4096 (order: 3, 32768 bytes)
> > > [0.00] [ cut here ]
> > > [0.00] WARNING: CPU: 0 PID: 0 at kernel/softirq.c:161 
> > > __local_bh_enable_ip+0x70/0x90
> > > [0.00] Modules linked in:
> > > [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 
> > > 4.11.0-rc3-net-next-page-alloc-softirq+ #235
> > > [0.00] Hardware name: MSI MS-7984/Z170A GAMING PRO (MS-7984), 
> > > BIOS 1.60 12/16/2015
> > > [0.00] Call Trace:
> > > [0.00]  dump_stack+0x4f/0x73
> > > [0.00]  __warn+0xcb/0xf0
> > > [0.00]  warn_slowpath_null+0x1d/0x20
> > > [0.00]  __local_bh_enable_ip+0x70/0x90
> > > [0.00]  free_hot_cold_page+0x1a4/0x2f0
> > > [0.00]  __free_pages+0x1f/0x30
> > > [0.00]  __free_pages_bootmem+0xab/0xb8
> > > [0.00]  __free_memory_core+0x79/0x91
> > > [0.00]  free_all_bootmem+0xaa/0x122
> > > [0.00]  mem_init+0x71/0xa4
> > > [0.00]  start_kernel+0x1e5/0x3f1
> > > [0.00]  x86_64_start_reservations+0x2a/0x2c
> > > [0.00]  x86_64_start_kernel+0x178/0x18b
> > > [0.00]  start_cpu+0x14/0x14
> > > [0.00]  ? start_cpu+0x14/0x14
> > > [0.00] ---[ end trace a57944bec8fc985c ]---
> > > [0.00] Memory: 32739472K/33439416K available (7624K kernel code, 
> > > 1528K rwdata, 3168K rodata, 1860K init, 2260K bss, 699944K reserved, 0K 
> > > cma-reserved)
> > > 
> > > And kernel/softirq.c:161 contains:
> > > 
> > >  WARN_ON_ONCE(in_irq() || irqs_disabled());
> > > 
> > > Thus, I don't think the change in my RFC-patch[1] is safe.
> > > Of changing[2] to support softirq allocations by replacing
> > > preempt_disable() with local_bh_disable().
> > > 
> > > [1] http://lkml.kernel.org/r/20170327143947.4c237...@redhat.com
> > > 
> > > [2] commit 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for 
> > > irq-safe requests")
> > >  https://git.kernel.org/torvalds/c/374ad05ab64d  
> > 
> > A patch that avoids the above warning is inlined below, but I'm not
> > sure if this is best direction.  Or we should rather consider reverting
> > part of commit 374ad05ab64d to avoid the softirq performance regression?
> >
> 
> At the moment, I'm not seeing a better alternative. If this works, I
> think it would still be far superior in terms of performance than a
> revert. 

Started performance benchmarking:
 163 cycles = current state
 183 cycles = with BH disable + in_irq
 218 cycles = with BH disable + in_irq + irqs_disabled

Thus, the performance numbers unfortunately look bad once we add the
test for irqs_disabled().  The slowdown from replacing preempt_disable
with BH-disable is still a win (we saved 29 cycles before and lose
20; I was expecting the regression to be only 10 cycles).

Bad things happen when adding the test for irqs_disabled().  This
likely happens because it uses "pushfq + pop" to read the CPU flags.  I
wonder if x86 experts know whether e.g. using "lahf" would be faster (and
whether it also loads the interrupt flag X86_EFLAGS_IF)?
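
For reference, a hedged sketch of what that flags read boils down to on
x86-64; it mirrors the kernel's native_save_fl(), but treat the exact
code below as an illustration rather than a quote of the source:

static inline unsigned long sketch_save_fl(void)
{
	unsigned long flags;

	/* pushf stores EFLAGS on the stack, pop reads it back: a store
	 * plus a load just to test a single bit.
	 */
	asm volatile("pushf ; pop %0"
		     : "=rm" (flags)
		     : /* no input */
		     : "memory");
	return flags;
}

#define SKETCH_X86_EFLAGS_IF	0x200	/* Interrupt Flag (bit 9) */

static inline int sketch_irqs_disabled(void)
{
	return !(sketch_save_fl() & SKETCH_X86_EFLAGS_IF);
}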

We basically lost more (163-218=-55) than we gained (29) :-(


> As before, if there are bad consequences to adding a BH
> rescheduling point then we'll have to revert. However, I don't like a
> revert being the first option as it'll keep encouraging drivers to build
> sub-allocators to avoid the page allocator.

I'm also motivated by speeding up the page allocator to avoid this
happening in all the drivers.

> > [PATCH] mm, page_alloc: re-enable softirq use of per-cpu page allocator
> > 
> > From: Jesper Dangaard Brouer <bro...@redhat.com>
> >   
> 
> Other than the slightly misleading comments about NMI which could
> explain "this potentially misses an NMI but an NMI allocating pages is
> brain damaged", I don't see a problem. The irqs_disabled() check is a
> subtle but it's not earth shattering and it still helps the 100GiB cases
> with the limited cycle budget to process packets.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: in_irq_or_nmi() and RFC patch

2017-03-30 Thread Jesper Dangaard Brouer
On Thu, 30 Mar 2017 09:35:02 +0200
Peter Zijlstra <pet...@infradead.org> wrote:

> On Thu, Mar 30, 2017 at 09:12:23AM +0200, Jesper Dangaard Brouer wrote:
> > On Thu, 30 Mar 2017 08:49:58 +0200
> > Peter Zijlstra <pet...@infradead.org> wrote:
> >   
> > > On Wed, Mar 29, 2017 at 09:44:41PM +0200, Jesper Dangaard Brouer wrote:  
> > > > @@ -2481,7 +2481,11 @@ void free_hot_cold_page(struct page *page, bool 
> > > > cold)
> > > > unsigned long pfn = page_to_pfn(page);
> > > > int migratetype;
> > > >  
> > > > -   if (in_interrupt()) {
> > > > +   /*
> > > > +* Exclude (hard) IRQ and NMI context from using the pcplists.
> > > > +* But allow softirq context, via disabling BH.
> > > > +*/
> > > > +   if (in_irq() || irqs_disabled()) {
> > > 
> > > Why do you need irqs_disabled() ?   
> > 
> > Because further down I call local_bh_enable(), which calls
> > __local_bh_enable_ip() which triggers a warning during early boot on:
> > 
> >   WARN_ON_ONCE(in_irq() || irqs_disabled());
> > 
> > It looks like it is for supporting CONFIG_TRACE_IRQFLAGS.  
> 
> Ah, no. Its because when you do things like:
> 
>   local_irq_disable();
>   local_bh_enable();
>   local_irq_enable();
> 
> you can lose a pending softirq.
> 
> Bugger.. that irqs_disabled() is something we could do without.

Yes, I really don't like adding this irqs_disabled() check here.

> I'm thinking that when tglx finishes his soft irq disable patches for
> x86 (same thing ppc also does) we can go revert all these patches.
> 
> Thomas, see:
> 
>   https://lkml.kernel.org/r/20170301144845.783f8...@redhat.com

The summary is that Mel and I found a way to optimize the page allocator
by avoiding a local_irq_{save,restore} operation; see commit
374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe
requests")  [1] https://git.kernel.org/davem/net-next/c/374ad05ab64d696

But Tariq discovered that this caused a regression for 100Gbit/s NICs,
as the patch excluded softirq from using the per-cpu-page (PCP) lists,
and DMA RX page-refill happens in softirq context.

Now we are trying to re-enable allowing softirq to use the PCP.
My proposal is: https://lkml.kernel.org/r/20170329214441.08332...@redhat.com
The alternative is to revert this optimization.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: in_irq_or_nmi() and RFC patch

2017-03-30 Thread Jesper Dangaard Brouer
On Thu, 30 Mar 2017 08:49:58 +0200
Peter Zijlstra <pet...@infradead.org> wrote:

> On Wed, Mar 29, 2017 at 09:44:41PM +0200, Jesper Dangaard Brouer wrote:
> > @@ -2481,7 +2481,11 @@ void free_hot_cold_page(struct page *page, bool cold)
> > unsigned long pfn = page_to_pfn(page);
> > int migratetype;
> >  
> > -   if (in_interrupt()) {
> > +   /*
> > +* Exclude (hard) IRQ and NMI context from using the pcplists.
> > +* But allow softirq context, via disabling BH.
> > +*/
> > +   if (in_irq() || irqs_disabled()) {  
> 
> Why do you need irqs_disabled() ? 

Because further down I call local_bh_enable(), which calls
__local_bh_enable_ip() which triggers a warning during early boot on:

  WARN_ON_ONCE(in_irq() || irqs_disabled());

It looks like it is for supporting CONFIG_TRACE_IRQFLAGS.


> Also, your comment is stale, it still refers to NMI context.

True. As you told me, the NMI case is implicit, since it cannot occur.

> > __free_pages_ok(page, 0);
> > return;
> > }  

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

