Re: [RFC v3 06/22] landlock: Add LSM hooks

2016-10-19 Thread Thomas Graf
On 09/14/16 at 09:23am, Mickaël Salaün wrote:
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 9aa01d9d3d80..36c3e482239c 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -85,6 +85,8 @@ enum bpf_arg_type {
>  
>   ARG_PTR_TO_CTX, /* pointer to context */
>   ARG_ANYTHING,   /* any (initialized) argument is ok */
> +
> + ARG_PTR_TO_STRUCT_FILE, /* pointer to struct file */

This should go into patch 7 I guess?

> +void __init landlock_add_hooks(void)
> +{
> + pr_info("landlock: Becoming ready for sandboxing\n");
> + security_add_hooks(landlock_hooks, ARRAY_SIZE(landlock_hooks));
> +}

Can we add the hooks when we load the first BPF program for a hook? That
would also allow to not make this conditional on a new config option
which all all distros have to enable anyway.

I would really like to see this patch split into the LSM part which
allows running BPF progs at LSM and your specific sandboxing use case
which requires the new BPF helpers, new reg type, etc.


Re: [RFC v3 06/22] landlock: Add LSM hooks

2016-10-19 Thread Thomas Graf
On 09/14/16 at 09:23am, Mickaël Salaün wrote:
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 9aa01d9d3d80..36c3e482239c 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -85,6 +85,8 @@ enum bpf_arg_type {
>  
>   ARG_PTR_TO_CTX, /* pointer to context */
>   ARG_ANYTHING,   /* any (initialized) argument is ok */
> +
> + ARG_PTR_TO_STRUCT_FILE, /* pointer to struct file */

This should go into patch 7 I guess?

> +void __init landlock_add_hooks(void)
> +{
> + pr_info("landlock: Becoming ready for sandboxing\n");
> + security_add_hooks(landlock_hooks, ARRAY_SIZE(landlock_hooks));
> +}

Can we add the hooks when we load the first BPF program for a hook? That
would also allow to not make this conditional on a new config option
which all all distros have to enable anyway.

I would really like to see this patch split into the LSM part which
allows running BPF progs at LSM and your specific sandboxing use case
which requires the new BPF helpers, new reg type, etc.


Re: [RFC v3 05/22] bpf,landlock: Add eBPF program subtype and is_valid_subtype() verifier

2016-10-19 Thread Thomas Graf
On 09/14/16 at 09:23am, Mickaël Salaün wrote:
> @@ -155,6 +163,7 @@ union bpf_attr {
>   __u32   log_size;   /* size of user buffer */
>   __aligned_u64   log_buf;/* user supplied buffer */
>   __u32   kern_version;   /* checked when 
> prog_type=kprobe */
> + union bpf_prog_subtype prog_subtype;/* checked when 
> prog_type=landlock */

The comment seems bogus, this is not landlock specific.


Re: [RFC v3 05/22] bpf,landlock: Add eBPF program subtype and is_valid_subtype() verifier

2016-10-19 Thread Thomas Graf
On 09/14/16 at 09:23am, Mickaël Salaün wrote:
> @@ -155,6 +163,7 @@ union bpf_attr {
>   __u32   log_size;   /* size of user buffer */
>   __aligned_u64   log_buf;/* user supplied buffer */
>   __u32   kern_version;   /* checked when 
> prog_type=kprobe */
> + union bpf_prog_subtype prog_subtype;/* checked when 
> prog_type=landlock */

The comment seems bogus, this is not landlock specific.


Re: [RFC v3 04/22] bpf: Set register type according to is_valid_access()

2016-10-19 Thread Thomas Graf
On 09/14/16 at 09:23am, Mickaël Salaün wrote:
> This fix a pointer leak when an unprivileged eBPF program read a pointer
> value from the context. Even if is_valid_access() returns a pointer
> type, the eBPF verifier replace it with UNKNOWN_VALUE. The register
> value containing an address is then allowed to leak. Moreover, this
> prevented unprivileged eBPF programs to use functions with (legitimate)
> pointer arguments.
> 
> This bug was not a problem until now because the only unprivileged eBPF
> program allowed is of type BPF_PROG_TYPE_SOCKET_FILTER and all the types
> from its context are UNKNOWN_VALUE.
> 
> Signed-off-by: Mickaël Salaün 
> Fixes: 969bf05eb3ce ("bpf: direct packet access")
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 

Can you post this fix separately? It's valid and needed outside of the
scope of this series.


Re: [RFC v3 04/22] bpf: Set register type according to is_valid_access()

2016-10-19 Thread Thomas Graf
On 09/14/16 at 09:23am, Mickaël Salaün wrote:
> This fix a pointer leak when an unprivileged eBPF program read a pointer
> value from the context. Even if is_valid_access() returns a pointer
> type, the eBPF verifier replace it with UNKNOWN_VALUE. The register
> value containing an address is then allowed to leak. Moreover, this
> prevented unprivileged eBPF programs to use functions with (legitimate)
> pointer arguments.
> 
> This bug was not a problem until now because the only unprivileged eBPF
> program allowed is of type BPF_PROG_TYPE_SOCKET_FILTER and all the types
> from its context are UNKNOWN_VALUE.
> 
> Signed-off-by: Mickaël Salaün 
> Fixes: 969bf05eb3ce ("bpf: direct packet access")
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 

Can you post this fix separately? It's valid and needed outside of the
scope of this series.


Re: [RFC PATCH v2] net: sched: convert qdisc linked list to hashtable

2016-07-08 Thread Thomas Graf
On 07/07/16 at 10:36pm, Jiri Kosina wrote:
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index f45929c..630838e 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -52,6 +52,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  struct netpoll_info;
>  struct device;
> @@ -1778,6 +1779,7 @@ struct net_device {
>   unsigned intnum_tx_queues;
>   unsigned intreal_num_tx_queues;
>   struct Qdisc*qdisc;
> + DECLARE_HASHTABLE   (qdisc_hash, 16);

This blows up net_device to an insane size: 64K * sizeof(struct
hlist_head). Can we allocate this on demand for net_devices where
it is actually needed? The majority of virtual devices won't need
this. Doesn't have to be rhashtable, can still be fixed size but
at least allocate it.


Re: [RFC PATCH v2] net: sched: convert qdisc linked list to hashtable

2016-07-08 Thread Thomas Graf
On 07/07/16 at 10:36pm, Jiri Kosina wrote:
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index f45929c..630838e 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -52,6 +52,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  struct netpoll_info;
>  struct device;
> @@ -1778,6 +1779,7 @@ struct net_device {
>   unsigned intnum_tx_queues;
>   unsigned intreal_num_tx_queues;
>   struct Qdisc*qdisc;
> + DECLARE_HASHTABLE   (qdisc_hash, 16);

This blows up net_device to an insane size: 64K * sizeof(struct
hlist_head). Can we allocate this on demand for net_devices where
it is actually needed? The majority of virtual devices won't need
this. Doesn't have to be rhashtable, can still be fixed size but
at least allocate it.


Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

2015-12-08 Thread Thomas Graf
On 12/09/15 at 10:38am, Herbert Xu wrote:
> On Wed, Dec 09, 2015 at 03:36:32AM +0100, Thomas Graf wrote:
> > 
> > Without knowing your exact implementation plans: introducing an
> > additional reference indirection for every lookup will have a
> > huge performance penalty as well.
> > 
> > Is your plan to only introduce the master table after an
> > allocation has failed?
> 
> Right, obviously the extra indirections would only come into play
> after a failed allocation.  As soon as we can run the worker thread
> it'll try to remove the extra indirections by doing vmalloc.

OK, this sounds like a good compromise. The penalty is isolated
for the duration of the atomic burst.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

2015-12-08 Thread Thomas Graf
On 12/09/15 at 10:24am, Herbert Xu wrote:
> On Wed, Dec 09, 2015 at 03:18:26AM +0100, Thomas Graf wrote:
> > 
> > Assuming that we only encounter this scenario with very large
> > table sizes, it might be OK to assume that deferring the actual
> > resize via the worker thread while continuing to insert above
> > 100% utilization in atomic context is safe.
> 
> As test_rhashtable has demonstrated already this approach doesn't
> work.  There is nothing in the kernel that will ensure that the
> worker thread gets to run at all.

If we define work assuming that an insertion in atomic context
should never fail then yes.  I'm not sure you can guarantee that
with a segmented table either though. I agree though that the
insertion behaviour is much better defined.

My argument is that if we are in a situation in which a worker
thread is never invoked and we've grown 2x from the original
table size, do we still need entries to be inserted into the
table or can we fail?

Without knowing your exact implementation plans: introducing an
additional reference indirection for every lookup will have a
huge performance penalty as well.

Is your plan to only introduce the master table after an
allocation has failed?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

2015-12-08 Thread Thomas Graf

On 12/05/15 at 03:06pm, Herbert Xu wrote:
> Unless we can make __vmalloc work with BH disabled, I guess we'll
> have to go back to multi-level lookups unless someone has a better
> suggestion.

Assuming that we only encounter this scenario with very large
table sizes, it might be OK to assume that deferring the actual
resize via the worker thread while continuing to insert above
100% utilization in atomic context is safe.

On 12/07/15 at 02:29pm, David Miller wrote:
> You can't issue the cross-cpu TLB flushes from atomic contexts.
> It's the kernel page table updates that create the restriction.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

2015-12-08 Thread Thomas Graf
On 12/09/15 at 10:24am, Herbert Xu wrote:
> On Wed, Dec 09, 2015 at 03:18:26AM +0100, Thomas Graf wrote:
> > 
> > Assuming that we only encounter this scenario with very large
> > table sizes, it might be OK to assume that deferring the actual
> > resize via the worker thread while continuing to insert above
> > 100% utilization in atomic context is safe.
> 
> As test_rhashtable has demonstrated already this approach doesn't
> work.  There is nothing in the kernel that will ensure that the
> worker thread gets to run at all.

If we define work assuming that an insertion in atomic context
should never fail then yes.  I'm not sure you can guarantee that
with a segmented table either though. I agree though that the
insertion behaviour is much better defined.

My argument is that if we are in a situation in which a worker
thread is never invoked and we've grown 2x from the original
table size, do we still need entries to be inserted into the
table or can we fail?

Without knowing your exact implementation plans: introducing an
additional reference indirection for every lookup will have a
huge performance penalty as well.

Is your plan to only introduce the master table after an
allocation has failed?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

2015-12-08 Thread Thomas Graf

On 12/05/15 at 03:06pm, Herbert Xu wrote:
> Unless we can make __vmalloc work with BH disabled, I guess we'll
> have to go back to multi-level lookups unless someone has a better
> suggestion.

Assuming that we only encounter this scenario with very large
table sizes, it might be OK to assume that deferring the actual
resize via the worker thread while continuing to insert above
100% utilization in atomic context is safe.

On 12/07/15 at 02:29pm, David Miller wrote:
> You can't issue the cross-cpu TLB flushes from atomic contexts.
> It's the kernel page table updates that create the restriction.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

2015-12-08 Thread Thomas Graf
On 12/09/15 at 10:38am, Herbert Xu wrote:
> On Wed, Dec 09, 2015 at 03:36:32AM +0100, Thomas Graf wrote:
> > 
> > Without knowing your exact implementation plans: introducing an
> > additional reference indirection for every lookup will have a
> > huge performance penalty as well.
> > 
> > Is your plan to only introduce the master table after an
> > allocation has failed?
> 
> Right, obviously the extra indirections would only come into play
> after a failed allocation.  As soon as we can run the worker thread
> it'll try to remove the extra indirections by doing vmalloc.

OK, this sounds like a good compromise. The penalty is isolated
for the duration of the atomic burst.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

2015-12-07 Thread Thomas Graf
On 12/05/15 at 03:06pm, Herbert Xu wrote:
> On Fri, Dec 04, 2015 at 07:15:55PM +0100, Phil Sutter wrote:
> >
> > > Only one should really do this, while others are waiting.
> > 
> > Sure, that was my previous understanding of how this thing works.
> 
> Yes that's clearly how it should be.  Unfortunately while adding
> the locking to do this, I found out that you can't actually call
> __vmalloc with BH disabled so this is a no-go.
> 
> Unless we can make __vmalloc work with BH disabled, I guess we'll
> have to go back to multi-level lookups unless someone has a better
> suggestion.

Thanks for fixing the race.

As for the remaining problem, I think we'll have to find a way to
serve a hard pounding user if we want to convert TCP hashtables
later on.

Did you look into what __vmalloc prevents to work with BH disabled?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

2015-12-07 Thread Thomas Graf
On 12/05/15 at 03:06pm, Herbert Xu wrote:
> On Fri, Dec 04, 2015 at 07:15:55PM +0100, Phil Sutter wrote:
> >
> > > Only one should really do this, while others are waiting.
> > 
> > Sure, that was my previous understanding of how this thing works.
> 
> Yes that's clearly how it should be.  Unfortunately while adding
> the locking to do this, I found out that you can't actually call
> __vmalloc with BH disabled so this is a no-go.
> 
> Unless we can make __vmalloc work with BH disabled, I guess we'll
> have to go back to multi-level lookups unless someone has a better
> suggestion.

Thanks for fixing the race.

As for the remaining problem, I think we'll have to find a way to
serve a hard pounding user if we want to convert TCP hashtables
later on.

Did you look into what __vmalloc prevents to work with BH disabled?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-21 Thread Thomas Graf
On 10/21/15 at 05:17pm, Daniel Borkmann wrote:
> On 10/20/2015 08:56 PM, Eric W. Biederman wrote:
> ...
> >Just FYI:  Using a device for this kind of interface is pretty
> >much a non-starter as that quickly gets you into situations where
> >things do not work in containers.  If someone gets a version of device
> >namespaces past GregKH it might be up for discussion to use character
> >devices.
> 
> Okay, you are referring to this discussion here:
> 
>   http://thread.gmane.org/gmane.linux.kernel.containers/26760
> 
> What had been mentioned earlier in this thread was to have a namespace
> pass-through facility enforced by device cgroups we have in the kernel,
> which is one out of various means used to enforce policy today by
> deployment systems such as docker, for example. But more below.
> 
> I think this all depends on the kind of expectations we have, where all
> this is going. In the original proposal, it was agreed to have the
> operation that creates a node as 'capable(CAP_SYS_ADMIN)'-only (in the
> way like most of the rest of eBPF is restricted), and based on the use
> case we distribute such objects to unprivileged applications. But I
> understand that it seems the trend lately to lift eBPF restrictions at
> some point anyway, and thus the CAP_SYS_ADMIN is suddenly irrelevant
> again. Fair enough.
> 
> Don't get me wrong, I really don't mind if it will be some version of
> this fs patch or whatever architecture else we find consensus on, I
> think this discussion is merely trying to evaluate/discuss on what seems
> to be a good fit, also in terms of future requirements and integration.
> 
> So far, during this discussion, it was proposed to modify the file system
> to a single-mount one and to stick this under /sys/kernel/bpf/. This
> will not have "real" namespace support either, but it was proposed to
> have a following structure:
> 
>   /sys/kernel/bpf/username//progX

This would probably work as you would typically map the ebpf map
using -v like this to give a stable path:

docker run -v /sys/kernel/bpf/foo/maps/progX:/map proX
 
> So, the file system will have kind of a user home-directory for each user
> to isolate through permissions, if I understood correctly.
> 
> If we really want to go this route, then I think there are no big stones
> in the way for the other model either. It should look roughly drafted like
> the below.
> 
> Together with device cgroups for containers, it would allow scenarios where
> you can have:
> 
>   * eBPF (map/prog) device pass-through so a map/prog could even be shared out
> from the initial namespace into individual ones/all (one could possibly
> extend such maps as read-only for these consumers).
>   * eBPF device creation for unprivileged users with permissions being set
> accordingly (as in fs case).
>   * Since cgroup controller can also do wildcards on major/minors, we could
> make that further fine-grained.
>   * eBPF device creation can also be enforced by the cgroup controller to be
> entirely disallowed for a specific container.
> 
> (An admin can determine the dynamically created major f.e. under 
> /proc/devices.)

I've read the discussion passively and my take away is that, frankly,
I think the differences are somewhat minor. Both architectures can
scale to what we need. Both will do the job. I'm slightly worried about
exposing uAPI as a FS, I think that didn't work too well for sysfs. It's
pretty much a define the format once and never touch it again kind of
deal.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-21 Thread Thomas Graf
On 10/21/15 at 05:17pm, Daniel Borkmann wrote:
> On 10/20/2015 08:56 PM, Eric W. Biederman wrote:
> ...
> >Just FYI:  Using a device for this kind of interface is pretty
> >much a non-starter as that quickly gets you into situations where
> >things do not work in containers.  If someone gets a version of device
> >namespaces past GregKH it might be up for discussion to use character
> >devices.
> 
> Okay, you are referring to this discussion here:
> 
>   http://thread.gmane.org/gmane.linux.kernel.containers/26760
> 
> What had been mentioned earlier in this thread was to have a namespace
> pass-through facility enforced by device cgroups we have in the kernel,
> which is one out of various means used to enforce policy today by
> deployment systems such as docker, for example. But more below.
> 
> I think this all depends on the kind of expectations we have, where all
> this is going. In the original proposal, it was agreed to have the
> operation that creates a node as 'capable(CAP_SYS_ADMIN)'-only (in the
> way like most of the rest of eBPF is restricted), and based on the use
> case we distribute such objects to unprivileged applications. But I
> understand that it seems the trend lately to lift eBPF restrictions at
> some point anyway, and thus the CAP_SYS_ADMIN is suddenly irrelevant
> again. Fair enough.
> 
> Don't get me wrong, I really don't mind if it will be some version of
> this fs patch or whatever architecture else we find consensus on, I
> think this discussion is merely trying to evaluate/discuss on what seems
> to be a good fit, also in terms of future requirements and integration.
> 
> So far, during this discussion, it was proposed to modify the file system
> to a single-mount one and to stick this under /sys/kernel/bpf/. This
> will not have "real" namespace support either, but it was proposed to
> have a following structure:
> 
>   /sys/kernel/bpf/username//progX

This would probably work as you would typically map the ebpf map
using -v like this to give a stable path:

docker run -v /sys/kernel/bpf/foo/maps/progX:/map proX
 
> So, the file system will have kind of a user home-directory for each user
> to isolate through permissions, if I understood correctly.
> 
> If we really want to go this route, then I think there are no big stones
> in the way for the other model either. It should look roughly drafted like
> the below.
> 
> Together with device cgroups for containers, it would allow scenarios where
> you can have:
> 
>   * eBPF (map/prog) device pass-through so a map/prog could even be shared out
> from the initial namespace into individual ones/all (one could possibly
> extend such maps as read-only for these consumers).
>   * eBPF device creation for unprivileged users with permissions being set
> accordingly (as in fs case).
>   * Since cgroup controller can also do wildcards on major/minors, we could
> make that further fine-grained.
>   * eBPF device creation can also be enforced by the cgroup controller to be
> entirely disallowed for a specific container.
> 
> (An admin can determine the dynamically created major f.e. under 
> /proc/devices.)

I've read the discussion passively and my take away is that, frankly,
I think the differences are somewhat minor. Both architectures can
scale to what we need. Both will do the job. I'm slightly worried about
exposing uAPI as a FS, I think that didn't work too well for sysfs. It's
pretty much a define the format once and never touch it again kind of
deal.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-16 Thread Thomas Graf
On 10/16/15 at 10:32am, Alexei Starovoitov wrote:
> On 10/16/15 9:43 AM, Hannes Frederic Sowa wrote:
> >Oh, tracing does not allow daemons. Why? I can only imagine embedded
> >users, no?
> 
> yes and for networking: restartability and HA.
> cannot really do that with fuse/daemons.

Right, the smaller the footprint, the better.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-16 Thread Thomas Graf
On 10/16/15 at 10:32am, Alexei Starovoitov wrote:
> On 10/16/15 9:43 AM, Hannes Frederic Sowa wrote:
> >Oh, tracing does not allow daemons. Why? I can only imagine embedded
> >users, no?
> 
> yes and for networking: restartability and HA.
> cannot really do that with fuse/daemons.

Right, the smaller the footprint, the better.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 1/3] bpf: enable non-root eBPF programs

2015-10-09 Thread Thomas Graf
On 10/08/15 at 08:20pm, Hannes Frederic Sowa wrote:
> Hi Alexei,
> 
> On Thu, Oct 8, 2015, at 07:23, Alexei Starovoitov wrote:
> > The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
> > This toggle defaults to off (0), but can be set true (1).  Once true,
> > bpf programs and maps cannot be accessed from unprivileged process,
> > and the toggle cannot be set back to false.
> 
> This approach seems fine to me.
> 
> I am wondering if it makes sense to somehow allow ebpf access per
> namespace? I currently have no idea how that could work and on which
> namespace type to depend or going with a prctl or even cgroup maybe. The
> rationale behind this is, that maybe some namespaces like openstack
> router namespaces could make usage of advanced ebpf capabilities in the
> kernel, while other namespaces, especially where untrusted third parties
> are hosted, shouldn't have access to those facilities.
> 
> In that way, hosters would be able to e.g. deploy more efficient
> performance monitoring container (which should still need not to run as
> root) while the majority of the users has no access to that. Or think
> about routing instances in some namespaces, etc. etc.

The standard way of granting privileges like this for containers is
through CAP_ which does seem like a good fit for this as well and would
also solve your mentioned openstack use case.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 net-next 1/3] bpf: enable non-root eBPF programs

2015-10-09 Thread Thomas Graf
On 10/08/15 at 08:20pm, Hannes Frederic Sowa wrote:
> Hi Alexei,
> 
> On Thu, Oct 8, 2015, at 07:23, Alexei Starovoitov wrote:
> > The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
> > This toggle defaults to off (0), but can be set true (1).  Once true,
> > bpf programs and maps cannot be accessed from unprivileged process,
> > and the toggle cannot be set back to false.
> 
> This approach seems fine to me.
> 
> I am wondering if it makes sense to somehow allow ebpf access per
> namespace? I currently have no idea how that could work and on which
> namespace type to depend or going with a prctl or even cgroup maybe. The
> rationale behind this is, that maybe some namespaces like openstack
> router namespaces could make usage of advanced ebpf capabilities in the
> kernel, while other namespaces, especially where untrusted third parties
> are hosted, shouldn't have access to those facilities.
> 
> In that way, hosters would be able to e.g. deploy more efficient
> performance monitoring container (which should still need not to run as
> root) while the majority of the users has no access to that. Or think
> about routing instances in some namespaces, etc. etc.

The standard way of granting privileges like this for containers is
through CAP_ which does seem like a good fit for this as well and would
also solve your mentioned openstack use case.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] lib: fix data race in rhashtable_rehash_one

2015-09-22 Thread Thomas Graf
On 09/22/15 at 10:51am, Dmitry Vyukov wrote:
> rhashtable_rehash_one() uses complex logic to update entry->next field,
> after INIT_RHT_NULLS_HEAD and NULLS_MARKER expansion:
> 
> entry->next = 1 | ((base + off) << 1)
> 
> This can be compiled along the lines of:
> 
> entry->next = base + off
> entry->next <<= 1
> entry->next |= 1
> 
> Which will break concurrent readers.
> 
> NULLS value recomputation is not needed here, so just remove
> the complex logic.
> 
> The data race was found with KernelThreadSanitizer (KTSAN).
> 
> Signed-off-by: Dmitry Vyukov 

Acked-by: Thomas Graf 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] lib: fix data race in rhashtable_rehash_one

2015-09-22 Thread Thomas Graf
On 09/21/15 at 04:03pm, Eric Dumazet wrote:
> What I said is :
> 
> In @head you already have the correct nulls value, from hash table.
> 
> You do not need to recompute this value, and/or test if hash table chain
> is empty.
> 
> If hash bucket is empty, it contains the appropriate NULLS value.
> 
> If you are paranoiac add this debugging check :
> 
> if (rht_is_a_nulls(head))
> BUG_ON(head != (struct rhash_head *)rht_marker(ht, new_hash));
> 
> 
> Therefore, simply fix the bug and unnecessary code with :

You are absolutely right Eric. Do you want to revise your patch Dmitry?
Eric's proposed fix absolutely the best way to fix this.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] lib: fix data race in rhashtable_rehash_one

2015-09-22 Thread Thomas Graf
On 09/21/15 at 04:03pm, Eric Dumazet wrote:
> What I said is :
> 
> In @head you already have the correct nulls value, from hash table.
> 
> You do not need to recompute this value, and/or test if hash table chain
> is empty.
> 
> If hash bucket is empty, it contains the appropriate NULLS value.
> 
> If you are paranoiac add this debugging check :
> 
> if (rht_is_a_nulls(head))
> BUG_ON(head != (struct rhash_head *)rht_marker(ht, new_hash));
> 
> 
> Therefore, simply fix the bug and unnecessary code with :

You are absolutely right Eric. Do you want to revise your patch Dmitry?
Eric's proposed fix absolutely the best way to fix this.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] lib: fix data race in rhashtable_rehash_one

2015-09-22 Thread Thomas Graf
On 09/22/15 at 10:51am, Dmitry Vyukov wrote:
> rhashtable_rehash_one() uses complex logic to update entry->next field,
> after INIT_RHT_NULLS_HEAD and NULLS_MARKER expansion:
> 
> entry->next = 1 | ((base + off) << 1)
> 
> This can be compiled along the lines of:
> 
> entry->next = base + off
> entry->next <<= 1
> entry->next |= 1
> 
> Which will break concurrent readers.
> 
> NULLS value recomputation is not needed here, so just remove
> the complex logic.
> 
> The data race was found with KernelThreadSanitizer (KTSAN).
> 
> Signed-off-by: Dmitry Vyukov <dvyu...@google.com>

Acked-by: Thomas Graf <tg...@suug.ch>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] lib: fix data race in rhashtable_rehash_one

2015-09-21 Thread Thomas Graf
On 09/21/15 at 07:51am, Eric Dumazet wrote:
> The important part here is that we rehash an item, so we need to make
> sure to maintain consistent ->next field, and need to prevent compiler
> from using ->next as a temporary variable.
> 
> ptr->next = 1UL | ((base + offset) << 1);
> 
> Is dangerous because compiler could issue :
> 
> ptr->next = (base + offset);
> 
> ptr->next <<= 1;
> 
> ptr->next += 1UL;
> 
> Frankly, all this looks like an oversight in this code.
> 
> Not sure why the NULLS value is even recomputed.

The hash of the chain is part of the NULLS value. Since the
entry might have been moved to a different chain, the NULLS
value must be recalculated to contain the proper hash.

However, nobody is using the hash today as far as I can
see so we could as well just remove it and use the base
value only for the nulls marker.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] lib: fix data race in rhashtable_rehash_one

2015-09-21 Thread Thomas Graf
On 09/21/15 at 07:51am, Eric Dumazet wrote:
> The important part here is that we rehash an item, so we need to make
> sure to maintain consistent ->next field, and need to prevent compiler
> from using ->next as a temporary variable.
> 
> ptr->next = 1UL | ((base + offset) << 1);
> 
> Is dangerous because compiler could issue :
> 
> ptr->next = (base + offset);
> 
> ptr->next <<= 1;
> 
> ptr->next += 1UL;
> 
> Frankly, all this looks like an oversight in this code.
> 
> Not sure why the NULLS value is even recomputed.

The hash of the chain is part of the NULLS value. Since the
entry might have been moved to a different chain, the NULLS
value must be recalculated to contain the proper hash.

However, nobody is using the hash today as far as I can
see so we could as well just remove it and use the base
value only for the nulls marker.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Possible netlink autobind regression

2015-09-17 Thread Thomas Graf
On 09/17/15 at 01:15pm, Herbert Xu wrote:
> On Wed, Sep 16, 2015 at 10:02:00PM -0700, Cong Wang wrote:
> >
> > This part doesn't look correct, seems it is checking if this is a kernel
> > netlink socket rather than if it is bound. But I am not sure...
> 
> Good point.  I've changed it so that bound is only set for non-kernel
> sockets.
> 
> ---8<---
> netlink: Fix autobind race condition that leads to zero port ID
> 
> The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
> Reset portid after netlink_insert failure") introduced a race
> condition where if two threads tried to autobind the same socket
> one of them may end up with a zero port ID.
> 
> This patch reverts that commit and instead fixes it by introducing
> a separte "bound" variable to indicate whether a user-space socket
> has been bound.
> 
> Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
> Reported-by: Tejun Heo 
> Reported-by: Linus Torvalds 
> Signed-off-by: Herbert Xu 
> Reviewed-by: Cong Wang 

Acked-by: Thomas Graf 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Possible netlink autobind regression

2015-09-17 Thread Thomas Graf
On 09/17/15 at 01:15pm, Herbert Xu wrote:
> On Wed, Sep 16, 2015 at 10:02:00PM -0700, Cong Wang wrote:
> >
> > This part doesn't look correct, seems it is checking if this is a kernel
> > netlink socket rather than if it is bound. But I am not sure...
> 
> Good point.  I've changed it so that bound is only set for non-kernel
> sockets.
> 
> ---8<---
> netlink: Fix autobind race condition that leads to zero port ID
> 
> The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
> Reset portid after netlink_insert failure") introduced a race
> condition where if two threads tried to autobind the same socket
> one of them may end up with a zero port ID.
> 
> This patch reverts that commit and instead fixes it by introducing
> a separte "bound" variable to indicate whether a user-space socket
> has been bound.
> 
> Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
> Reported-by: Tejun Heo <t...@kernel.org>
> Reported-by: Linus Torvalds <torva...@linux-foundation.org>
> Signed-off-by: Herbert Xu <herb...@gondor.apana.org.au>
> Reviewed-by: Cong Wang <cw...@twopensource.com>

Acked-by: Thomas Graf <tg...@suug.ch>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] vxlan: Refactor vxlan_udp_encap_recv() to kill compiler warning

2015-09-04 Thread Thomas Graf
On 09/04/15 at 12:49pm, Geert Uytterhoeven wrote:
> drivers/net/vxlan.c: In function ‘vxlan_udp_encap_recv’:
> drivers/net/vxlan.c:1226: warning: ‘info’ may be used uninitialized in this 
> function
> 
> While this warning is a false positive, it can be killed easily by
> getting rid of the pointer intermediary and referring directly to the
> ip_tunnel_info structure.
> 
> Signed-off-by: Geert Uytterhoeven 

Acked-by: Thomas Graf 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] vxlan: Refactor vxlan_udp_encap_recv() to kill compiler warning

2015-09-04 Thread Thomas Graf
On 09/04/15 at 12:49pm, Geert Uytterhoeven wrote:
> drivers/net/vxlan.c: In function ‘vxlan_udp_encap_recv’:
> drivers/net/vxlan.c:1226: warning: ‘info’ may be used uninitialized in this 
> function
> 
> While this warning is a false positive, it can be killed easily by
> getting rid of the pointer intermediary and referring directly to the
> ip_tunnel_info structure.
> 
> Signed-off-by: Geert Uytterhoeven <ge...@linux-m68k.org>

Acked-by: Thomas Graf <tg...@suug.ch>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-09-02 Thread Thomas Graf
On 09/02/15 at 10:00am, Herbert Xu wrote:
> On Tue, Sep 01, 2015 at 04:51:24PM +0200, Thomas Graf wrote:
> >
> > 1. The current in-kernel self-test
> > 2. bind_netlink.c: https://github.com/tgraf/rhashtable
> 
> Thanks, I will try to reproduce this.

The path in question is:

int rhashtable_insert_rehash(struct rhashtable *ht)
{
[...]

old_tbl = rht_dereference_rcu(ht->tbl, ht);
tbl = rhashtable_last_table(ht, old_tbl);

size = tbl->size;

if (rht_grow_above_75(ht, tbl))
size *= 2;
/* Do not schedule more than one rehash */
else if (old_tbl != tbl)
return -EBUSY;

The behaviour in question is the immediate rehash during
insertion which we want to fail.

Commits:
ccd57b1bd32460d27bbb9c599e795628a3c66983
a87b9ebf1709687ff213091d0fdb4254b1564803
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-09-02 Thread Thomas Graf
On 09/02/15 at 10:00am, Herbert Xu wrote:
> On Tue, Sep 01, 2015 at 04:51:24PM +0200, Thomas Graf wrote:
> >
> > 1. The current in-kernel self-test
> > 2. bind_netlink.c: https://github.com/tgraf/rhashtable
> 
> Thanks, I will try to reproduce this.

The path in question is:

int rhashtable_insert_rehash(struct rhashtable *ht)
{
[...]

old_tbl = rht_dereference_rcu(ht->tbl, ht);
tbl = rhashtable_last_table(ht, old_tbl);

size = tbl->size;

if (rht_grow_above_75(ht, tbl))
size *= 2;
/* Do not schedule more than one rehash */
else if (old_tbl != tbl)
return -EBUSY;

The behaviour in question is the immediate rehash during
insertion which we want to fail.

Commits:
ccd57b1bd32460d27bbb9c599e795628a3c66983
a87b9ebf1709687ff213091d0fdb4254b1564803
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-09-01 Thread Thomas Graf
On 09/01/15 at 10:16pm, Herbert Xu wrote:
> On Tue, Sep 01, 2015 at 04:13:05PM +0200, Thomas Graf wrote:
> >
> > You can easily trigger this outside of the testsuite as well. Open
> > 10K Netlink sockets in a loop and the creation of the sockets will
> > fail way before memory pressure kicks in.
> 
> That means our implementation is buggy.  Do you have a test program?

1. The current in-kernel self-test
2. bind_netlink.c: https://github.com/tgraf/rhashtable
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-09-01 Thread Thomas Graf
On 09/01/15 at 10:03pm, Herbert Xu wrote:
> On Tue, Sep 01, 2015 at 03:56:18PM +0200, Phil Sutter wrote:
> > 
> > Looking at rhashtable_test.c, I see the initial table size is 8 entries.
> > 70% of that is 5.6 entries, so background expansion is started after the
> > 6th entry has been added, right? Given there are 10 threads running
> > which try to insert 50k entries at the same time, I don't think it's
> > unlikely that three more entries are inserted before the background
> > expansion completes.
> 
> Yes but in that case the GFP_ATOMIC allocation should work because
> the table is so small anyway.

You can easily trigger this outside of the testsuite as well. Open
10K Netlink sockets in a loop and the creation of the sockets will
fail way before memory pressure kicks in.

I agree with you that the user should never retry on memory failure.
That's why I suggested to differentiate between a "permanent" failure
(memory pressure) and non-permanent failure (temporary overload on
background expansion). Hence the proposed difference of return codes
ENOMEM and EBUSY to report this back to the API user.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-09-01 Thread Thomas Graf
On 09/01/15 at 10:16pm, Herbert Xu wrote:
> On Tue, Sep 01, 2015 at 04:13:05PM +0200, Thomas Graf wrote:
> >
> > You can easily trigger this outside of the testsuite as well. Open
> > 10K Netlink sockets in a loop and the creation of the sockets will
> > fail way before memory pressure kicks in.
> 
> That means our implementation is buggy.  Do you have a test program?

1. The current in-kernel self-test
2. bind_netlink.c: https://github.com/tgraf/rhashtable
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-09-01 Thread Thomas Graf
On 09/01/15 at 10:03pm, Herbert Xu wrote:
> On Tue, Sep 01, 2015 at 03:56:18PM +0200, Phil Sutter wrote:
> > 
> > Looking at rhashtable_test.c, I see the initial table size is 8 entries.
> > 70% of that is 5.6 entries, so background expansion is started after the
> > 6th entry has been added, right? Given there are 10 threads running
> > which try to insert 50k entries at the same time, I don't think it's
> > unlikely that three more entries are inserted before the background
> > expansion completes.
> 
> Yes but in that case the GFP_ATOMIC allocation should work because
> the table is so small anyway.

You can easily trigger this outside of the testsuite as well. Open
10K Netlink sockets in a loop and the creation of the sockets will
fail way before memory pressure kicks in.

I agree with you that the user should never retry on memory failure.
That's why I suggested to differentiate between a "permanent" failure
(memory pressure) and non-permanent failure (temporary overload on
background expansion). Hence the proposed difference of return codes
ENOMEM and EBUSY to report this back to the API user.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-08-28 Thread Thomas Graf
On 08/28/15 at 03:34pm, Phil Sutter wrote:
> Quite ugly, IMHO: rhashtable_insert_fast() may return -ENOMEM as
> non-permanent error, if allocation in GFP_ATOMIC failed. In this case,
> allocation in GFP_KERNEL is retried by rht_deferred_worker(). Sadly,
> there is no way to determine if that has already been tried and failed.
> 
> The thread test triggers GFP_ATOMIC allocation failure quite easily, so
> I can't really just ignore this issue. :)

Return EBUSY or ENOBUFS in the non-permanent case? It is definitely
helpful if the API allows to differ between permanent and
non-permanent errors.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] rhashtable-test: calculate max_entries value by default

2015-08-28 Thread Thomas Graf
On 08/28/15 at 12:28pm, Phil Sutter wrote:
> A maximum table size of 64k entries is insufficient for the multiple
> threads test even in default configuration (10 threads * 5 objects =
> 50 objects in total). Since we know how many objects will be
> inserted, calculate the max size unless overridden by parameter.
> 
> Note that specifying the exact number of objects upon table init won't
> suffice as that value is being rounded down to the next power of two -
> anticipate this by rounding up to the next power of two in beforehand.
> 
> Signed-off-by: Phil Sutter 

Acked-by: Thomas Graf 

Thanks for doing this work.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-08-28 Thread Thomas Graf
On 08/28/15 at 12:28pm, Phil Sutter wrote:
> After adding cond_resched() calls to threadfunc(), a surprisingly high
> rate of insert failures occurred probably due to table resizes getting a
> better chance to run in background. To not soften up the remaining
> tests, retry inserts until they either succeed or fail permanently.
> 
> Signed-off-by: Phil Sutter 
> ---
>  lib/test_rhashtable.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
> index 63654e3..093cf84 100644
> --- a/lib/test_rhashtable.c
> +++ b/lib/test_rhashtable.c
> @@ -244,7 +244,7 @@ static int thread_lookup_test(struct thread_data *tdata)
>  
>  static int threadfunc(void *data)
>  {
> - int i, step, err = 0, insert_fails = 0;
> + int i, step, err = 0, retries = 0;
>   struct thread_data *tdata = data;
>  
>   up(_sem);
> @@ -253,21 +253,22 @@ static int threadfunc(void *data)
>  
>   for (i = 0; i < entries; i++) {
>   tdata->objs[i].value = (tdata->id << 16) | i;
> +insert_retry:
>   cond_resched();
>   err = rhashtable_insert_fast(, >objs[i].node,
>test_rht_params);
>   if (err == -ENOMEM || err == -EBUSY) {
> - tdata->objs[i].value = TEST_INSERT_FAIL;
> - insert_fails++;
> + retries++;
> + goto insert_retry;

Is it safe to retry indefinitely on ENOMEM? Retrying on EBUSY is
definitely an improvement and we should do the same in the non
threaded test as well.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] rhashtable-test: add cond_resched() to thread test

2015-08-28 Thread Thomas Graf
On 08/28/15 at 12:28pm, Phil Sutter wrote:
> This should fix for soft lockup bugs triggered on slow systems.
> 
> Signed-off-by: Phil Sutter 

Acked-by: Thomas Graf 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] rhashtable-test: add cond_resched() to thread test

2015-08-28 Thread Thomas Graf
On 08/28/15 at 12:28pm, Phil Sutter wrote:
 This should fix for soft lockup bugs triggered on slow systems.
 
 Signed-off-by: Phil Sutter p...@nwl.cc

Acked-by: Thomas Graf tg...@suug.ch
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] rhashtable-test: calculate max_entries value by default

2015-08-28 Thread Thomas Graf
On 08/28/15 at 12:28pm, Phil Sutter wrote:
 A maximum table size of 64k entries is insufficient for the multiple
 threads test even in default configuration (10 threads * 5 objects =
 50 objects in total). Since we know how many objects will be
 inserted, calculate the max size unless overridden by parameter.
 
 Note that specifying the exact number of objects upon table init won't
 suffice as that value is being rounded down to the next power of two -
 anticipate this by rounding up to the next power of two in beforehand.
 
 Signed-off-by: Phil Sutter p...@nwl.cc

Acked-by: Thomas Graf tg...@suug.ch

Thanks for doing this work.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-08-28 Thread Thomas Graf
On 08/28/15 at 12:28pm, Phil Sutter wrote:
 After adding cond_resched() calls to threadfunc(), a surprisingly high
 rate of insert failures occurred probably due to table resizes getting a
 better chance to run in background. To not soften up the remaining
 tests, retry inserts until they either succeed or fail permanently.
 
 Signed-off-by: Phil Sutter p...@nwl.cc
 ---
  lib/test_rhashtable.c | 13 +++--
  1 file changed, 7 insertions(+), 6 deletions(-)
 
 diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
 index 63654e3..093cf84 100644
 --- a/lib/test_rhashtable.c
 +++ b/lib/test_rhashtable.c
 @@ -244,7 +244,7 @@ static int thread_lookup_test(struct thread_data *tdata)
  
  static int threadfunc(void *data)
  {
 - int i, step, err = 0, insert_fails = 0;
 + int i, step, err = 0, retries = 0;
   struct thread_data *tdata = data;
  
   up(prestart_sem);
 @@ -253,21 +253,22 @@ static int threadfunc(void *data)
  
   for (i = 0; i  entries; i++) {
   tdata-objs[i].value = (tdata-id  16) | i;
 +insert_retry:
   cond_resched();
   err = rhashtable_insert_fast(ht, tdata-objs[i].node,
test_rht_params);
   if (err == -ENOMEM || err == -EBUSY) {
 - tdata-objs[i].value = TEST_INSERT_FAIL;
 - insert_fails++;
 + retries++;
 + goto insert_retry;

Is it safe to retry indefinitely on ENOMEM? Retrying on EBUSY is
definitely an improvement and we should do the same in the non
threaded test as well.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-08-28 Thread Thomas Graf
On 08/28/15 at 03:34pm, Phil Sutter wrote:
 Quite ugly, IMHO: rhashtable_insert_fast() may return -ENOMEM as
 non-permanent error, if allocation in GFP_ATOMIC failed. In this case,
 allocation in GFP_KERNEL is retried by rht_deferred_worker(). Sadly,
 there is no way to determine if that has already been tried and failed.
 
 The thread test triggers GFP_ATOMIC allocation failure quite easily, so
 I can't really just ignore this issue. :)

Return EBUSY or ENOBUFS in the non-permanent case? It is definitely
helpful if the API allows to differ between permanent and
non-permanent errors.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv5 net-next 10/10] openvswitch: Allow attaching helpers to ct action

2015-08-25 Thread Thomas Graf
On 08/24/15 at 05:32pm, Joe Stringer wrote:
> Add support for using conntrack helpers to assist protocol detection.
> The new OVS_CT_ATTR_HELPER attribute of the CT action specifies a helper
> to be used for this connection. If no helper is specified, then helpers
> will be automatically applied as per the sysctl configuration of
> net.netfilter.nf_conntrack_helper.
> 
> The helper may be specified as part of the conntrack action, eg:
> ct(helper=ftp). Initial packets for related connections should be
> committed to allow later packets for the flow to be considered
> established.
> 
> Example ovs-ofctl flows allowing FTP connections from ports 1->2:
> in_port=1,tcp,action=ct(helper=ftp,commit),2
> in_port=2,tcp,ct_state=-trk,action=ct(recirc)
> in_port=2,tcp,ct_state=+trk-new+est,action=1
> in_port=2,tcp,ct_state=+trk+rel,action=1

Not subject to this patch but we may want to revisit the syntax of the
state flags. It's neatly compressed like this but ct_state=untracked
ct_state=related might be more readable. The +trk should be implicit
for anything !untracked

> Signed-off-by: Joe Stringer 

Acked-by: Thomas Graf 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv5 net-next 09/10] openvswitch: Allow matching on conntrack label

2015-08-25 Thread Thomas Graf
On 08/24/15 at 05:32pm, Joe Stringer wrote:
> Allow matching and setting the ct_label field. As with ct_mark, this is
> populated by executing the CT action. The label field may be modified by
> specifying a label and mask nested under the CT action. It is stored as
> metadata attached to the connection. Label modification occurs after
> lookup, and will only persist when the conntrack entry is committed by
> providing the COMMIT flag to the CT action. Labels are currently fixed
> to 128 bits in size.
> 
> Signed-off-by: Joe Stringer 

Acked-by: Thomas Graf 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] route: put lwstate before freeing dst to avoid use after free

2015-08-25 Thread Thomas Graf
On 08/25/15 at 02:25pm, Sasha Levin wrote:
> Commit 61adedf3 ("route: move lwtunnel state to dst_entry") is trying to
> release lwstate after getting rid of dst, which causes a use-after-free
> trying to access dst->lwstate.
> 
> Fixes: 61adedf3 ("route: move lwtunnel state to dst_entry")
> Signed-off-by: Sasha Levin 

Acked-by: Thomas Graf 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv5 net-next 06/10] openvswitch: Allow matching on conntrack mark

2015-08-25 Thread Thomas Graf
On 08/24/15 at 05:32pm, Joe Stringer wrote:
> Allow matching and setting the ct_mark field. As with ct_state and
> ct_zone, these fields are populated when the CT action is executed. To
> write to this field, a value and mask can be specified as a nested
> attribute under the CT action. This data is stored with the conntrack
> entry, and is executed after the lookup occurs for the CT action. The
> conntrack entry itself must be committed using the COMMIT flag in the CT
> action flags for this change to persist.
> 
> Signed-off-by: Justin Pettit 
> Signed-off-by: Joe Stringer 

Acked-by: Thomas Graf 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv5 net-next 05/10] openvswitch: Add conntrack action

2015-08-25 Thread Thomas Graf
On 08/24/15 at 05:32pm, Joe Stringer wrote:
> Expose the kernel connection tracker via OVS. Userspace components can
> make use of the CT action to populate the connection state (ct_state)
> field for a flow. This state can be subsequently matched.
> 
> Exposed connection states are OVS_CS_F_*:
> - NEW (0x01) - Beginning of a new connection.
> - ESTABLISHED (0x02) - Part of an existing connection.
> - RELATED (0x04) - Related to an established connection.
> - INVALID (0x20) - Could not track the connection for this packet.
> - REPLY_DIR (0x40) - This packet is in the reply direction for the flow.
> - TRACKED (0x80) - This packet has been sent through conntrack.
> 
> When the CT action is executed by itself, it will send the packet
> through the connection tracker and populate the ct_state field with one
> or more of the connection state flags above. The CT action will always
> set the TRACKED bit.
> 
> When the COMMIT flag is passed to the conntrack action, this specifies
> that information about the connection should be stored. This allows
> subsequent packets for the same (or related) connections to be
> correlated with this connection. Sending subsequent packets for the
> connection through conntrack allows the connection tracker to consider
> the packets as ESTABLISHED, RELATED, and/or REPLY_DIR.
> 
> The CT action may optionally take a zone to track the flow within. This
> allows connections with the same 5-tuple to be kept logically separate
> from connections in other zones. If the zone is specified, then the
> "ct_zone" match field will be subsequently populated with the zone id.
> 
> IP fragments are handled by transparently assembling them as part of the
> CT action. The maximum received unit (MRU) size is tracked so that
> refragmentation can occur during output.
> 
> IP frag handling contributed by Andy Zhou.
> 
> Signed-off-by: Joe Stringer 
> Signed-off-by: Justin Pettit 
> Signed-off-by: Andy Zhou 

Acked-by: Thomas Graf 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv5 net-next 04/10] dst: Add __skb_dst_copy() variation

2015-08-25 Thread Thomas Graf
On 08/24/15 at 05:32pm, Joe Stringer wrote:
> This variation on skb_dst_copy() doesn't require two skbs.
> 
> Signed-off-by: Joe Stringer 
> Acked-by: Pravin B Shelar 

Acked-by: Thomas Graf 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv5 net-next 01/10] openvswitch: Serialize acts with original netlink len

2015-08-25 Thread Thomas Graf
On 08/24/15 at 05:32pm, Joe Stringer wrote:
> Previously, we used the kernel-internal netlink actions length to
> calculate the size of messages to serialize back to userspace.
> However,the sw_flow_actions may not be formatted exactly the same as the
> actions on the wire, so store the original actions length when
> de-serializing and re-use the original length when serializing.
> 
> Signed-off-by: Joe Stringer 
> Acked-by: Pravin B Shelar 

Acked-by: Thomas Graf 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv5 net-next 05/10] openvswitch: Add conntrack action

2015-08-25 Thread Thomas Graf
On 08/24/15 at 05:32pm, Joe Stringer wrote:
 Expose the kernel connection tracker via OVS. Userspace components can
 make use of the CT action to populate the connection state (ct_state)
 field for a flow. This state can be subsequently matched.
 
 Exposed connection states are OVS_CS_F_*:
 - NEW (0x01) - Beginning of a new connection.
 - ESTABLISHED (0x02) - Part of an existing connection.
 - RELATED (0x04) - Related to an established connection.
 - INVALID (0x20) - Could not track the connection for this packet.
 - REPLY_DIR (0x40) - This packet is in the reply direction for the flow.
 - TRACKED (0x80) - This packet has been sent through conntrack.
 
 When the CT action is executed by itself, it will send the packet
 through the connection tracker and populate the ct_state field with one
 or more of the connection state flags above. The CT action will always
 set the TRACKED bit.
 
 When the COMMIT flag is passed to the conntrack action, this specifies
 that information about the connection should be stored. This allows
 subsequent packets for the same (or related) connections to be
 correlated with this connection. Sending subsequent packets for the
 connection through conntrack allows the connection tracker to consider
 the packets as ESTABLISHED, RELATED, and/or REPLY_DIR.
 
 The CT action may optionally take a zone to track the flow within. This
 allows connections with the same 5-tuple to be kept logically separate
 from connections in other zones. If the zone is specified, then the
 ct_zone match field will be subsequently populated with the zone id.
 
 IP fragments are handled by transparently assembling them as part of the
 CT action. The maximum received unit (MRU) size is tracked so that
 refragmentation can occur during output.
 
 IP frag handling contributed by Andy Zhou.
 
 Signed-off-by: Joe Stringer joestrin...@nicira.com
 Signed-off-by: Justin Pettit jpet...@nicira.com
 Signed-off-by: Andy Zhou az...@nicira.com

Acked-by: Thomas Graf tg...@suug.ch
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv5 net-next 10/10] openvswitch: Allow attaching helpers to ct action

2015-08-25 Thread Thomas Graf
On 08/24/15 at 05:32pm, Joe Stringer wrote:
 Add support for using conntrack helpers to assist protocol detection.
 The new OVS_CT_ATTR_HELPER attribute of the CT action specifies a helper
 to be used for this connection. If no helper is specified, then helpers
 will be automatically applied as per the sysctl configuration of
 net.netfilter.nf_conntrack_helper.
 
 The helper may be specified as part of the conntrack action, eg:
 ct(helper=ftp). Initial packets for related connections should be
 committed to allow later packets for the flow to be considered
 established.
 
 Example ovs-ofctl flows allowing FTP connections from ports 1-2:
 in_port=1,tcp,action=ct(helper=ftp,commit),2
 in_port=2,tcp,ct_state=-trk,action=ct(recirc)
 in_port=2,tcp,ct_state=+trk-new+est,action=1
 in_port=2,tcp,ct_state=+trk+rel,action=1

Not subject to this patch but we may want to revisit the syntax of the
state flags. It's neatly compressed like this but ct_state=untracked
ct_state=related might be more readable. The +trk should be implicit
for anything !untracked

 Signed-off-by: Joe Stringer joestrin...@nicira.com

Acked-by: Thomas Graf tg...@suug.ch
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv5 net-next 04/10] dst: Add __skb_dst_copy() variation

2015-08-25 Thread Thomas Graf
On 08/24/15 at 05:32pm, Joe Stringer wrote:
 This variation on skb_dst_copy() doesn't require two skbs.
 
 Signed-off-by: Joe Stringer joestrin...@nicira.com
 Acked-by: Pravin B Shelar pshe...@nicira.com

Acked-by: Thomas Graf tg...@suug.ch
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] route: put lwstate before freeing dst to avoid use after free

2015-08-25 Thread Thomas Graf
On 08/25/15 at 02:25pm, Sasha Levin wrote:
 Commit 61adedf3 (route: move lwtunnel state to dst_entry) is trying to
 release lwstate after getting rid of dst, which causes a use-after-free
 trying to access dst-lwstate.
 
 Fixes: 61adedf3 (route: move lwtunnel state to dst_entry)
 Signed-off-by: Sasha Levin sasha.le...@oracle.com

Acked-by: Thomas Graf tg...@suug.ch
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv5 net-next 09/10] openvswitch: Allow matching on conntrack label

2015-08-25 Thread Thomas Graf
On 08/24/15 at 05:32pm, Joe Stringer wrote:
 Allow matching and setting the ct_label field. As with ct_mark, this is
 populated by executing the CT action. The label field may be modified by
 specifying a label and mask nested under the CT action. It is stored as
 metadata attached to the connection. Label modification occurs after
 lookup, and will only persist when the conntrack entry is committed by
 providing the COMMIT flag to the CT action. Labels are currently fixed
 to 128 bits in size.
 
 Signed-off-by: Joe Stringer joestrin...@nicira.com

Acked-by: Thomas Graf tg...@suug.ch
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv5 net-next 01/10] openvswitch: Serialize acts with original netlink len

2015-08-25 Thread Thomas Graf
On 08/24/15 at 05:32pm, Joe Stringer wrote:
 Previously, we used the kernel-internal netlink actions length to
 calculate the size of messages to serialize back to userspace.
 However,the sw_flow_actions may not be formatted exactly the same as the
 actions on the wire, so store the original actions length when
 de-serializing and re-use the original length when serializing.
 
 Signed-off-by: Joe Stringer joestrin...@nicira.com
 Acked-by: Pravin B Shelar pshe...@nicira.com

Acked-by: Thomas Graf tg...@suug.ch
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv5 net-next 06/10] openvswitch: Allow matching on conntrack mark

2015-08-25 Thread Thomas Graf
On 08/24/15 at 05:32pm, Joe Stringer wrote:
 Allow matching and setting the ct_mark field. As with ct_state and
 ct_zone, these fields are populated when the CT action is executed. To
 write to this field, a value and mask can be specified as a nested
 attribute under the CT action. This data is stored with the conntrack
 entry, and is executed after the lookup occurs for the CT action. The
 conntrack entry itself must be committed using the COMMIT flag in the CT
 action flags for this change to persist.
 
 Signed-off-by: Justin Pettit jpet...@nicira.com
 Signed-off-by: Joe Stringer joestrin...@nicira.com

Acked-by: Thomas Graf tg...@suug.ch
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv4 net-next 08/10] netfilter: connlabels: Export setting connlabel length

2015-08-20 Thread Thomas Graf
On 08/18/15 at 04:39pm, Joe Stringer wrote:
> Add functions to change connlabel length into nf_conntrack_labels.c so
> they may be reused by other modules like OVS and nftables without
> needing to jump through xt_match_check() hoops.
> 
> Suggested-by: Florian Westphal 
> Signed-off-by: Joe Stringer 
> Acked-by: Florian Westphal 

Acked-by: Thomas Graf 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv4 net-next 07/10] netfilter: Always export nf_connlabels_replace()

2015-08-20 Thread Thomas Graf
On 08/18/15 at 04:39pm, Joe Stringer wrote:
> The following patches will reuse this code from OVS.
> 
> Signed-off-by: Joe Stringer 

Acked-by: Thomas Graf 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv4 net-next 05/10] openvswitch: Add conntrack action

2015-08-20 Thread Thomas Graf
On 08/18/15 at 04:39pm, Joe Stringer wrote:
> Expose the kernel connection tracker via OVS. Userspace components can
> make use of the "ct()" action, followed by "recirculate", to populate
> the conntracking state in the OVS flow key, and subsequently match on
> that state.
> 
> Example ODP flows allowing traffic from 1->2, only replies from 2->1:
> in_port=1,tcp,action=ct(commit,zone=1),2
> in_port=2,ct_state=-trk,tcp,action=ct(zone=1),recirc(1)
> recirc_id=1,in_port=2,ct_state=+trk+est-new,tcp,action=1
> 
> IP fragments are handled by transparently assembling them as part of the
> ct action. The maximum received unit (MRU) size is tracked so that
> refragmentation can occur during output.
> 
> IP frag handling contributed by Andy Zhou.
> 
> Signed-off-by: Joe Stringer 
> Signed-off-by: Justin Pettit 
> Signed-off-by: Andy Zhou 

OVS bits look great.

Acked-by: Thomas Graf 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv4 net-next 08/10] netfilter: connlabels: Export setting connlabel length

2015-08-20 Thread Thomas Graf
On 08/18/15 at 04:39pm, Joe Stringer wrote:
 Add functions to change connlabel length into nf_conntrack_labels.c so
 they may be reused by other modules like OVS and nftables without
 needing to jump through xt_match_check() hoops.
 
 Suggested-by: Florian Westphal f...@strlen.de
 Signed-off-by: Joe Stringer joestrin...@nicira.com
 Acked-by: Florian Westphal f...@strlen.de

Acked-by: Thomas Graf tg...@suug.ch
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv4 net-next 05/10] openvswitch: Add conntrack action

2015-08-20 Thread Thomas Graf
On 08/18/15 at 04:39pm, Joe Stringer wrote:
 Expose the kernel connection tracker via OVS. Userspace components can
 make use of the ct() action, followed by recirculate, to populate
 the conntracking state in the OVS flow key, and subsequently match on
 that state.
 
 Example ODP flows allowing traffic from 1-2, only replies from 2-1:
 in_port=1,tcp,action=ct(commit,zone=1),2
 in_port=2,ct_state=-trk,tcp,action=ct(zone=1),recirc(1)
 recirc_id=1,in_port=2,ct_state=+trk+est-new,tcp,action=1
 
 IP fragments are handled by transparently assembling them as part of the
 ct action. The maximum received unit (MRU) size is tracked so that
 refragmentation can occur during output.
 
 IP frag handling contributed by Andy Zhou.
 
 Signed-off-by: Joe Stringer joestrin...@nicira.com
 Signed-off-by: Justin Pettit jpet...@nicira.com
 Signed-off-by: Andy Zhou az...@nicira.com

OVS bits look great.

Acked-by: Thomas Graf tg...@suug.ch
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv4 net-next 07/10] netfilter: Always export nf_connlabels_replace()

2015-08-20 Thread Thomas Graf
On 08/18/15 at 04:39pm, Joe Stringer wrote:
 The following patches will reuse this code from OVS.
 
 Signed-off-by: Joe Stringer joestrin...@nicira.com

Acked-by: Thomas Graf tg...@suug.ch
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rhashtable-test: extend to test concurrency

2015-08-16 Thread Thomas Graf
On 08/15/15 at 12:37am, Phil Sutter wrote:
> After having tested insertion, lookup, table walk and removal, spawn a
> number of threads running operations on the same rhashtable. Each of
> them will:
> 
> 1) insert it's own set of objects,
> 2) lookup every successfully inserted object and finally
> 3) remove objects in several rounds until all of them have been removed,
>making sure the remaining ones are still found after each round.
> 
> This should put a good amount of load onto the system and due to
> synchronising thread startup via two semaphores also extensive
> concurrent table access.
> 
> The default number of ten threads returned within half a second on my
> local VM with two cores. Running 200 threads took about four seconds. If
> slow systems suffer too much from this though, the default could be
> lowered or even set to zero so this extended test does not run at all by
> default.
> 
> Signed-off-by: Phil Sutter 

Looks great. A default of 10 makes sense as well. Thanks a lot!

Acked-by: Thomas Graf 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rhashtable-test: extend to test concurrency

2015-08-16 Thread Thomas Graf
On 08/15/15 at 12:37am, Phil Sutter wrote:
 After having tested insertion, lookup, table walk and removal, spawn a
 number of threads running operations on the same rhashtable. Each of
 them will:
 
 1) insert it's own set of objects,
 2) lookup every successfully inserted object and finally
 3) remove objects in several rounds until all of them have been removed,
making sure the remaining ones are still found after each round.
 
 This should put a good amount of load onto the system and due to
 synchronising thread startup via two semaphores also extensive
 concurrent table access.
 
 The default number of ten threads returned within half a second on my
 local VM with two cores. Running 200 threads took about four seconds. If
 slow systems suffer too much from this though, the default could be
 lowered or even set to zero so this extended test does not run at all by
 default.
 
 Signed-off-by: Phil Sutter p...@nwl.cc

Looks great. A default of 10 makes sense as well. Thanks a lot!

Acked-by: Thomas Graf tg...@suug.ch
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next 1/9] openvswitch: Scrub packet in ovs_vport_receive()

2015-08-01 Thread Thomas Graf
On 07/31/15 at 10:51am, Joe Stringer wrote:
> On 31 July 2015 at 07:34, Hannes Frederic Sowa  wrote:
> > In general, this shouldn't be necessary as the packet should already be
> > scrubbed before they arrive here.
> >
> > Could you maybe add a WARN_ON and check how those skbs with conntrack
> > data traverse the stack? I also didn't understand why make it dependent
> > on the socket.
> 
> OK, sure. One case I could think of is with an OVS internal port in
> another namespace, directly attached to the bridge. I'll have a play
> around with WARN_ON and see if I can come up with something more
> trimmed down.

The OVS internal port will definitely pass through an unscrubbed
packet across namespaces. I think the proper thing to do would be
to scrub but conditionally keep metadata.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next 1/9] openvswitch: Scrub packet in ovs_vport_receive()

2015-08-01 Thread Thomas Graf
On 07/31/15 at 10:51am, Joe Stringer wrote:
 On 31 July 2015 at 07:34, Hannes Frederic Sowa han...@redhat.com wrote:
  In general, this shouldn't be necessary as the packet should already be
  scrubbed before they arrive here.
 
  Could you maybe add a WARN_ON and check how those skbs with conntrack
  data traverse the stack? I also didn't understand why make it dependent
  on the socket.
 
 OK, sure. One case I could think of is with an OVS internal port in
 another namespace, directly attached to the bridge. I'll have a play
 around with WARN_ON and see if I can come up with something more
 trimmed down.

The OVS internal port will definitely pass through an unscrubbed
packet across namespaces. I think the proper thing to do would be
to scrub but conditionally keep metadata.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next 1/9] openvswitch: Scrub packet in ovs_vport_receive()

2015-07-31 Thread Thomas Graf
On 07/30/15 at 04:16pm, Joe Stringer wrote:
> On 30 July 2015 at 11:40, Thomas Graf  wrote:
> > On 07/30/15 at 11:12am, Joe Stringer wrote:
> >> Signed-off-by: Joe Stringer 
> >
> > Can you write a few lines on why this is needed? I have flows which
> > use the mark to communicate with netfilter through internal ports.
> 
> The problem I was seeing is when packets come from a different
> namespace on the localhost, they still have conntrack data associated.
> This doesn't make sense, so the intention is to perform nf_reset().
> However, it seems like we should actually be doing a bit more - at
> least the skb_dst_drop() and perhaps some of the other stuff in
> skb_scrub_packet().
> 
> Do you want to retain the mark when transitioning between namespaces?

Since we have retained it so far I think we should keep on doing
that. I'm pretty sure there are users of it out there besides me.
As you know, it's common to have tap devices in between OVS and the
guest in OpenStack and install netfilter rules there.

As for whether we should scrub it in between namespaces. Probably yes
but it's definitely tremendously useful to be able to transfer some
metadata (mark and dst metadata) between namespaces. The default
behaviour should probably be to scrub it with a flag to keep it. If
that flag is not set and nsid of port != bridge then we scrub the mark
and other metadata.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next 1/9] openvswitch: Scrub packet in ovs_vport_receive()

2015-07-31 Thread Thomas Graf
On 07/30/15 at 04:16pm, Joe Stringer wrote:
 On 30 July 2015 at 11:40, Thomas Graf tg...@suug.ch wrote:
  On 07/30/15 at 11:12am, Joe Stringer wrote:
  Signed-off-by: Joe Stringer joestrin...@nicira.com
 
  Can you write a few lines on why this is needed? I have flows which
  use the mark to communicate with netfilter through internal ports.
 
 The problem I was seeing is when packets come from a different
 namespace on the localhost, they still have conntrack data associated.
 This doesn't make sense, so the intention is to perform nf_reset().
 However, it seems like we should actually be doing a bit more - at
 least the skb_dst_drop() and perhaps some of the other stuff in
 skb_scrub_packet().
 
 Do you want to retain the mark when transitioning between namespaces?

Since we have retained it so far I think we should keep on doing
that. I'm pretty sure there are users of it out there besides me.
As you know, it's common to have tap devices in between OVS and the
guest in OpenStack and install netfilter rules there.

As for whether we should scrub it in between namespaces. Probably yes
but it's definitely tremendously useful to be able to transfer some
metadata (mark and dst metadata) between namespaces. The default
behaviour should probably be to scrub it with a flag to keep it. If
that flag is not set and nsid of port != bridge then we scrub the mark
and other metadata.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next 4/9] ipv6: Export nf_ct_frag6_gather()

2015-07-30 Thread Thomas Graf
On 07/30/15 at 11:12am, Joe Stringer wrote:
> Signed-off-by: Joe Stringer 

Acked-by: Thomas Graf 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next 3/9] openvswitch: Move MASKED* macros to datapath.h

2015-07-30 Thread Thomas Graf
On 07/30/15 at 11:12am, Joe Stringer wrote:
> This will allow the ovs-conntrack code to reuse these macros.
> 
> Signed-off-by: Joe Stringer 

Acked-by: Thomas Graf 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next 2/9] openvswitch: Serialize acts with original netlink len

2015-07-30 Thread Thomas Graf
On 07/30/15 at 11:12am, Joe Stringer wrote:
> Previously, we used the kernel-internal netlink actions length to
> calculate the size of messages to serialize back to userspace.
> However,the sw_flow_actions may not be formatted exactly the same as the
> actions on the wire, so store the original actions length when
> de-serializing and re-use the original length when serializing.
> 
> Signed-off-by: Joe Stringer 

Acked-by: Thomas Graf 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next 1/9] openvswitch: Scrub packet in ovs_vport_receive()

2015-07-30 Thread Thomas Graf
On 07/30/15 at 11:12am, Joe Stringer wrote:
> Signed-off-by: Joe Stringer 

Can you write a few lines on why this is needed? I have flows which
use the mark to communicate with netfilter through internal ports.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next 3/9] openvswitch: Move MASKED* macros to datapath.h

2015-07-30 Thread Thomas Graf
On 07/30/15 at 11:12am, Joe Stringer wrote:
 This will allow the ovs-conntrack code to reuse these macros.
 
 Signed-off-by: Joe Stringer joestrin...@nicira.com

Acked-by: Thomas Graf tg...@suug.ch
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next 2/9] openvswitch: Serialize acts with original netlink len

2015-07-30 Thread Thomas Graf
On 07/30/15 at 11:12am, Joe Stringer wrote:
 Previously, we used the kernel-internal netlink actions length to
 calculate the size of messages to serialize back to userspace.
 However,the sw_flow_actions may not be formatted exactly the same as the
 actions on the wire, so store the original actions length when
 de-serializing and re-use the original length when serializing.
 
 Signed-off-by: Joe Stringer joestrin...@nicira.com

Acked-by: Thomas Graf tg...@suug.ch
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next 4/9] ipv6: Export nf_ct_frag6_gather()

2015-07-30 Thread Thomas Graf
On 07/30/15 at 11:12am, Joe Stringer wrote:
 Signed-off-by: Joe Stringer joestrin...@nicira.com

Acked-by: Thomas Graf tg...@suug.ch
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next 1/9] openvswitch: Scrub packet in ovs_vport_receive()

2015-07-30 Thread Thomas Graf
On 07/30/15 at 11:12am, Joe Stringer wrote:
 Signed-off-by: Joe Stringer joestrin...@nicira.com

Can you write a few lines on why this is needed? I have flows which
use the mark to communicate with netfilter through internal ports.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next] ip_tunnel: Call ip_tunnel_core_init() from inet_init()

2015-07-23 Thread Thomas Graf
On 07/23/15 at 01:28am, David Miller wrote:
> From: Thomas Graf 
> Date: Thu, 23 Jul 2015 10:08:44 +0200
> 
> > Convert the module_init() to a invocation from inet_init() since
> > ip_tunnel_core is part of the INET built-in.
> > 
> > Fixes: 3093fbe7ff4 ("route: Per route IP tunnel metadata via lightweight 
> > tunnel")
> > Signed-off-by: Thomas Graf 
> 
> I applied this instead of my module.h include fix, thanks Thomas.

Thanks, sorry for the extra trouble you had to go through.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH net-next] ip_tunnel: Call ip_tunnel_core_init() from inet_init()

2015-07-23 Thread Thomas Graf
Convert the module_init() to a invocation from inet_init() since
ip_tunnel_core is part of the INET built-in.

Fixes: 3093fbe7ff4 ("route: Per route IP tunnel metadata via lightweight 
tunnel")
Signed-off-by: Thomas Graf 
---
Compiles for me with:
make ARCH=arm CROSS_COMPILE=arm-linux-gnu-

 include/net/ip_tunnels.h  |  2 ++
 net/ipv4/af_inet.c|  3 +++
 net/ipv4/ip_tunnel_core.c | 11 +--
 3 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index d975b3e..4798441 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -311,6 +311,8 @@ static inline int ip_tunnel_collect_metadata(void)
return static_key_false(_tunnel_metadata_cnt);
 }
 
+void __init ip_tunnel_core_init(void);
+
 void ip_tunnel_need_metadata(void);
 void ip_tunnel_unneed_metadata(void);
 
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9532ee8..cc4e498 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -112,6 +112,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1780,6 +1781,8 @@ static int __init inet_init(void)
 
dev_add_pack(_packet_type);
 
+   ip_tunnel_core_init();
+
rc = 0;
 out:
return rc;
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index 630e6d5..5512f4e 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -292,19 +292,10 @@ static const struct lwtunnel_encap_ops ip_tun_lwt_ops = {
.get_encap_size = ip_tun_encap_nlsize,
 };
 
-static int __init ip_tunnel_core_init(void)
+void __init ip_tunnel_core_init(void)
 {
lwtunnel_encap_add_ops(_tun_lwt_ops, LWTUNNEL_ENCAP_IP);
-
-   return 0;
-}
-module_init(ip_tunnel_core_init);
-
-static void __exit ip_tunnel_core_exit(void)
-{
-   lwtunnel_encap_del_ops(_tun_lwt_ops, LWTUNNEL_ENCAP_IP);
 }
-module_exit(ip_tunnel_core_exit);
 
 struct static_key ip_tunnel_metadata_cnt = STATIC_KEY_INIT_FALSE;
 EXPORT_SYMBOL(ip_tunnel_metadata_cnt);
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH net-next] ip_tunnel: Call ip_tunnel_core_init() from inet_init()

2015-07-23 Thread Thomas Graf
Convert the module_init() to a invocation from inet_init() since
ip_tunnel_core is part of the INET built-in.

Fixes: 3093fbe7ff4 (route: Per route IP tunnel metadata via lightweight 
tunnel)
Signed-off-by: Thomas Graf tg...@suug.ch
---
Compiles for me with:
make ARCH=arm CROSS_COMPILE=arm-linux-gnu-

 include/net/ip_tunnels.h  |  2 ++
 net/ipv4/af_inet.c|  3 +++
 net/ipv4/ip_tunnel_core.c | 11 +--
 3 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index d975b3e..4798441 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -311,6 +311,8 @@ static inline int ip_tunnel_collect_metadata(void)
return static_key_false(ip_tunnel_metadata_cnt);
 }
 
+void __init ip_tunnel_core_init(void);
+
 void ip_tunnel_need_metadata(void);
 void ip_tunnel_unneed_metadata(void);
 
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9532ee8..cc4e498 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -112,6 +112,7 @@
 #include net/raw.h
 #include net/icmp.h
 #include net/inet_common.h
+#include net/ip_tunnels.h
 #include net/xfrm.h
 #include net/net_namespace.h
 #include net/secure_seq.h
@@ -1780,6 +1781,8 @@ static int __init inet_init(void)
 
dev_add_pack(ip_packet_type);
 
+   ip_tunnel_core_init();
+
rc = 0;
 out:
return rc;
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index 630e6d5..5512f4e 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -292,19 +292,10 @@ static const struct lwtunnel_encap_ops ip_tun_lwt_ops = {
.get_encap_size = ip_tun_encap_nlsize,
 };
 
-static int __init ip_tunnel_core_init(void)
+void __init ip_tunnel_core_init(void)
 {
lwtunnel_encap_add_ops(ip_tun_lwt_ops, LWTUNNEL_ENCAP_IP);
-
-   return 0;
-}
-module_init(ip_tunnel_core_init);
-
-static void __exit ip_tunnel_core_exit(void)
-{
-   lwtunnel_encap_del_ops(ip_tun_lwt_ops, LWTUNNEL_ENCAP_IP);
 }
-module_exit(ip_tunnel_core_exit);
 
 struct static_key ip_tunnel_metadata_cnt = STATIC_KEY_INIT_FALSE;
 EXPORT_SYMBOL(ip_tunnel_metadata_cnt);
-- 
2.4.3

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next] ip_tunnel: Call ip_tunnel_core_init() from inet_init()

2015-07-23 Thread Thomas Graf
On 07/23/15 at 01:28am, David Miller wrote:
 From: Thomas Graf tg...@suug.ch
 Date: Thu, 23 Jul 2015 10:08:44 +0200
 
  Convert the module_init() to a invocation from inet_init() since
  ip_tunnel_core is part of the INET built-in.
  
  Fixes: 3093fbe7ff4 (route: Per route IP tunnel metadata via lightweight 
  tunnel)
  Signed-off-by: Thomas Graf tg...@suug.ch
 
 I applied this instead of my module.h include fix, thanks Thomas.

Thanks, sorry for the extra trouble you had to go through.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 4.1 regression in resizable hashtable tests

2015-07-17 Thread Thomas Graf
On 07/17/15 at 12:26pm, Phil Sutter wrote:
> On Fri, Jul 17, 2015 at 10:04:56AM +0200, Thomas Graf wrote:
> > On 07/02/15 at 10:09pm, Meelis Roos wrote:
> > > [   33.425061] Running rhashtable test nelem=8, max_size=65536, 
> > > shrinking=0
> > > [   33.425154] Test 00:
> > > [   33.534470]   Adding 5 keys
> > > [   34.743553] Info: encountered resize
> > > [   34.743698] Info: encountered resize
> > > [   34.743838] Info: encountered resize
> > > [   34.744057] Info: encountered resize
> > > [   34.744430] Info: encountered resize
> > > [   34.745139] Info: encountered resize
> > > [   34.746441] Info: encountered resize
> > > [   34.749055] Info: encountered resize
> > > [   34.754469] Info: encountered resize
> > > [   34.764836] Info: encountered resize
> > > [   34.785696] Info: encountered resize
> > > [   34.827448] Info: encountered resize
> > > [   34.896936]   Traversal complete: counted=49993, nelems=5, 
> > > entries=5, table-jumps=12
> > > [   34.897056] Test failed: Total count mismatch ^^^
> > 
> > I do see count mismatches as well due to the design of the walker
> > which restarts and thus sees certain entries multiple times.
> > 
> > Do you have this commit as well?
> > 
> > Author: Phil Sutter 
> > Date:   Mon Jul 6 15:51:20 2015 +0200
> > 
> > rhashtable: fix for resize events during table walk
> 
> Thomas, this should be resolved already. Meelis replied[1] to my patch,
> stating it fixes that problem for him. Though he's still waiting for
> your proposed patch to add a schedule() call so the kernel won't
> complain on his slow UltraSparc. :)
> 
> Cheers, Phil
> 
> [1]: http://www.spinics.net/lists/netdev/msg335767.html

OK, good to know. I've posted the schedule patch today:
https://patchwork.ozlabs.org/patch/497035/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 4.1 regression in resizable hashtable tests

2015-07-17 Thread Thomas Graf
On 07/02/15 at 10:09pm, Meelis Roos wrote:
> [   33.425061] Running rhashtable test nelem=8, max_size=65536, shrinking=0
> [   33.425154] Test 00:
> [   33.534470]   Adding 5 keys
> [   34.743553] Info: encountered resize
> [   34.743698] Info: encountered resize
> [   34.743838] Info: encountered resize
> [   34.744057] Info: encountered resize
> [   34.744430] Info: encountered resize
> [   34.745139] Info: encountered resize
> [   34.746441] Info: encountered resize
> [   34.749055] Info: encountered resize
> [   34.754469] Info: encountered resize
> [   34.764836] Info: encountered resize
> [   34.785696] Info: encountered resize
> [   34.827448] Info: encountered resize
> [   34.896936]   Traversal complete: counted=49993, nelems=5, 
> entries=5, table-jumps=12
> [   34.897056] Test failed: Total count mismatch ^^^

I do see count mismatches as well due to the design of the walker
which restarts and thus sees certain entries multiple times.

Do you have this commit as well?

Author: Phil Sutter 
Date:   Mon Jul 6 15:51:20 2015 +0200

rhashtable: fix for resize events during table walk
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 4.1 regression in resizable hashtable tests

2015-07-17 Thread Thomas Graf
On 07/02/15 at 10:09pm, Meelis Roos wrote:
 [   33.425061] Running rhashtable test nelem=8, max_size=65536, shrinking=0
 [   33.425154] Test 00:
 [   33.534470]   Adding 5 keys
 [   34.743553] Info: encountered resize
 [   34.743698] Info: encountered resize
 [   34.743838] Info: encountered resize
 [   34.744057] Info: encountered resize
 [   34.744430] Info: encountered resize
 [   34.745139] Info: encountered resize
 [   34.746441] Info: encountered resize
 [   34.749055] Info: encountered resize
 [   34.754469] Info: encountered resize
 [   34.764836] Info: encountered resize
 [   34.785696] Info: encountered resize
 [   34.827448] Info: encountered resize
 [   34.896936]   Traversal complete: counted=49993, nelems=5, 
 entries=5, table-jumps=12
 [   34.897056] Test failed: Total count mismatch ^^^

I do see count mismatches as well due to the design of the walker
which restarts and thus sees certain entries multiple times.

Do you have this commit as well?

Author: Phil Sutter p...@nwl.cc
Date:   Mon Jul 6 15:51:20 2015 +0200

rhashtable: fix for resize events during table walk
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 4.1 regression in resizable hashtable tests

2015-07-17 Thread Thomas Graf
On 07/17/15 at 12:26pm, Phil Sutter wrote:
 On Fri, Jul 17, 2015 at 10:04:56AM +0200, Thomas Graf wrote:
  On 07/02/15 at 10:09pm, Meelis Roos wrote:
   [   33.425061] Running rhashtable test nelem=8, max_size=65536, 
   shrinking=0
   [   33.425154] Test 00:
   [   33.534470]   Adding 5 keys
   [   34.743553] Info: encountered resize
   [   34.743698] Info: encountered resize
   [   34.743838] Info: encountered resize
   [   34.744057] Info: encountered resize
   [   34.744430] Info: encountered resize
   [   34.745139] Info: encountered resize
   [   34.746441] Info: encountered resize
   [   34.749055] Info: encountered resize
   [   34.754469] Info: encountered resize
   [   34.764836] Info: encountered resize
   [   34.785696] Info: encountered resize
   [   34.827448] Info: encountered resize
   [   34.896936]   Traversal complete: counted=49993, nelems=5, 
   entries=5, table-jumps=12
   [   34.897056] Test failed: Total count mismatch ^^^
  
  I do see count mismatches as well due to the design of the walker
  which restarts and thus sees certain entries multiple times.
  
  Do you have this commit as well?
  
  Author: Phil Sutter p...@nwl.cc
  Date:   Mon Jul 6 15:51:20 2015 +0200
  
  rhashtable: fix for resize events during table walk
 
 Thomas, this should be resolved already. Meelis replied[1] to my patch,
 stating it fixes that problem for him. Though he's still waiting for
 your proposed patch to add a schedule() call so the kernel won't
 complain on his slow UltraSparc. :)
 
 Cheers, Phil
 
 [1]: http://www.spinics.net/lists/netdev/msg335767.html

OK, good to know. I've posted the schedule patch today:
https://patchwork.ozlabs.org/patch/497035/
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] jhash: Deinline jhash, jhash2 and __jhash_nwords

2015-07-16 Thread Thomas Graf
On 07/16/15 at 02:15pm, Denys Vlasenko wrote:
> On 07/16/2015 12:41 PM, Thomas Graf wrote:
> > On 07/16/15 at 12:02pm, Denys Vlasenko wrote:
> >> +/* jhash - hash an arbitrary key
> >> + * @k: sequence of bytes as key
> >> + * @length: the length of the key
> >> + * @initval: the previous hash, or an arbitray value
> >> + *
> >> + * The generic version, hashes an arbitrary sequence of bytes.
> >> + * No alignment or length assumptions are made about the input key.
> >> + *
> >> + * Returns the hash value of the key. The result depends on endianness.
> >> + */
> >> +u32 jhash(const void *key, u32 length, u32 initval)
> > 
> > Shouldn't these live in lib/jhash.c or something? Otherwise
> > everyone needs to depend on CONFIG_RHASHTABLE
> 
> There is no CONFIG_RHASHTABLE, rhashtable.c is compiled unconditionally.
> 
> I will send an alternative patch, which creates jhash.c;
> apply whichever version you like most.

Right. I misread the CONFIG_TEST_RHASHTABLE. I'm fine with this then
but agree with Daniel that this must be severely tested for
performance regressions.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] jhash: Deinline jhash, jhash2 and __jhash_nwords

2015-07-16 Thread Thomas Graf
On 07/16/15 at 12:02pm, Denys Vlasenko wrote:
> +/* jhash - hash an arbitrary key
> + * @k: sequence of bytes as key
> + * @length: the length of the key
> + * @initval: the previous hash, or an arbitray value
> + *
> + * The generic version, hashes an arbitrary sequence of bytes.
> + * No alignment or length assumptions are made about the input key.
> + *
> + * Returns the hash value of the key. The result depends on endianness.
> + */
> +u32 jhash(const void *key, u32 length, u32 initval)

Shouldn't these live in lib/jhash.c or something? Otherwise
everyone needs to depend on CONFIG_RHASHTABLE
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] jhash: Deinline jhash, jhash2 and __jhash_nwords

2015-07-16 Thread Thomas Graf
On 07/16/15 at 12:02pm, Denys Vlasenko wrote:
 +/* jhash - hash an arbitrary key
 + * @k: sequence of bytes as key
 + * @length: the length of the key
 + * @initval: the previous hash, or an arbitray value
 + *
 + * The generic version, hashes an arbitrary sequence of bytes.
 + * No alignment or length assumptions are made about the input key.
 + *
 + * Returns the hash value of the key. The result depends on endianness.
 + */
 +u32 jhash(const void *key, u32 length, u32 initval)

Shouldn't these live in lib/jhash.c or something? Otherwise
everyone needs to depend on CONFIG_RHASHTABLE
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] jhash: Deinline jhash, jhash2 and __jhash_nwords

2015-07-16 Thread Thomas Graf
On 07/16/15 at 02:15pm, Denys Vlasenko wrote:
 On 07/16/2015 12:41 PM, Thomas Graf wrote:
  On 07/16/15 at 12:02pm, Denys Vlasenko wrote:
  +/* jhash - hash an arbitrary key
  + * @k: sequence of bytes as key
  + * @length: the length of the key
  + * @initval: the previous hash, or an arbitray value
  + *
  + * The generic version, hashes an arbitrary sequence of bytes.
  + * No alignment or length assumptions are made about the input key.
  + *
  + * Returns the hash value of the key. The result depends on endianness.
  + */
  +u32 jhash(const void *key, u32 length, u32 initval)
  
  Shouldn't these live in lib/jhash.c or something? Otherwise
  everyone needs to depend on CONFIG_RHASHTABLE
 
 There is no CONFIG_RHASHTABLE, rhashtable.c is compiled unconditionally.
 
 I will send an alternative patch, which creates jhash.c;
 apply whichever version you like most.

Right. I misread the CONFIG_TEST_RHASHTABLE. I'm fine with this then
but agree with Daniel that this must be severely tested for
performance regressions.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] rhashtable: fix for resize events during table walk

2015-07-14 Thread Thomas Graf
On 07/15/15 at 12:35am, mr...@linux.ee wrote:
> Yes, this fixes the error, thank you.
> 
> The new problem with the test - soft lockup - CPU#0 stuck for 22s! is 
> still there on 360 MHz UltraSparc IIi. I understand it is harmless but 
> is there some easy way to make the test avoid NMI watchdog?
> 
> [   58.374173] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! 
> [swapper:1]

They don't show up on my system. I will add a schedule() call to the
long loops.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mmap()ed AF_NETLINK: lockdep and sleep-in-atomic warnings

2015-07-14 Thread Thomas Graf
On 07/13/15 at 10:11pm, Cong Wang wrote:
> Caused by:
> 
> commit 21e4902aea80ef35afc00ee8d2abdea4f519b7f7
> Author: Thomas Graf 
> Date:   Fri Jan 2 23:00:22 2015 +0100
> 
> netlink: Lockless lookup with RCU grace period in socket release
> 
> Defers the release of the socket reference using call_rcu() to
> allow using an RCU read-side protected call to rhashtable_lookup()
> 
> This restores behaviour and performance gains as previously
> introduced by e341694 ("netlink: Convert netlink_lookup() to use
> RCU protected hash table") without the side effect of severely
>     delayed socket destruction.
> 
> Signed-off-by: Thomas Graf 
> Signed-off-by: David S. Miller 
> 
> 
> We can't hold mutex lock in a rcu callback, perhaps we could
> defer the mmap ring cleanup to a workqueue.

The socket should be dead at this point. It might be simpler to
add a netlink_release_ring() function which doesn't require
locking at all. It would also get rid of all the special handling
for closing vs. non-closing.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mmap()ed AF_NETLINK: lockdep and sleep-in-atomic warnings

2015-07-14 Thread Thomas Graf
On 07/13/15 at 10:11pm, Cong Wang wrote:
 Caused by:
 
 commit 21e4902aea80ef35afc00ee8d2abdea4f519b7f7
 Author: Thomas Graf tg...@suug.ch
 Date:   Fri Jan 2 23:00:22 2015 +0100
 
 netlink: Lockless lookup with RCU grace period in socket release
 
 Defers the release of the socket reference using call_rcu() to
 allow using an RCU read-side protected call to rhashtable_lookup()
 
 This restores behaviour and performance gains as previously
 introduced by e341694 (netlink: Convert netlink_lookup() to use
 RCU protected hash table) without the side effect of severely
 delayed socket destruction.
 
 Signed-off-by: Thomas Graf tg...@suug.ch
 Signed-off-by: David S. Miller da...@davemloft.net
 
 
 We can't hold mutex lock in a rcu callback, perhaps we could
 defer the mmap ring cleanup to a workqueue.

The socket should be dead at this point. It might be simpler to
add a netlink_release_ring() function which doesn't require
locking at all. It would also get rid of all the special handling
for closing vs. non-closing.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] rhashtable: fix for resize events during table walk

2015-07-14 Thread Thomas Graf
On 07/15/15 at 12:35am, mr...@linux.ee wrote:
 Yes, this fixes the error, thank you.
 
 The new problem with the test - soft lockup - CPU#0 stuck for 22s! is 
 still there on 360 MHz UltraSparc IIi. I understand it is harmless but 
 is there some easy way to make the test avoid NMI watchdog?
 
 [   58.374173] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! 
 [swapper:1]

They don't show up on my system. I will add a schedule() call to the
long loops.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 4.1 regression in resizable hashtable tests

2015-07-02 Thread Thomas Graf
On 07/01/15 at 01:21pm, Meelis Roos wrote:
> This is 4.1 on sparc64 - one of my boxes that happens to have most 
> runtime test left on from some debugging effort. In 4.0 it was fine, 4.1 
> gives this in dmesg:
> 
> [   31.898697] Running resizable hashtable tests...
> [   31.898915]   Adding 2048 keys
> [   31.952911]   Traversal complete: counted=17, nelems=2048, entries=2048
> [   31.953004] Test failed: Total count mismatch ^^^
> [   32.022676]   Traversal complete: counted=17, nelems=2048, entries=2048
> [   32.022788] Test failed: Total count mismatch ^^^
> [   32.022828]   Deleting 2048 keys

Thanks for the report. I think this is already fixed. Can you try with the
following commit:

commit 246b23a7695bd5a457aa51a36a948cce53d1d477
Author: Thomas Graf 
Date:   Thu Apr 30 22:37:44 2015 +

rhashtable-test: Use walker to test bucket statistics

As resizes may continue to run in the background, use walker to
ensure we see all entries. Also print the encountered number
of rehashes queued up while traversing.

This may lead to warnings due to entries being seen multiple
times. We consider them non-fatal.

Signed-off-by: Thomas Graf 
Signed-off-by: David S. Miller 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 4.1 regression in resizable hashtable tests

2015-07-02 Thread Thomas Graf
On 07/01/15 at 01:21pm, Meelis Roos wrote:
 This is 4.1 on sparc64 - one of my boxes that happens to have most 
 runtime test left on from some debugging effort. In 4.0 it was fine, 4.1 
 gives this in dmesg:
 
 [   31.898697] Running resizable hashtable tests...
 [   31.898915]   Adding 2048 keys
 [   31.952911]   Traversal complete: counted=17, nelems=2048, entries=2048
 [   31.953004] Test failed: Total count mismatch ^^^
 [   32.022676]   Traversal complete: counted=17, nelems=2048, entries=2048
 [   32.022788] Test failed: Total count mismatch ^^^
 [   32.022828]   Deleting 2048 keys

Thanks for the report. I think this is already fixed. Can you try with the
following commit:

commit 246b23a7695bd5a457aa51a36a948cce53d1d477
Author: Thomas Graf tg...@suug.ch
Date:   Thu Apr 30 22:37:44 2015 +

rhashtable-test: Use walker to test bucket statistics

As resizes may continue to run in the background, use walker to
ensure we see all entries. Also print the encountered number
of rehashes queued up while traversing.

This may lead to warnings due to entries being seen multiple
times. We consider them non-fatal.

Signed-off-by: Thomas Graf tg...@suug.ch
Signed-off-by: David S. Miller da...@davemloft.net
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rhashtable] [ INFO: suspicious RCU usage. ]

2015-04-02 Thread Thomas Graf
On 04/02/15 at 12:29pm, Herbert Xu wrote:
> On Thu, Apr 02, 2015 at 12:11:35PM +0800, Fengguang Wu wrote:
> > 
> > Yes it is contained in next-20150401 which is bad:
> > 
> > # extra tests on tree/branch next/master
> > git bisect  bad e954104e2b634b42811dad8d502cbf240f206df2  # 21:22  0-   
> >   60  Add linux-next specific files for 20150401
> > 
> > The dmesg there is
> > 
> > [1.149409] test_firmware: interface ready
> > [1.150293] Running resizable hashtable tests...
> > [1.151209]   Adding 2048 keys
> > [1.152069] [ cut here ]
> > [1.152978] WARNING: CPU: 0 PID: 1 at lib/rhashtable.c:409 
> > rhashtable_insert_rehash+0x9d/0x1d0()
> 
> I see.  This is actually a completely different problem.
> 
> ---8<---
> test_rhashtable: Remove bogus max_size setting
> 
> Now that resizing is completely automatic, we need to remove
> the max_size setting or the test will fail.
> 
> Reported-by: Fengguang Wu 
> Signed-off-by: Herbert Xu 

Acked-by: Thomas Graf 

Had the same fix queued up in an upcoming series ;-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rhashtable] [ INFO: suspicious RCU usage. ]

2015-04-02 Thread Thomas Graf
On 04/02/15 at 12:29pm, Herbert Xu wrote:
 On Thu, Apr 02, 2015 at 12:11:35PM +0800, Fengguang Wu wrote:
  
  Yes it is contained in next-20150401 which is bad:
  
  # extra tests on tree/branch next/master
  git bisect  bad e954104e2b634b42811dad8d502cbf240f206df2  # 21:22  0-   
60  Add linux-next specific files for 20150401
  
  The dmesg there is
  
  [1.149409] test_firmware: interface ready
  [1.150293] Running resizable hashtable tests...
  [1.151209]   Adding 2048 keys
  [1.152069] [ cut here ]
  [1.152978] WARNING: CPU: 0 PID: 1 at lib/rhashtable.c:409 
  rhashtable_insert_rehash+0x9d/0x1d0()
 
 I see.  This is actually a completely different problem.
 
 ---8---
 test_rhashtable: Remove bogus max_size setting
 
 Now that resizing is completely automatic, we need to remove
 the max_size setting or the test will fail.
 
 Reported-by: Fengguang Wu fengguang...@intel.com
 Signed-off-by: Herbert Xu herb...@gondor.apana.org.au

Acked-by: Thomas Graf tg...@suug.ch

Had the same fix queued up in an upcoming series ;-)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   3   4   >