Re: [PATCH v5] audit: log nftables configuration change events once per table

2021-04-01 Thread Phil Sutter
On Fri, Mar 26, 2021 at 01:38:59PM -0400, Richard Guy Briggs wrote:
> Reduce logging of nftables events to a level similar to iptables.
> Restore the table field to list the table, adding the generation.
> 
> Indicate the op as the most significant operation in the event.
> 
> A couple of sample events:
> 
> type=PROCTITLE msg=audit(2021-03-18 09:30:49.801:143) : 
> proctitle=/usr/bin/python3 -s /usr/sbin/firewalld --nofork --nopid
> type=SYSCALL msg=audit(2021-03-18 09:30:49.801:143) : arch=x86_64 
> syscall=sendmsg success=yes exit=172 a0=0x6 a1=0x7ffdcfcbe650 a2=0x0 
> a3=0x7ffdcfcbd52c items=0 ppid=1 pid=367 auid=unset uid=root gid=root 
> euid=root suid=root fsuid=root egid=root 
> sgid=root fsgid=root tty=(none) ses=unset comm=firewalld 
> exe=/usr/bin/python3.9 subj=system_u:system_r:firewalld_t:s0 key=(null)
> type=NETFILTER_CFG msg=audit(2021-03-18 09:30:49.801:143) : table=firewalld:2 
> family=ipv6 entries=1 op=nft_register_table pid=367 
> subj=system_u:system_r:firewalld_t:s0 comm=firewalld
> type=NETFILTER_CFG msg=audit(2021-03-18 09:30:49.801:143) : table=firewalld:2 
> family=ipv4 entries=1 op=nft_register_table pid=367 
> subj=system_u:system_r:firewalld_t:s0 comm=firewalld
> type=NETFILTER_CFG msg=audit(2021-03-18 09:30:49.801:143) : table=firewalld:2 
> family=inet entries=1 op=nft_register_table pid=367 
> subj=system_u:system_r:firewalld_t:s0 comm=firewalld
> 
> type=PROCTITLE msg=audit(2021-03-18 09:30:49.839:144) : 
> proctitle=/usr/bin/python3 -s /usr/sbin/firewalld --nofork --nopid
> type=SYSCALL msg=audit(2021-03-18 09:30:49.839:144) : arch=x86_64 
> syscall=sendmsg success=yes exit=22792 a0=0x6 a1=0x7ffdcfcbe650 a2=0x0 
> a3=0x7ffdcfcbd52c items=0 ppid=1 pid=367 auid=unset uid=root gid=root 
> euid=root suid=root fsuid=root egid=root 
> sgid=root fsgid=root tty=(none) ses=unset comm=firewalld 
> exe=/usr/bin/python3.9 subj=system_u:system_r:firewalld_t:s0 key=(null)
> type=NETFILTER_CFG msg=audit(2021-03-18 09:30:49.839:144) : table=firewalld:3 
> family=ipv6 entries=30 op=nft_register_chain pid=367 
> subj=system_u:system_r:firewalld_t:s0 comm=firewalld
> type=NETFILTER_CFG msg=audit(2021-03-18 09:30:49.839:144) : table=firewalld:3 
> family=ipv4 entries=30 op=nft_register_chain pid=367 
> subj=system_u:system_r:firewalld_t:s0 comm=firewalld
> type=NETFILTER_CFG msg=audit(2021-03-18 09:30:49.839:144) : table=firewalld:3 
> family=inet entries=165 op=nft_register_chain pid=367 
> subj=system_u:system_r:firewalld_t:s0 comm=firewalld
> 
> The issue was originally documented in
> https://github.com/linux-audit/audit-kernel/issues/124
> 
> Signed-off-by: Richard Guy Briggs 

Tested this patch to make sure it eliminates the slowdown of
iptables-nft when auditd is running. With this applied, neither
iptables-nft-restore nor 'iptables-nft -F' shows a significant run-time
difference between auditd running and stopped, at least for large
rulesets. Individual calls suffer from the added audit logging, but
that's expected of course.

Tested-by: Phil Sutter 

Thanks, Phil


Re: [PATCH] audit: log nftables configuration change events once per table

2021-03-19 Thread Phil Sutter
On Thu, Mar 18, 2021 at 02:37:03PM -0400, Richard Guy Briggs wrote:
> On 2021-03-18 17:30, Phil Sutter wrote:
[...]
> > Why did you leave the object-related logs in place? They should reappear
> > at commit time just like chains and sets for instance, no?
> 
> There are other paths that can trigger these messages that don't go
> through nf_tables_commit() that affect the configuration data.  The
> counters are considered config data for auditing purposes and the act of
> resetting them is auditable.  And the only time we want to emit a
> record is when they are being reset.

Oh, I see. I wasn't aware 'nft reset' bypasses the transaction logic,
thanks for clarifying!

Cheers, Phil


Re: [PATCH] audit: log nftables configuration change events once per table

2021-03-18 Thread Phil Sutter
Hi,

On Thu, Mar 18, 2021 at 11:39:52AM -0400, Richard Guy Briggs wrote:
> Reduce logging of nftables events to a level similar to iptables.
> Restore the table field to list the table, adding the generation.

This looks much better, a few remarks below:

[...]
> +static const u8 nft2audit_op[] = { // enum nf_tables_msg_types
> + /* NFT_MSG_NEWTABLE */  AUDIT_NFT_OP_TABLE_REGISTER,
> + /* NFT_MSG_GETTABLE */  AUDIT_NFT_OP_INVALID,
> + /* NFT_MSG_DELTABLE */  AUDIT_NFT_OP_TABLE_UNREGISTER,
> + /* NFT_MSG_NEWCHAIN */  AUDIT_NFT_OP_CHAIN_REGISTER,
> + /* NFT_MSG_GETCHAIN */  AUDIT_NFT_OP_INVALID,
> + /* NFT_MSG_DELCHAIN */  AUDIT_NFT_OP_CHAIN_UNREGISTER,
> + /* NFT_MSG_NEWRULE  */  AUDIT_NFT_OP_RULE_REGISTER,
> + /* NFT_MSG_GETRULE  */  AUDIT_NFT_OP_INVALID,
> + /* NFT_MSG_DELRULE  */  AUDIT_NFT_OP_RULE_UNREGISTER,
> + /* NFT_MSG_NEWSET   */  AUDIT_NFT_OP_SET_REGISTER,
> + /* NFT_MSG_GETSET   */  AUDIT_NFT_OP_INVALID,
> + /* NFT_MSG_DELSET   */  AUDIT_NFT_OP_SET_UNREGISTER,
> + /* NFT_MSG_NEWSETELEM   */  AUDIT_NFT_OP_SETELEM_REGISTER,
> + /* NFT_MSG_GETSETELEM   */  AUDIT_NFT_OP_INVALID,
> + /* NFT_MSG_DELSETELEM   */  AUDIT_NFT_OP_SETELEM_UNREGISTER,
> + /* NFT_MSG_NEWGEN   */  AUDIT_NFT_OP_GEN_REGISTER,
> + /* NFT_MSG_GETGEN   */  AUDIT_NFT_OP_INVALID,
> + /* NFT_MSG_TRACE*/  AUDIT_NFT_OP_INVALID,
> + /* NFT_MSG_NEWOBJ   */  AUDIT_NFT_OP_OBJ_REGISTER,
> + /* NFT_MSG_GETOBJ   */  AUDIT_NFT_OP_INVALID,
> + /* NFT_MSG_DELOBJ   */  AUDIT_NFT_OP_OBJ_UNREGISTER,
> + /* NFT_MSG_GETOBJ_RESET */  AUDIT_NFT_OP_OBJ_RESET,
> + /* NFT_MSG_NEWFLOWTABLE */  AUDIT_NFT_OP_FLOWTABLE_REGISTER,
> + /* NFT_MSG_GETFLOWTABLE */  AUDIT_NFT_OP_INVALID,
> + /* NFT_MSG_DELFLOWTABLE */  AUDIT_NFT_OP_FLOWTABLE_UNREGISTER,
> + /* NFT_MSG_MAX  */  AUDIT_NFT_OP_INVALID,
> +};

NFT_MSG_MAX is itself not a valid message; it serves merely as an upper
bound for arrays, loops or sanity checks. You will never see it in
trans->msg_type.

Since enum nf_tables_msg_types contains consecutive values from 0 to
NFT_MSG_MAX, you could write the above more explicitly:

| static const u8 nft2audit_op[NFT_MSG_MAX] = {
|   [NFT_MSG_NEWTABLE]  = AUDIT_NFT_OP_TABLE_REGISTER,
|   [NFT_MSG_GETTABLE]  = AUDIT_NFT_OP_INVALID,
|   [NFT_MSG_DELTABLE]  = AUDIT_NFT_OP_TABLE_UNREGISTER,
(And so forth.)

Not a must, but it clarifies the 1:1 mapping between index and said
enum. Sadly, AUDIT_NFT_OP_INVALID is non-zero. Otherwise one could skip
all uninteresting ones.
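
For context, a minimal sketch of how such a table might be consumed at
commit time (assumed usage for illustration, not a quote from the patch):
since trans->msg_type is always below NFT_MSG_MAX, a lookup needs no
bounds check beyond skipping the AUDIT_NFT_OP_INVALID entries.

| 	list_for_each_entry(trans, &net->nft.commit_list, list) {
| 		if (nft2audit_op[trans->msg_type] == AUDIT_NFT_OP_INVALID)
| 			continue;
| 		/* accumulate or emit one audit record per table+generation */
| 	}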

[...]
> @@ -6278,12 +6219,11 @@ static int nf_tables_dump_obj(struct sk_buff *skb, 
> struct netlink_callback *cb)
>   filter->type != NFT_OBJECT_UNSPEC &&
>   obj->ops->type->type != filter->type)
>   goto cont;
> -
>   if (reset) {
>   char *buf = kasprintf(GFP_ATOMIC,
> -   "%s:%llu;?:0",
> +   "%s:%u",
> table->name,
> -   table->handle);
> +   net->nft.base_seq);
>  
>   audit_log_nfcfg(buf,
>   family,

Why did you leave the object-related logs in place? They should reappear
at commit time just like chains and sets for instance, no?

Thanks, Phil


Re: [PATCH ghak124 v3] audit: log nftables configuration change events

2021-02-12 Thread Phil Sutter
Hi,

On Thu, Feb 11, 2021 at 04:02:55PM -0500, Steve Grubb wrote:
> On Thursday, February 11, 2021 11:29:34 AM EST Paul Moore wrote:
> > > If I'm not mistaken, iptables emits a single audit log per table, ipset
> > > doesn't support audit at all. So I wonder how much audit logging is
> > > required at all (for certification or whatever reason). How much
> > > granularity is desired?
> 
> > I believe the netfilter auditing was mostly a nice-to-have bit of
> > functionality to help add to the completeness of the audit logs, but I
> > could very easily be mistaken.  Richard put together those patches, he
> > can probably provide the background/motivation for the effort.
> 
> There are certifications which levy requirements on information flow control. 
> The firewall can decide if information should flow or be blocked. Information 
> flow decisions need to be auditable - which we have with the audit target. 

In nftables, this is realized via the 'log level audit' statement.
Functionality should by all means be identical to that of xtables' AUDIT
target.

> That then swings in requirements on the configuration of the information flow 
> policy.
> 
> The requirements state a need to audit any management activity - meaning the 
> creation, modification, and/or deletion of a "firewall ruleset". Because it 
> talks constantly about a ruleset and then individual rules, I suspect only 1 
> summary event is needed to say something happened, who did it, and the 
> outcome. This would be in line with how selinux is treated: we have 1 summary 
> event for loading/modifying/unloading selinux policy.

So for audit purposes the central elements are firewall rules, and
NETFILTER_CFG notifications merely serve to assert that changes to those
rules are noticed by the auditing system. Looking at xtables again, this
seems coherent: any change causes the whole table blob to be replaced
(while the other tables stay in place), so table replace/create is the
most common place for a change notification. In nftables, the most
common one is the generation bump - all tables are treated as elements
of the same ruleset, not individually like in xtables.

Richard, assuming the above is correct, are you fine with reducing
nftables auditing to a single notification per transaction then? I guess
Florian sufficiently illustrated how this would be implemented.
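
As a rough sketch of what that could mean in code (an illustration under
assumptions only: the helper name is made up and the audit_log_nfcfg()
argument order follows the calls visible in the patch hunks quoted in
this thread):

| static void nft_audit_transaction(const struct net *net,
| 				  unsigned int entries)
| {
| 	char *buf = kasprintf(GFP_KERNEL, "?:%u", net->nft.base_seq);
|
| 	/* one NETFILTER_CFG record per committed generation */
| 	audit_log_nfcfg(buf, NFPROTO_UNSPEC, entries,
| 			AUDIT_NFT_OP_GEN_REGISTER, GFP_KERNEL);
| 	kfree(buf);
| }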

> Hope this helps...

It does, thanks a lot for the information!

Thanks, Phil


Re: [PATCH ghak124 v3] audit: log nftables configuration change events

2021-02-11 Thread Phil Sutter
Hi,

On Thu, Jun 04, 2020 at 09:20:49AM -0400, Richard Guy Briggs wrote:
> iptables, ip6tables, arptables and ebtables table registration,
> replacement and unregistration configuration events are logged for the
> native (legacy) iptables setsockopt api, but not for the
> nftables netlink api which is used by the nft-variant of iptables in
> addition to nftables itself.
> 
> Add calls to log the configuration actions in the nftables netlink api.

As discussed offline already, these audit notifications are pretty hefty
performance-wise. An internal report measured the restore time of a
ruleset containing 70k set elements at 300% of what it was before.

If I'm not mistaken, iptables emits a single audit log per table, ipset
doesn't support audit at all. So I wonder how much audit logging is
required at all (for certification or whatever reason). How much
granularity is desired?

I personally would notify once per transaction. This is easy and quick.
Once per table or chain should be acceptable as well. At the very least,
we should not have to notify once for each element: per-element updates
are the last resort for fast ruleset adjustments, and if we lose that,
people are better off with ipset IMHO.

Unlike nft monitor, auditd is not designed to be disabled "at will", so
turning it off for performance-critical workloads is not an option.

Cheers, Phil


Re: netfilter: does the API break or something else ?

2020-05-14 Thread Phil Sutter
Hi,

On Wed, May 13, 2020 at 11:20:35PM +0800, Xiubo Li wrote:
> Recently I hit one netfilter issue, it seems the API breaks or something 
> else.

Just for the record, this was caused by a misconfigured kernel.

Cheers, Phil


Re: linux-next: Signed-off-by missing for commit in the netfilter tree

2019-07-24 Thread Phil Sutter
Hi,

On Thu, Jul 25, 2019 at 07:18:03AM +1000, Stephen Rothwell wrote:
> Commit
> 
>   5f5ff5ca2e18 ("netfilter: nf_tables: Make nft_meta expression more robust")
> 
> is missing a Signed-off-by from its author.

Argh, my SoB ended up in the changelog I put below the commit message
and hence was dropped during git-am. Thanks for the heads-up, Stephen!

Pablo, can we fix this somehow?

Sorry for the mess, Phil


Re: [PATCH] test_rhashtable: remove semaphore usage

2018-12-10 Thread Phil Sutter
Hi,

On Tue, Dec 11, 2018 at 01:45:52PM +0800, Herbert Xu wrote:
> On Mon, Dec 10, 2018 at 10:17:20PM +0100, Arnd Bergmann wrote:
> > This is one of only two files that initialize a semaphore to a negative
> > value. We don't really need the two semaphores here at all, but can do
> > the same thing in a more conventional and more efficient way, by using a
> > single waitqueue and an atomic thread counter.
> > 
> > This gets us a little bit closer to eliminating classic semaphores from
> > the kernel. It also fixes a corner case where we fail to continue after
> > one of the threads fails to start up.
> > 
> > An alternative would be to use a split kthread_create()+wake_up_process()
> > and completely eliminate the separate synchronization.
> > 
> > Signed-off-by: Arnd Bergmann 
> > ---
> > This is part of a longer, untested, series to remove semaphores
> > from the kernel, please review as such before applying.
> > ---
> >  lib/test_rhashtable.c | 28 ++++++++++++++++------------
> >  1 file changed, 16 insertions(+), 12 deletions(-)
> 
> This was created by Phil Sutter so I am adding him to the cc list.

Thanks, I would have missed it otherwise. Just a minor nit:

> > diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
> > index 18de5ff1255b..12bdea4f6c20 100644
> > --- a/lib/test_rhashtable.c
> > +++ b/lib/test_rhashtable.c
[...]
> > @@ -635,8 +636,9 @@ static int threadfunc(void *data)
> > int i, step, err = 0, insert_retries = 0;
> > struct thread_data *tdata = data;
> >  
> > -   up(_sem);
> > -   if (down_interruptible(_sem))
> > +   if (atomic_dec_and_test(_count))
> > +   wake_up(_wait);
> > +   if (wait_event_interruptible(startup_wait, atomic_read(_count) 
> > == -1))
> > pr_err("  thread[%d]: down_interruptible failed\n", tdata->id);

The error message should probably be adjusted as well. Apart from that:

Acked-by: Phil Sutter 

Thanks, Phil


Re: [PATCH 2/2] iproute2: add support for invisible qdisc dumping

2017-02-27 Thread Phil Sutter
On Sat, Feb 25, 2017 at 10:29:17PM +0100, Jiri Kosina wrote:
> From: Jiri Kosina 
> 
> Support the new TCA_DUMP_INVISIBLE netlink attribute that allows asking 
> kernel to perform 'full qdisc dump', as for historical reasons some of the 
> default qdiscs are being hidden by the kernel.
> 
> The command syntax is being extended by an optional 'invisible' argument to
> 'tc qdisc show'.
> 
> Signed-off-by: Jiri Kosina 
> ---
>  tc/tc_qdisc.c | 25 +++++++++++++++++++++++--
>  1 file changed, 23 insertions(+), 2 deletions(-)

Would you mind adding a description of the new keyword to tc man page as
well?

> diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c
> index 3a3701c2..29da9269 100644
> --- a/tc/tc_qdisc.c
> +++ b/tc/tc_qdisc.c
> @@ -34,7 +34,7 @@ static int usage(void)
>   fprintf(stderr, "   [ stab [ help | STAB_OPTIONS] ]\n");
>   fprintf(stderr, "   [ [ QDISC_KIND ] [ help | OPTIONS ] ]\n");
>   fprintf(stderr, "\n");
> - fprintf(stderr, "   tc qdisc show [ dev STRING ] [ ingress | clsact ]\n");
> + fprintf(stderr, "   tc qdisc show [ dev STRING ] [ ingress | clsact | invisible ]\n");

Doesn't look like these are mutually exclusive. Therefore I would
suggest fixing the syntax to:

| + fprintf(stderr, "   tc qdisc show [ dev STRING ] [ ingress | clsact ] [ invisible ]\n");

Cheers, Phil


Re: Deleting child qdisc doesn't reset parent to default qdisc?

2016-04-14 Thread Phil Sutter
On Thu, Apr 14, 2016 at 08:44:40AM -0700, Eric Dumazet wrote:
> On Thu, 2016-04-14 at 17:34 +0200, Jiri Kosina wrote:
> > On Thu, 14 Apr 2016, Phil Sutter wrote:
> > 
> > > OTOH some qdiscs (CBQ, DRR, DSMARK, HFSC, HTB, QFQ) assign the default
> > > one upon deletion instead of noop_qdisc, hence I would describe
> > > the situation using the words 'inconsistent' and 'accident' rather than
> > > 'expected'. :)
> > 
> > Exactly. I'd again like to stress the fact that this configuration works:
> > 
> > jikos:~ # tc qdisc show
> > qdisc tbf 10: dev eth0 root refcnt 2 rate 800Mbit burst 131000b lat 
> > 1.0ms 
> > 
> > and this (after performing add/delete operation) doesn't:
> > 
> > jikos:~ # tc qdisc show
> > qdisc tbf 10: dev eth0 root refcnt 2 rate 800Mbit burst 131000b lat 
> > 1.0ms 
> > 
> > It's hard to spot a difference (hint: there is none).
> 
> This is because some qdisc are not visible in the dump.

And those being invisible can be overridden using 'tc qd add', right?
AFAIR they're not listed because they don't properly register, so the
system doesn't care to override them. In this case we could change all
classful qdiscs to restore the default qdisc instead of noop when a leaf
qdisc is being deleted (noop is probably not what the user wants
anyway).
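
To illustrate, restoring the default instead of noop would mean a
pattern roughly like the following in a classful qdisc's leaf teardown
path (field names and the surrounding locking are assumed here, for
illustration only):

| 	new = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
| 				cl->common.classid);
| 	if (!new)
| 		new = &noop_qdisc;	/* fall back only on allocation failure */
| 	old = cl->qdisc;
| 	cl->qdisc = new;
| 	qdisc_reset(old);
| 	qdisc_destroy(old);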

Cheers, Phil


Re: Deleting child qdisc doesn't reset parent to default qdisc?

2016-04-14 Thread Phil Sutter
On Thu, Apr 14, 2016 at 08:01:39AM -0700, Eric Dumazet wrote:
> On Thu, 2016-04-14 at 16:44 +0200, Jiri Kosina wrote:
> > Hi,
> > 
> > I've come across the behavior where adding a child qdisc and then deleting 
> > it again makes the networking dysfunctional (I guess that's because all of 
> > a sudden there is absolutely no working qdisc on the device, although 
> > there originally was a default one in the parent).
> > 
> > In a nutshell, is this expected behavior or bug?
> 
> This is the expected behavior.

OTOH some qdiscs (CBQ, DRR, DSMARK, HFSC, HTB, QFQ) assign the default
one upon deletion instead of noop_qdisc, hence I would describe
the situation using the words 'inconsistent' and 'accident' rather than
'expected'. :)

Anyhow, the problem with skilled admins is that they accept quirks too
easily and just build their scripts around them - the same scripts we
then have to stay compatible with.

Cheers, Phil


Re: [PATCH net] tuntap: restore default qdisc

2016-04-08 Thread Phil Sutter
On Fri, Apr 08, 2016 at 01:26:48PM +0800, Jason Wang wrote:
> After commit f84bb1eac027 ("net: fix IFF_NO_QUEUE for drivers using
> alloc_netdev"), default qdisc was changed to noqueue because
> tuntap does not set tx_queue_len during .setup(). This patch restores
> default qdisc by setting tx_queue_len in tun_setup().
> 
> Fixes: f84bb1eac027 ("net: fix IFF_NO_QUEUE for drivers using alloc_netdev")
> Cc: Phil Sutter <p...@nwl.cc>
> Signed-off-by: Jason Wang <jasow...@redhat.com>

Acked-by: Phil Sutter <p...@nwl.cc>


[net PATCH] IFF_NO_QUEUE: Fix for drivers not calling ether_setup()

2016-02-17 Thread Phil Sutter
My implementation around the IFF_NO_QUEUE driver flag assumed that
drivers leaving tx_queue_len untouched (specifically: not setting it to
zero) would make it possible to assign a regular qdisc to them without
having to worry about setting tx_queue_len to a useful value. This was
only partially true: I overlooked that some drivers don't call
ether_setup() and therefore do not initialize tx_queue_len to the
default value of 1000. Consequently, removing the workarounds in place
for that case in qdisc implementations which cared about it (namely
pfifo, bfifo, gred, htb, plug and sfb) leads to problems with these
specific interface types and qdiscs.

Luckily, there's already a sanitization point for drivers setting
tx_queue_len to zero, which can be reused to assign the fallback value
most qdisc implementations used, which is 1.

Fixes: 348e3435cbefa ("net: sched: drop all special handling of tx_queue_len == 0")
Tested-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
Signed-off-by: Phil Sutter <p...@nwl.cc>
---
 net/core/dev.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 3f4071a84a03f..75383f40e7ced 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7440,8 +7440,10 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, 
const char *name,
dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM;
setup(dev);
 
-   if (!dev->tx_queue_len)
+   if (!dev->tx_queue_len) {
dev->priv_flags |= IFF_NO_QUEUE;
+   dev->tx_queue_len = 1;
+   }
 
dev->num_tx_queues = txqs;
dev->real_num_tx_queues = txqs;
-- 
2.7.0



Re: [PATCH] Revert "net: sched: drop all special handling of tx_queue_len == 0"

2016-02-17 Thread Phil Sutter
On Wed, Feb 17, 2016 at 01:57:42PM +, Mathieu Desnoyers wrote:
> - On Feb 17, 2016, at 7:47 AM, Phil Sutter p...@nwl.cc wrote:
> 
> > Hi,
> > 
> > On Tue, Feb 16, 2016 at 07:56:23PM -0500, Mathieu Desnoyers wrote:
> >> This reverts commit 348e3435cbefa815bd56a5205c1412b5afe7b92e.
> >> It breaks HTB classful qdiscs on the loopback interface.
> >> 
> >> It has been broken since kernel v4.2. The offending commit has
> >> been identified by bissection of the issue with the following
> >> test-case. It appears that the loopback interface does indeed
> >> still have tx_queue_len == 0.
> >> 
> >> Reverting the commit on a v4.4.1 kernel fixes the issue.
> > 
> > Indeed, this is ugly. Affected are all drivers not calling ether_setup()
> > in their setup callback, and therefore not initializing tx_queue_len.
> > 
> > As the commit to be reverted shows, there is no common fallback value
> > for tx_queue_len - most qdisc implementations used 1, but HTB fell back
> > to 2 and PLUG to 100. But as the removed comment in sch_plug.c stated:
> > 
> >| /* We will set a default limit of 100 pkts (~150kB)
> >|  * in case tx_queue_len is not available. The
> >|  * default value is completely arbitrary.
> >|  */
> > 
> > It seems not to be overly important to fallback to the exact value each
> > qdisc had before. Therefore I guess the following change should
> > appropriately fix the issue at hand:
> > 
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -7440,8 +7440,10 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv,
> > const char *name,
> >dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM;
> >setup(dev);
> > 
> > -   if (!dev->tx_queue_len)
> > +   if (!dev->tx_queue_len) {
> >dev->priv_flags |= IFF_NO_QUEUE;
> > +   dev->tx_queue_len = 1;
> > +   }
> > 
> >dev->num_tx_queues = txqs;
> >dev->real_num_tx_queues = txqs;
> > 
> > Unless there is concern, I will formally submit this later.
> 
> Works for me!
> 
> Tested-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>

Cool, thanks for testing!

Cheers, Phil


Re: [PATCH] Revert "net: sched: drop all special handling of tx_queue_len == 0"

2016-02-17 Thread Phil Sutter
Hi,

On Tue, Feb 16, 2016 at 07:56:23PM -0500, Mathieu Desnoyers wrote:
> This reverts commit 348e3435cbefa815bd56a5205c1412b5afe7b92e.
> It breaks HTB classful qdiscs on the loopback interface.
> 
> It has been broken since kernel v4.2. The offending commit has
> been identified by bissection of the issue with the following
> test-case. It appears that the loopback interface does indeed
> still have tx_queue_len == 0.
> 
> Reverting the commit on a v4.4.1 kernel fixes the issue.

Indeed, this is ugly. Affected are all drivers not calling ether_setup()
in their setup callback, and therefore not initializing tx_queue_len.

As the commit to be reverted shows, there is no common fallback value
for tx_queue_len - most qdisc implementations used 1, but HTB fell back
to 2 and PLUG to 100. But as the removed comment in sch_plug.c stated:

| /* We will set a default limit of 100 pkts (~150kB)
|  * in case tx_queue_len is not available. The
|  * default value is completely arbitrary.
|  */

It seems not to be overly important to fall back to the exact value each
qdisc had before. Therefore I guess the following change should
appropriately fix the issue at hand:

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7440,8 +7440,10 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, 
const char *name,
dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM;
setup(dev);
 
-   if (!dev->tx_queue_len)
+   if (!dev->tx_queue_len) {
dev->priv_flags |= IFF_NO_QUEUE;
+   dev->tx_queue_len = 1;
+   }
 
dev->num_tx_queues = txqs;
dev->real_num_tx_queues = txqs;

Unless there is concern, I will formally submit this later.

Thanks, Phil


Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

2015-12-04 Thread Phil Sutter
On Fri, Dec 04, 2015 at 09:45:20AM -0800, Eric Dumazet wrote:
> On Fri, 2015-12-04 at 18:01 +0100, Phil Sutter wrote:
> > On Fri, Dec 04, 2015 at 10:39:56PM +0800, Herbert Xu wrote:
> > > On Thu, Dec 03, 2015 at 08:08:39AM -0800, Eric Dumazet wrote:
> > > >
> > > > Anyway, __vmalloc() can be used with GFP_ATOMIC, have you tried this ?
> > > 
> > > OK I've tried it and I no longer get any ENOMEM errors!
> > 
> > I can't confirm this, sadly. Using 50 threads, results seem to be stable
> > and good. But increasing the number of threads I can provoke ENOMEM
> > condition again. See attached log which shows a failing test run with
> > 100 threads.
> > 
> > I tried to extract logs of a test run with as few as possible failing
> > threads, but wasn't successful. It seems like the error amplifies
> > itself: While having stable success with less than 70 threads, going
> > beyond a margin I could not identify exactly, much more threads failed
> > than expected. For instance, the attached log shows 70 out of 100
> > threads failing, while for me every single test with 50 threads was
> > successful.
> 
> But this patch is about GFP_ATOMIC allocations, I doubt your test is
> using GFP_ATOMIC.
> 
> Threads (process context) should use GFP_KERNEL allocations.

Well, I assumed Herbert did his tests using test_rhashtable, and
therefore fixed whatever code path that test triggers. Maybe I'm wrong,
though.

Looking at the vmalloc allocation failure trace, it seems like it is
indeed trying to use GFP_ATOMIC from inside those threads: if I'm not
missing anything, bucket_table_alloc is called from
rhashtable_insert_rehash, which passes GFP_ATOMIC unconditionally. But
then again bucket_table_alloc should use kzalloc if 'gfp != GFP_KERNEL',
so I'm probably just cross-eyed right now.
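
For reference, a simplified reconstruction of the allocation path being
discussed (4.4-era lib/rhashtable.c from memory - a sketch, not a
verbatim copy), which is why atomic-context resizes are limited to
kzalloc() and may fail under memory pressure:

| static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
| 					       size_t size, gfp_t gfp)
| {
| 	struct bucket_table *tbl = NULL;
|
| 	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) ||
| 	    gfp != GFP_KERNEL)
| 		tbl = kzalloc(size, gfp | __GFP_NOWARN | __GFP_NORETRY);
| 	if (tbl == NULL && gfp == GFP_KERNEL)
| 		tbl = vzalloc(size);	/* may sleep, hence GFP_KERNEL only */
|
| 	/* bucket/lock initialization omitted */
| 	return tbl;
| }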

> BTW, if 100 threads are simultaneously trying to vmalloc(32 MB), this
> might not be very wise :(
> 
> Only one should really do this, while others are waiting.

Sure, that was my previous understanding of how this thing works.

> If we really want parallelism (multiple cpus coordinating their effort),
> it should be done very differently.

Maybe my approach of stress-testing rhashtable was too naive in the
first place.

Thanks, Phil


Re: rhashtable: Use __vmalloc with GFP_ATOMIC for table allocation

2015-12-04 Thread Phil Sutter
On Fri, Dec 04, 2015 at 10:39:56PM +0800, Herbert Xu wrote:
> On Thu, Dec 03, 2015 at 08:08:39AM -0800, Eric Dumazet wrote:
> >
> > Anyway, __vmalloc() can be used with GFP_ATOMIC, have you tried this ?
> 
> OK I've tried it and I no longer get any ENOMEM errors!

I can't confirm this, sadly. Using 50 threads, results seem to be stable
and good. But when increasing the number of threads I can provoke the
ENOMEM condition again. See the attached log, which shows a failing test
run with 100 threads.

I tried to extract logs of a test run with as few failing threads as
possible, but wasn't successful. It seems like the error amplifies
itself: while I had stable success with fewer than 70 threads, going
beyond a margin I could not identify exactly, many more threads failed
than expected. For instance, the attached log shows 70 out of 100
threads failing, while for me every single test with 50 threads was
successful.

HTH, Phil
[ 5196.212230] Running rhashtable test nelem=8, max_size=0, shrinking=0
[ 5196.243846] Test 00:
[ 5196.245990]   Adding 5 keys
[ 5196.250787] Info: encountered resize
[ 5196.251631] Info: encountered resize
[ 5196.252773] Info: encountered resize
[ 5196.256076]   Traversal complete: counted=5, nelems=5, 
entries=5, table-jumps=3
[ 5196.261961]   Traversal complete: counted=5, nelems=5, 
entries=5, table-jumps=0
[ 5196.263282]   Deleting 5 keys
[ 5196.267054]   Duration of test: 20359448 ns
[ 5196.267762] Test 01:
[ 5196.270278]   Adding 5 keys
[ 5196.276804] Info: encountered resize
[ 5196.278164]   Traversal complete: counted=5, nelems=5, 
entries=5, table-jumps=1
[ 5196.287668]   Traversal complete: counted=5, nelems=5, 
entries=5, table-jumps=0
[ 5196.289246]   Deleting 5 keys
[ 5196.293725]   Duration of test: 22689015 ns
[ 5196.294902] Test 02:
[ 5196.297878]   Adding 5 keys
[ 5196.305348] Info: encountered resize
[ 5196.306093] Info: encountered resize
[ 5196.306815] Info: encountered resize
[ 5196.307529] Info: encountered resize
[ 5196.308262] Info: encountered resize
[ 5196.308973] Info: encountered resize
[ 5196.309699] Info: encountered resize
[ 5196.310449] Info: encountered resize
[ 5196.311228] Info: encountered resize
[ 5196.311996] Info: encountered resize
[ 5196.312957] Info: encountered resize
[ 5196.314178] Info: encountered resize
[ 5196.318068]   Traversal complete: counted=5, nelems=5, 
entries=5, table-jumps=12
[ 5196.324140]   Traversal complete: counted=5, nelems=5, 
entries=5, table-jumps=0
[ 5196.325661]   Deleting 5 keys
[ 5196.338875]   Duration of test: 39997796 ns
[ 5196.339718] Test 03:
[ 5196.341610]   Adding 5 keys
[ 5196.349677]   Traversal complete: counted=5, nelems=5, 
entries=5, table-jumps=0
[ 5196.356153]   Traversal complete: counted=5, nelems=5, 
entries=5, table-jumps=0
[ 5196.357704]   Deleting 5 keys
[ 5196.362173]   Duration of test: 19844019 ns
[ 5196.363055] Average test time: 25722569
[ 5196.363815] Testing concurrent rhashtable access from 100 threads
[ 5196.684648] vmalloc: allocation failure, allocated 22126592 of 33562624 bytes
[ 5196.685195]   thread[87]: rhashtable_insert_fast failed
[ 5196.687075] rhashtable_thra: page allocation failure: order:0, mode:0x2088022
[ 5196.687652] vmalloc: allocation failure, allocated 22245376 of 33562624 bytes
[ 5196.687653] rhashtable_thra: page allocation failure: order:0, mode:0x2088022
[ 5196.687655] CPU: 1 PID: 12259 Comm: rhashtable_thra Not tainted 
4.4.0-rc1rhashtable+ #141
[ 5196.687656] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 5196.687659]   25caf0f8 88003094bc70 
81308384
[ 5196.687660]  02088022 88003094bd00 8117b18c 
81815d58
[ 5196.687661]  88003094bc90 0018 88003094bd10 
88003094bcb0
[ 5196.687661] Call Trace:
[ 5196.687667]  [] dump_stack+0x44/0x60
[ 5196.687669]  [] warn_alloc_failed+0xfc/0x170
[ 5196.687673]  [] ? alloc_pages_current+0x8c/0x110
[ 5196.687675]  [] __vmalloc_node_range+0x18e/0x290
[ 5196.687677]  [] ? bucket_table_alloc+0x4a/0xf0
[ 5196.687681]  [] __vmalloc+0x4a/0x50
[ 5196.687682]  [] ? bucket_table_alloc+0x4a/0xf0
[ 5196.687683]  [] bucket_table_alloc+0x4a/0xf0
[ 5196.687684]  [] rhashtable_insert_rehash+0x5d/0xe0
[ 5196.687687]  [] 
insert_retry.isra.9.constprop.15+0x177/0x270 [test_rhashtable]
[ 5196.687689]  [] threadfunc+0xa6/0x38e [test_rhashtable]
[ 5196.687691]  [] ? __schedule+0x2ac/0x920
[ 5196.687693]  [] ? 
insert_retry.isra.9.constprop.15+0x270/0x270 [test_rhashtable]
[ 5196.687695]  [] ? 
insert_retry.isra.9.constprop.15+0x270/0x270 [test_rhashtable]
[ 5196.687697]  [] kthread+0xd8/0xf0
[ 5196.687698]  [] ? kthread_park+0x60/0x60
[ 5196.687700]  [] ret_from_fork+0x3f/0x70
[ 5196.687701]  [] ? kthread_park+0x60/0x60
[ 5196.687702] Mem-Info:
[ 5196.687704] active_anon:16125 


Re: rhashtable: Prevent spurious EBUSY errors on insertion

2015-12-03 Thread Phil Sutter
On Thu, Dec 03, 2015 at 08:41:29PM +0800, Herbert Xu wrote:
> On Mon, Nov 30, 2015 at 06:18:59PM +0800, Herbert Xu wrote:
> > 
> > OK that's better.  I think I see the problem.  The test in
> > rhashtable_insert_rehash is racy and if two threads both try
> > to grow the table one of them may be tricked into doing a rehash
> > instead.
> > 
> > I'm working on a fix.
> 
> OK this patch fixes the EBUSY problem as far as I can tell.  Please
> let me know if you still observe EBUSY with it.  I'll respond to the
> ENOMEM problem in another email.
> 
> ---8<---
> Thomas and Phil observed that under stress rhashtable insertion
> sometimes failed with EBUSY, even though this error should only
> ever be seen when we're under attack and our hash chain length
> has grown to an unacceptable level, even after a rehash.
> 
> It turns out that the logic for detecting whether there is an
> existing rehash is faulty.  In particular, when two threads both
> try to grow the same table at the same time, one of them may see
> the newly grown table and thus erroneously conclude that it had
> been rehashed.  This is what leads to the EBUSY error.
> 
> This patch fixes this by remembering the current last table we
> used during insertion so that rhashtable_insert_rehash can detect
> when another thread has also done a resize/rehash.  When this is
> detected we will give up our resize/rehash and simply retry the
> insertion with the new table.
> 
> Reported-by: Thomas Graf 
> Reported-by: Phil Sutter 
> Signed-off-by: Herbert Xu 

Tested-by: Phil Sutter 
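
As a rough illustration of the detection idea in the quoted commit
message (a sketch under assumptions, with a made-up helper name - not
the actual patch): the insertion path remembers which table it used, and
the rehash path compares it against the hashtable's current table.

| static int insert_rehash_sketch(struct rhashtable *ht,
| 				struct bucket_table *tbl_used)
| {
| 	struct bucket_table *cur = rht_dereference_rcu(ht->tbl, ht);
|
| 	/* Another thread already resized/rehashed: give up our own
| 	 * attempt and let the caller retry the insertion on the new
| 	 * table instead of reporting -EBUSY.
| 	 */
| 	if (cur != tbl_used)
| 		return -EAGAIN;
|
| 	return schedule_grow_or_rehash(ht, cur); /* hypothetical helper */
| }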


Re: [PATCH v2 0/4] improve fault-tolerance of rhashtable runtime-test

2015-11-30 Thread Phil Sutter
On Mon, Nov 30, 2015 at 05:37:55PM +0800, Herbert Xu wrote:
> Phil Sutter  wrote:
> > The following series aims to improve lib/test_rhashtable in different
> > situations:
> > 
> > Patch 1 allows the kernel to reschedule so the test does not block too
> >long on slow systems.
> > Patch 2 fixes behaviour under pressure, retrying inserts in non-permanent
> >error case (-EBUSY).
> > Patch 3 auto-adjusts the upper table size limit according to the number
> >of threads (in concurrency test). In fact, the current default is
> >already too small.
> > Patch 4 makes it possible to retry inserts even in supposedly permanent
> >error case (-ENOMEM) to expose rhashtable's remaining problem of
> >-ENOMEM being not as permanent as it is expected to be.
> 
> I'm sorry but this patch series is simply bogus.

The whole series?!

> If rhashtable is indeed returning such errors under normal
> conditions then rhashtable is broken and we must fix it instead
> of working around it in the test code!

You're stating the obvious. Remember, the reason I prepared patch 4 was
because you wanted to fix just that bug in rhashtable in the first
place.

Just to make this clear: patches 1-3 are reasonable on their own; the
only connection to the bug is that patch 2 makes it visible (at least on
my system it wasn't before).

> FWIW I still haven't been able to reproduce this problem, perhaps
> because my machines have too few CPUs?

Did you try with my bogus patch series applied? How many CPUs does your
test system actually have?

> So can someone please help me reproduce this? Because just loading
> test_rhashtable isn't doing it.

As said, maybe you need to increase the number of spawned threads
(tcount=50 or so).

Cheers, Phil


Re: [RFC PATCH 2/2] Crypto kernel tls socket

2015-11-24 Thread Phil Sutter
Hi,

On Tue, Nov 24, 2015 at 12:20:00PM +0100, Hannes Frederic Sowa wrote:
> Stephan Mueller  writes:
> 
> > Am Dienstag, 24. November 2015, 18:34:55 schrieb Herbert Xu:
> >
> > Hi Herbert,
> >
> >>On Mon, Nov 23, 2015 at 09:43:02AM -0800, Dave Watson wrote:
> >>> Userspace crypto interface for TLS.  Currently supports gcm(aes) 128bit
> >>> only, however the interface is the same as the rest of the SOCK_ALG
> >>> interface, so it should be possible to add more without any user interface
> >>> changes.
> >>
> >>SOCK_ALG exists to export crypto algorithms to user-space.  So if
> >>we decided to support TLS as an algorithm then I guess this makes
> >>sense.
> >>
> >>However, I must say that it wouldn't have been my first pick.  I'd
> >>imagine a TLS socket to look more like a TCP socket, or perhaps a
> >>KCM socket as proposed by Tom.
> >
> > If I may ask: what is the benefit of having TLS in kernel space? I do not 
> > see 
> > any reason why higher-level protocols should be in the kernel as they do 
> > not 
> > relate to accessing hardware.
> 
> There are some crypto acclerators out there so that putting tls into the
> kernel would give a net benefit, because otherwise user space has to
> copy data into the kernel for device access and back to user space until
> it can finally be send out on the wire.
> 
> Since processors provide aesni and other crypto extensions as part of
> their instruction set architecture, this, of course, does not make sense
> any more.

There "still" are dedicated crypto engines out there which need a driver
to be accessed, so using them from userspace is not as simple as with
padlock or AESNI. This was the reasoning behind the various cryptodev
implementations and af_alg. Using those to establish a TLS connection
with OpenSSL means to fetch encrypted data to userspace first and then
feed it to the kernel again for decryption. Using cryptodev-linux, this
will be zero-copy, but still there's an additional context switch
involved which the approach here avoids.
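
To make that round trip concrete, here is a rough, generic userspace
sketch of driving an AF_ALG skcipher (not taken from this thread; an
ecb(aes) transform with a dummy all-zero key, error handling omitted).
The point is only that the data has to be copied down to the kernel and
the result read back up:

#include <unistd.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/if_alg.h>

#ifndef SOL_ALG
#define SOL_ALG 279
#endif

int main(void)
{
	struct sockaddr_alg sa = {
		.salg_family = AF_ALG,
		.salg_type   = "skcipher",
		.salg_name   = "ecb(aes)",
	};
	unsigned char key[16] = { 0 };	/* dummy all-zero key */
	unsigned char buf[16] = { 0 };	/* one AES block */
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	char cbuf[CMSG_SPACE(sizeof(__u32))] = { 0 };
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int tfmfd, opfd;

	tfmfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
	bind(tfmfd, (struct sockaddr *)&sa, sizeof(sa));
	setsockopt(tfmfd, SOL_ALG, ALG_SET_KEY, key, sizeof(key));
	opfd = accept(tfmfd, NULL, 0);

	/* copy the data down to the kernel ... */
	msg.msg_control = cbuf;
	msg.msg_controllen = sizeof(cbuf);
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_ALG;
	cmsg->cmsg_type = ALG_SET_OP;
	cmsg->cmsg_len = CMSG_LEN(sizeof(__u32));
	*(__u32 *)CMSG_DATA(cmsg) = ALG_OP_ENCRYPT;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	sendmsg(opfd, &msg, 0);

	/* ... and copy the result back up - the round trip described above */
	read(opfd, buf, sizeof(buf));

	close(opfd);
	close(tfmfd);
	return 0;
}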

Cheers, Phil


Re: [PATCH v2 4/4] rhashtable-test: allow to retry even if -ENOMEM was returned

2015-11-20 Thread Phil Sutter
On Fri, Nov 20, 2015 at 06:17:20PM +0100, Phil Sutter wrote:
> This is rather a hack to expose the current issue where rhashtable may,
> under high pressure, return -ENOMEM even though system memory is not
> exhausted and a subsequent insert may succeed.

Please note that this problem does not show every time when running the
test in default configuration on my system. With increased number of
threads though, it becomes very visible. Load test_rhashtable like so:

modprobe test_rhashtable enomem_retry=1 tcount=20

and grep dmesg for 'insertions retried after -ENOMEM'. In my case:

# dmesg | grep -E '(insertions retried after -ENOMEM|Started)' | tail
[   34.642980]  1 insertions retried after -ENOMEM
[   34.642989]  1 insertions retried after -ENOMEM
[   34.642994]  1 insertions retried after -ENOMEM
[   34.648353]  28294 insertions retried after -ENOMEM
[   34.689687]  31262 insertions retried after -ENOMEM
[   34.714015]  16280 insertions retried after -ENOMEM
[   34.736019]  15327 insertions retried after -ENOMEM
[   34.755100]  39012 insertions retried after -ENOMEM
[   34.769116]  49369 insertions retried after -ENOMEM
[   35.387200] Started 20 threads, 0 failed

Cheers, Phil


[PATCH v2 4/4] rhashtable-test: allow to retry even if -ENOMEM was returned

2015-11-20 Thread Phil Sutter
This is rather a hack to expose the current issue where rhashtable may,
under high pressure, return -ENOMEM even though system memory is not
exhausted and a subsequent insert may succeed.

Signed-off-by: Phil Sutter 
---
 lib/test_rhashtable.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 6fa77b3..270bf72 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -52,6 +52,10 @@ static int tcount = 10;
 module_param(tcount, int, 0);
 MODULE_PARM_DESC(tcount, "Number of threads to spawn (default: 10)");
 
+static bool enomem_retry = false;
+module_param(enomem_retry, bool, 0);
+MODULE_PARM_DESC(enomem_retry, "Retry insert even if -ENOMEM was returned 
(default: off)");
+
 struct test_obj {
int value;
struct rhash_head   node;
@@ -79,14 +83,22 @@ static struct semaphore startup_sem = 
__SEMAPHORE_INITIALIZER(startup_sem, 0);
 static int insert_retry(struct rhashtable *ht, struct rhash_head *obj,
 const struct rhashtable_params params)
 {
-   int err, retries = -1;
+   int err, retries = -1, enomem_retries = 0;
 
do {
retries++;
cond_resched();
err = rhashtable_insert_fast(ht, obj, params);
+   if (err == -ENOMEM && enomem_retry) {
+   enomem_retries++;
+   err = -EBUSY;
+   }
} while (err == -EBUSY);
 
+   if (enomem_retries)
+   pr_info(" %u insertions retried after -ENOMEM\n",
+   enomem_retries);
+
return err ? : retries;
 }
 
-- 
2.1.2



[PATCH v2 1/4] rhashtable-test: add cond_resched() to thread test

2015-11-20 Thread Phil Sutter
This should fix soft lockup bugs triggered on slow systems.

Signed-off-by: Phil Sutter 
---
 lib/test_rhashtable.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 8c1ad1c..63654e3 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -236,6 +236,8 @@ static int thread_lookup_test(struct thread_data *tdata)
   obj->value, key);
err++;
}
+
+   cond_resched();
}
return err;
 }
@@ -251,6 +253,7 @@ static int threadfunc(void *data)
 
for (i = 0; i < entries; i++) {
tdata->objs[i].value = (tdata->id << 16) | i;
+   cond_resched();
err = rhashtable_insert_fast(&ht, &tdata->objs[i].node,
 test_rht_params);
if (err == -ENOMEM || err == -EBUSY) {
@@ -285,6 +288,8 @@ static int threadfunc(void *data)
goto out;
}
tdata->objs[i].value = TEST_INSERT_FAIL;
+
+   cond_resched();
}
err = thread_lookup_test(tdata);
if (err) {
-- 
2.1.2



[PATCH v2 0/4] improve fault-tolerance of rhashtable runtime-test

2015-11-20 Thread Phil Sutter
The following series aims to improve lib/test_rhashtable in different
situations:

Patch 1 allows the kernel to reschedule so the test does not block too
long on slow systems.
Patch 2 fixes behaviour under pressure, retrying inserts in non-permanent
error case (-EBUSY).
Patch 3 auto-adjusts the upper table size limit according to the number
of threads (in concurrency test). In fact, the current default is
already too small.
Patch 4 makes it possible to retry inserts even in supposedly permanent
error case (-ENOMEM) to expose rhashtable's remaining problem of
-ENOMEM being not as permanent as it is expected to be.

Changes since v1:
- Introduce insert_retry() which is then used in single-threaded test as
  well.
- Do not retry inserts by default if -ENOMEM was returned.
- Rename the retry counter to be a bit more verbose about what it
  contains.
- Add patch 4 as a debugging aid.

Phil Sutter (4):
  rhashtable-test: add cond_resched() to thread test
  rhashtable-test: retry insert operations
  rhashtable-test: calculate max_entries value by default
  rhashtable-test: allow to retry even if -ENOMEM was returned

 lib/test_rhashtable.c | 76 +--
 1 file changed, 50 insertions(+), 26 deletions(-)

-- 
2.1.2



[PATCH v2 3/4] rhashtable-test: calculate max_entries value by default

2015-11-20 Thread Phil Sutter
A maximum table size of 64k entries is insufficient for the multiple
threads test even in default configuration (10 threads * 50,000 objects =
500,000 objects in total). Since we know how many objects will be
inserted, calculate the max size unless overridden by parameter.

Note that specifying the exact number of objects upon table init won't
suffice as that value is being rounded down to the next power of two -
anticipate this by rounding up to the next power of two beforehand.
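
For illustration, with the defaults of the threaded test this means
tcount * entries = 10 * 50,000 = 500,000: rounding up in advance yields
max_size = 524,288, whereas passing 500,000 directly would be rounded
down to 262,144 by rhashtable and again be too small.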

Signed-off-by: Phil Sutter 
---
 lib/test_rhashtable.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index cfc3440..6fa77b3 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -36,9 +36,9 @@ static int runs = 4;
 module_param(runs, int, 0);
 MODULE_PARM_DESC(runs, "Number of test runs per variant (default: 4)");
 
-static int max_size = 65536;
+static int max_size = 0;
 module_param(max_size, int, 0);
-MODULE_PARM_DESC(runs, "Maximum table size (default: 65536)");
+MODULE_PARM_DESC(runs, "Maximum table size (default: calculated)");
 
 static bool shrinking = false;
 module_param(shrinking, bool, 0);
@@ -321,7 +321,7 @@ static int __init test_rht_init(void)
entries = min(entries, MAX_ENTRIES);
 
test_rht_params.automatic_shrinking = shrinking;
-   test_rht_params.max_size = max_size;
+   test_rht_params.max_size = max_size ? : roundup_pow_of_two(entries);
test_rht_params.nelem_hint = size;
 
pr_info("Running rhashtable test nelem=%d, max_size=%d, shrinking=%d\n",
@@ -367,6 +367,8 @@ static int __init test_rht_init(void)
return -ENOMEM;
}
 
+   test_rht_params.max_size = max_size ? :
+  roundup_pow_of_two(tcount * entries);
err = rhashtable_init(, _rht_params);
if (err < 0) {
pr_warn("Test failed: Unable to initialize hashtable: %d\n",
-- 
2.1.2



[PATCH v2 2/4] rhashtable-test: retry insert operations

2015-11-20 Thread Phil Sutter
After adding cond_resched() calls to threadfunc(), a surprisingly high
rate of insert failures occurred probably due to table resizes getting a
better chance to run in background. To not soften up the remaining
tests, retry inserts until they either succeed or fail permanently.

Also change the non-threaded test to retry insert operations, too.

Suggested-by: Thomas Graf 
Signed-off-by: Phil Sutter 
---
 lib/test_rhashtable.c | 53 ---
 1 file changed, 29 insertions(+), 24 deletions(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 63654e3..cfc3440 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -76,6 +76,20 @@ static struct rhashtable_params test_rht_params = {
 static struct semaphore prestart_sem;
 static struct semaphore startup_sem = __SEMAPHORE_INITIALIZER(startup_sem, 0);
 
+static int insert_retry(struct rhashtable *ht, struct rhash_head *obj,
+const struct rhashtable_params params)
+{
+   int err, retries = -1;
+
+   do {
+   retries++;
+   cond_resched();
+   err = rhashtable_insert_fast(ht, obj, params);
+   } while (err == -EBUSY);
+
+   return err ? : retries;
+}
+
 static int __init test_rht_lookup(struct rhashtable *ht)
 {
unsigned int i;
@@ -157,7 +171,7 @@ static s64 __init test_rhashtable(struct rhashtable *ht)
 {
struct test_obj *obj;
int err;
-   unsigned int i, insert_fails = 0;
+   unsigned int i, insert_retries = 0;
s64 start, end;
 
/*
@@ -170,22 +184,16 @@ static s64 __init test_rhashtable(struct rhashtable *ht)
struct test_obj *obj = &array[i];
 
obj->value = i * 2;
-
-   err = rhashtable_insert_fast(ht, &obj->node, test_rht_params);
-   if (err == -ENOMEM || err == -EBUSY) {
-   /* Mark failed inserts but continue */
-   obj->value = TEST_INSERT_FAIL;
-   insert_fails++;
-   } else if (err) {
+   err = insert_retry(ht, &obj->node, test_rht_params);
+   if (err > 0)
+   insert_retries += err;
+   else if (err)
return err;
-   }
-
-   cond_resched();
}
 
-   if (insert_fails)
-   pr_info("  %u insertions failed due to memory pressure\n",
-   insert_fails);
+   if (insert_retries)
+   pr_info("  %u insertions retried due to memory pressure\n",
+   insert_retries);
 
test_bucket_stats(ht);
rcu_read_lock();
@@ -244,7 +252,7 @@ static int thread_lookup_test(struct thread_data *tdata)
 
 static int threadfunc(void *data)
 {
-   int i, step, err = 0, insert_fails = 0;
+   int i, step, err = 0, insert_retries = 0;
struct thread_data *tdata = data;
 
up(&prestart_sem);
@@ -253,21 +261,18 @@ static int threadfunc(void *data)
 
for (i = 0; i < entries; i++) {
tdata->objs[i].value = (tdata->id << 16) | i;
-   cond_resched();
-   err = rhashtable_insert_fast(&ht, &tdata->objs[i].node,
-test_rht_params);
-   if (err == -ENOMEM || err == -EBUSY) {
-   tdata->objs[i].value = TEST_INSERT_FAIL;
-   insert_fails++;
+   err = insert_retry(&ht, &tdata->objs[i].node, test_rht_params);
+   if (err > 0) {
+   insert_retries += err;
} else if (err) {
pr_err("  thread[%d]: rhashtable_insert_fast failed\n",
   tdata->id);
goto out;
}
}
-   if (insert_fails)
-   pr_info("  thread[%d]: %d insert failures\n",
-   tdata->id, insert_fails);
+   if (insert_retries)
+   pr_info("  thread[%d]: %u insertions retried due to memory 
pressure\n",
+   tdata->id, insert_retries);
 
err = thread_lookup_test(tdata);
if (err) {
-- 
2.1.2



Re: rhashtable: how to deal with that rhashtable_lookup_insert_key return -EBUSY

2015-11-20 Thread Phil Sutter
On Fri, Nov 20, 2015 at 01:14:18PM +0800, Xin Long wrote:
> when I use rhashtable_lookup_insert_key, sometimes it will return -EBUSY.
> im not sure if there is a good way to workabout it.
> or I should just try again and again until it's inserted successfully ?
> 
> I have seen some use in kernel  by now, but it seems that no one consider
> this issue for their cases. but it indeed exists in my case.
> 
> did I use it incorrectly or something else ?

AFAIK, insert returning -EBUSY is a situation users have to be aware of
and retry the insert. I sent a patch[1] to fix this in test_rhashtable.
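
Roughly, such a retry could look like the sketch below (my_insert() is
just a made-up wrapper name; it retries only -EBUSY, treats anything
else - including -ENOMEM and -EEXIST - as a real error, and the
cond_resched() of course assumes process context):

#include <linux/rhashtable.h>
#include <linux/sched.h>

static int my_insert(struct rhashtable *ht, const void *key,
		     struct rhash_head *obj,
		     const struct rhashtable_params params)
{
	int err;

	do {
		cond_resched();
		err = rhashtable_lookup_insert_key(ht, key, obj, params);
	} while (err == -EBUSY);	/* non-permanent: table full, rehash pending */

	return err;
}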

That patch though retried in case of -ENOMEM as well, which was
considered wrong to do and therefore it wasn't accepted. But in my test
runs, -ENOMEM happened quite frequently and it also wasn't a permanent
error. For details, see the following discussion[2].

Herbert, did you manage to reproduce the problem meanwhile? If so, was
there any progress on fixing rhashtable? Otherwise, I could respin my
patch from [1] to cover only -EBUSY case by default and add a parameter
to make non-permanent -ENOMEM visible.

Cheers, Phil

[1]: https://lkml.org/lkml/2015/8/28/197
[2]: https://lkml.org/lkml/2015/8/28/281
> 
> Thanks.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-09-10 Thread Phil Sutter
On Thu, Sep 10, 2015 at 04:03:44PM +0800, Herbert Xu wrote:
> On Tue, Sep 01, 2015 at 04:51:24PM +0200, Thomas Graf wrote:
> > 
> > 1. The current in-kernel self-test
> > 2. bind_netlink.c: https://github.com/tgraf/rhashtable
> 
> I can't reproduce it:

I can't speak for netlink, but if you apply patch 1/3 from this thread,
test_rhashtable.c starts generating many insert failures during the
multiple thread test.
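
Concretely, something like the following should make them visible,
grepping for the failure message the test module already prints:

modprobe test_rhashtable
dmesg | grep 'insert failures'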

Cheers, Phil


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-09-01 Thread Phil Sutter
On Tue, Sep 01, 2015 at 09:50:19PM +0800, Herbert Xu wrote:
> On Tue, Sep 01, 2015 at 03:43:11PM +0200, Phil Sutter wrote:
> >
> > Hmm. Since memory allocation is first tried with GFP_ATOMIC set and upon
> > failure retried in background, this seems like a situation which might
> > happen during normal use. If that already indicates a severe problem,
> > why retry in background at all?
> 
> It should be tried in the background first at 70% and only when
> that fails would we hit the 100% case and then we will try it
> with GFP_ATOMIC.  If that fails then the insertion will fail.

Ah, good to know. Thanks for clarifying this!

Looking at rhashtable_test.c, I see the initial table size is 8 entries.
70% of that is 5.6 entries, so background expansion is started after the
6th entry has been added, right? Given there are 10 threads running
which try to insert 50k entries at the same time, I don't think it's
unlikely that three more entries are inserted before the background
expansion completes.

Cheers, Phil


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-09-01 Thread Phil Sutter
On Tue, Sep 01, 2015 at 09:00:57PM +0800, Herbert Xu wrote:
> On Tue, Sep 01, 2015 at 02:46:48PM +0200, Phil Sutter wrote:
> >
> > This is not an inherent behaviour of the implementation but general
> > agreement. The insertion may fail non-permanently (returning -EBUSY),
> > users are expected to handle this by retrying the operation.
> 
> Absolutely not.  The only reason for an insertion to fail is if we
> can't allocate enough memory.  Unless the user is also looping its
> kmalloc calls it definitely shouldn't be retrying the insert.

rhashtable_insert_fast() returns -EBUSY if the table is full
(rht_grow_above_100() returns true) and an asynchronous rehash operation
is active. AFAICT, this is not necessarily caused by memory pressure.

> If an expansion fails it means either that the system is suffering
> a catastrophic memory shortage, or the user of rhashtable is doing
> something wrong.

Hmm. Since memory allocation is first tried with GFP_ATOMIC set and upon
failure retried in background, this seems like a situation which might
happen during normal use. If that already indicates a severe problem,
why retry in background at all?

Cheers, Phil


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-09-01 Thread Phil Sutter
On Tue, Sep 01, 2015 at 07:43:00PM +0800, Herbert Xu wrote:
> On Mon, Aug 31, 2015 at 01:00:12PM +0200, Phil Sutter wrote:
> > 
> > The variable would be used to track if the worker has failed to allocate
> > memory in background.
> > 
> > Since the failing insertion will be retried, subsequent inserts are not
> > necessary unrelated.
> 
> If an insertion fails it is never retried.  The only thing that is
> retried is the expansion of the table.  So I have no idea what
> you are talking about.

This is not an inherent behaviour of the implementation but general
agreement. The insertion may fail non-permanently (returning -EBUSY),
users are expected to handle this by retrying the operation.

Cheers, Phil


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-08-31 Thread Phil Sutter
On Sun, Aug 30, 2015 at 03:47:17PM +0800, Herbert Xu wrote:
> Phil Sutter  wrote:
> >
> > Should we introduce a new field to struct rhashtable to track the
> > internal state? This might allow to clean up some rather obscure tests,
> > e.g. whether a table resize is in progress or not.
> 
> Why would we want to do that? The deferred expansion is done
> on a best effort basis so its failure has nothing to do with
> the failure of a subsequent insertion.

The variable would be used to track if the worker has failed to allocate
memory in background.

Since the failing insertion will be retried, subsequent inserts are not
necessarily unrelated.

> The insertion must have tried its own last-ditch synchronous
> expansion and only fail if that fails.

Who do you mean by "the insertion"? The user or the API?

Cheers, Phil


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-08-29 Thread Phil Sutter
On Sat, Aug 29, 2015 at 12:43:03AM +0200, Thomas Graf wrote:
> On 08/28/15 at 03:34pm, Phil Sutter wrote:
> > Quite ugly, IMHO: rhashtable_insert_fast() may return -ENOMEM as
> > non-permanent error, if allocation in GFP_ATOMIC failed. In this case,
> > allocation in GFP_KERNEL is retried by rht_deferred_worker(). Sadly,
> > there is no way to determine if that has already been tried and failed.
> > 
> > The thread test triggers GFP_ATOMIC allocation failure quite easily, so
> > I can't really just ignore this issue. :)
> 
> Return EBUSY or ENOBUFS in the non-permanent case? It is definitely
> helpful if the API allows to differ between permanent and
> non-permanent errors.

Yes, indeed. Therefore rht_deferred_worker() needs to check the return
value of rhashtable_expand(). The question is how to propagate the error
condition, as the worker's return value is not being kept track of
(function returns void even).

Should we introduce a new field to struct rhashtable to track the
internal state? This might allow to clean up some rather obscure tests,
e.g. whether a table resize is in progress or not.

Cheers, Phil


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-08-28 Thread Phil Sutter
On Fri, Aug 28, 2015 at 01:13:20PM +0200, Phil Sutter wrote:
> On Fri, Aug 28, 2015 at 01:09:29PM +0200, Thomas Graf wrote:
> > On 08/28/15 at 12:28pm, Phil Sutter wrote:
> > > After adding cond_resched() calls to threadfunc(), a surprisingly high
> > > rate of insert failures occurred probably due to table resizes getting a
> > > better chance to run in background. To not soften up the remaining
> > > tests, retry inserts until they either succeed or fail permanently.
> > > 
> > > Signed-off-by: Phil Sutter 
> > > ---
> > >  lib/test_rhashtable.c | 13 +++--
> > >  1 file changed, 7 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
> > > index 63654e3..093cf84 100644
> > > --- a/lib/test_rhashtable.c
> > > +++ b/lib/test_rhashtable.c
> > > @@ -244,7 +244,7 @@ static int thread_lookup_test(struct thread_data 
> > > *tdata)
> > >  
> > >  static int threadfunc(void *data)
> > >  {
> > > - int i, step, err = 0, insert_fails = 0;
> > > + int i, step, err = 0, retries = 0;
> > >   struct thread_data *tdata = data;
> > >  
> > >   up(&prestart_sem);
> > > @@ -253,21 +253,22 @@ static int threadfunc(void *data)
> > >  
> > >   for (i = 0; i < entries; i++) {
> > >   tdata->objs[i].value = (tdata->id << 16) | i;
> > > +insert_retry:
> > >   cond_resched();
> > >   err = rhashtable_insert_fast(&ht, &tdata->objs[i].node,
> > >test_rht_params);
> > >   if (err == -ENOMEM || err == -EBUSY) {
> > > - tdata->objs[i].value = TEST_INSERT_FAIL;
> > > - insert_fails++;
> > > + retries++;
> > > + goto insert_retry;
> > 
> > Is it safe to retry indefinitely on ENOMEM? Retrying on EBUSY is
> > definitely an improvement and we should do the same in the non
> > threaded test as well.
> 
> Oh yes, that is definitely a bug. I will respin and add the same for the
> normal test, too.

Quite ugly, IMHO: rhashtable_insert_fast() may return -ENOMEM as
non-permanent error, if allocation in GFP_ATOMIC failed. In this case,
allocation in GFP_KERNEL is retried by rht_deferred_worker(). Sadly,
there is no way to determine if that has already been tried and failed.

The thread test triggers GFP_ATOMIC allocation failure quite easily, so
I can't really just ignore this issue. :)

Cheers, Phil


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-08-28 Thread Phil Sutter
On Fri, Aug 28, 2015 at 01:09:29PM +0200, Thomas Graf wrote:
> On 08/28/15 at 12:28pm, Phil Sutter wrote:
> > After adding cond_resched() calls to threadfunc(), a surprisingly high
> > rate of insert failures occurred probably due to table resizes getting a
> > better chance to run in background. To not soften up the remaining
> > tests, retry inserts until they either succeed or fail permanently.
> > 
> > Signed-off-by: Phil Sutter 
> > ---
> >  lib/test_rhashtable.c | 13 +++--
> >  1 file changed, 7 insertions(+), 6 deletions(-)
> > 
> > diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
> > index 63654e3..093cf84 100644
> > --- a/lib/test_rhashtable.c
> > +++ b/lib/test_rhashtable.c
> > @@ -244,7 +244,7 @@ static int thread_lookup_test(struct thread_data *tdata)
> >  
> >  static int threadfunc(void *data)
> >  {
> > -   int i, step, err = 0, insert_fails = 0;
> > +   int i, step, err = 0, retries = 0;
> > struct thread_data *tdata = data;
> >  
> > up(&prestart_sem);
> > @@ -253,21 +253,22 @@ static int threadfunc(void *data)
> >  
> > for (i = 0; i < entries; i++) {
> > tdata->objs[i].value = (tdata->id << 16) | i;
> > +insert_retry:
> > cond_resched();
> > err = rhashtable_insert_fast(&ht, &tdata->objs[i].node,
> >  test_rht_params);
> > if (err == -ENOMEM || err == -EBUSY) {
> > -   tdata->objs[i].value = TEST_INSERT_FAIL;
> > -   insert_fails++;
> > +   retries++;
> > +   goto insert_retry;
> 
> Is it safe to retry indefinitely on ENOMEM? Retrying on EBUSY is
> definitely an improvement and we should do the same in the non
> threaded test as well.

Oh yes, that is definitely a bug. I will respin and add the same for the
normal test, too.

Thanks, Phil


[PATCH 3/3] rhashtable-test: calculate max_entries value by default

2015-08-28 Thread Phil Sutter
A maximum table size of 64k entries is insufficient for the multiple
threads test even in default configuration (10 threads * 50,000 objects =
500,000 objects in total). Since we know how many objects will be
inserted, calculate the max size unless overridden by parameter.

Note that specifying the exact number of objects upon table init won't
suffice as that value is being rounded down to the next power of two -
anticipate this by rounding up to the next power of two beforehand.

Signed-off-by: Phil Sutter 
---
 lib/test_rhashtable.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 093cf84..73fcb8e 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -36,9 +36,9 @@ static int runs = 4;
 module_param(runs, int, 0);
 MODULE_PARM_DESC(runs, "Number of test runs per variant (default: 4)");
 
-static int max_size = 65536;
+static int max_size = 0;
 module_param(max_size, int, 0);
-MODULE_PARM_DESC(runs, "Maximum table size (default: 65536)");
+MODULE_PARM_DESC(runs, "Maximum table size (default: calculated)");
 
 static bool shrinking = false;
 module_param(shrinking, bool, 0);
@@ -317,7 +317,7 @@ static int __init test_rht_init(void)
entries = min(entries, MAX_ENTRIES);
 
test_rht_params.automatic_shrinking = shrinking;
-   test_rht_params.max_size = max_size;
+   test_rht_params.max_size = max_size ? : roundup_pow_of_two(entries);
test_rht_params.nelem_hint = size;
 
pr_info("Running rhashtable test nelem=%d, max_size=%d, shrinking=%d\n",
@@ -363,6 +363,8 @@ static int __init test_rht_init(void)
return -ENOMEM;
}
 
+   test_rht_params.max_size = max_size ? :
+  roundup_pow_of_two(tcount * entries);
err = rhashtable_init(, _rht_params);
if (err < 0) {
pr_warn("Test failed: Unable to initialize hashtable: %d\n",
-- 
2.1.2



[PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-08-28 Thread Phil Sutter
After adding cond_resched() calls to threadfunc(), a surprisingly high
rate of insert failures occurred probably due to table resizes getting a
better chance to run in background. To not soften up the remaining
tests, retry inserts until they either succeed or fail permanently.

Signed-off-by: Phil Sutter 
---
 lib/test_rhashtable.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 63654e3..093cf84 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -244,7 +244,7 @@ static int thread_lookup_test(struct thread_data *tdata)
 
 static int threadfunc(void *data)
 {
-   int i, step, err = 0, insert_fails = 0;
+   int i, step, err = 0, retries = 0;
struct thread_data *tdata = data;
 
up(&prestart_sem);
@@ -253,21 +253,22 @@ static int threadfunc(void *data)
 
for (i = 0; i < entries; i++) {
tdata->objs[i].value = (tdata->id << 16) | i;
+insert_retry:
cond_resched();
err = rhashtable_insert_fast(&ht, &tdata->objs[i].node,
 test_rht_params);
if (err == -ENOMEM || err == -EBUSY) {
-   tdata->objs[i].value = TEST_INSERT_FAIL;
-   insert_fails++;
+   retries++;
+   goto insert_retry;
} else if (err) {
pr_err("  thread[%d]: rhashtable_insert_fast failed\n",
   tdata->id);
goto out;
}
}
-   if (insert_fails)
-   pr_info("  thread[%d]: %d insert failures\n",
-   tdata->id, insert_fails);
+   if (retries)
+   pr_info("  thread[%d]: retried insert operation %d times\n",
+   tdata->id, retries);
 
err = thread_lookup_test(tdata);
if (err) {
-- 
2.1.2



[PATCH 1/3] rhashtable-test: add cond_resched() to thread test

2015-08-28 Thread Phil Sutter
This should fix soft lockup bugs triggered on slow systems.

Signed-off-by: Phil Sutter 
---
 lib/test_rhashtable.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 8c1ad1c..63654e3 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -236,6 +236,8 @@ static int thread_lookup_test(struct thread_data *tdata)
   obj->value, key);
err++;
}
+
+   cond_resched();
}
return err;
 }
@@ -251,6 +253,7 @@ static int threadfunc(void *data)
 
for (i = 0; i < entries; i++) {
tdata->objs[i].value = (tdata->id << 16) | i;
+   cond_resched();
err = rhashtable_insert_fast(&ht, &tdata->objs[i].node,
 test_rht_params);
if (err == -ENOMEM || err == -EBUSY) {
@@ -285,6 +288,8 @@ static int threadfunc(void *data)
goto out;
}
tdata->objs[i].value = TEST_INSERT_FAIL;
+
+   cond_resched();
}
err = thread_lookup_test(tdata);
if (err) {
-- 
2.1.2



[PATCH 3/3] rhashtable-test: calculate max_entries value by default

2015-08-28 Thread Phil Sutter
A maximum table size of 64k entries is insufficient for the multiple
threads test even in default configuration (10 threads * 50000 objects =
500000 objects in total). Since we know how many objects will be
inserted, calculate the max size unless overridden by parameter.

Note that specifying the exact number of objects upon table init won't
suffice as that value is being rounded down to the next power of two -
anticipate this by rounding up to the next power of two beforehand.

Signed-off-by: Phil Sutter p...@nwl.cc
---
 lib/test_rhashtable.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 093cf84..73fcb8e 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -36,9 +36,9 @@ static int runs = 4;
 module_param(runs, int, 0);
 MODULE_PARM_DESC(runs, "Number of test runs per variant (default: 4)");
 
-static int max_size = 65536;
+static int max_size = 0;
 module_param(max_size, int, 0);
-MODULE_PARM_DESC(runs, "Maximum table size (default: 65536)");
+MODULE_PARM_DESC(runs, "Maximum table size (default: calculated)");
 
 static bool shrinking = false;
 module_param(shrinking, bool, 0);
@@ -317,7 +317,7 @@ static int __init test_rht_init(void)
entries = min(entries, MAX_ENTRIES);
 
test_rht_params.automatic_shrinking = shrinking;
-   test_rht_params.max_size = max_size;
+   test_rht_params.max_size = max_size ? : roundup_pow_of_two(entries);
test_rht_params.nelem_hint = size;
 
 	pr_info("Running rhashtable test nelem=%d, max_size=%d, shrinking=%d\n",
@@ -363,6 +363,8 @@ static int __init test_rht_init(void)
return -ENOMEM;
}
 
+   test_rht_params.max_size = max_size ? :
+  roundup_pow_of_two(tcount * entries);
err = rhashtable_init(ht, test_rht_params);
 	if (err < 0) {
 		pr_warn("Test failed: Unable to initialize hashtable: %d\n",
-- 
2.1.2



Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-08-28 Thread Phil Sutter
On Fri, Aug 28, 2015 at 01:09:29PM +0200, Thomas Graf wrote:
> On 08/28/15 at 12:28pm, Phil Sutter wrote:
> > After adding cond_resched() calls to threadfunc(), a surprisingly high
> > rate of insert failures occurred probably due to table resizes getting a
> > better chance to run in background. To not soften up the remaining
> > tests, retry inserts until they either succeed or fail permanently.
> > 
> > Signed-off-by: Phil Sutter p...@nwl.cc
> > ---
> >  lib/test_rhashtable.c | 13 +++++++------
> >  1 file changed, 7 insertions(+), 6 deletions(-)
> > 
> > diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
> > index 63654e3..093cf84 100644
> > --- a/lib/test_rhashtable.c
> > +++ b/lib/test_rhashtable.c
> > @@ -244,7 +244,7 @@ static int thread_lookup_test(struct thread_data *tdata)
> >  
> >  static int threadfunc(void *data)
> >  {
> > -	int i, step, err = 0, insert_fails = 0;
> > +	int i, step, err = 0, retries = 0;
> >  	struct thread_data *tdata = data;
> >  
> >  	up(&prestart_sem);
> > @@ -253,21 +253,22 @@ static int threadfunc(void *data)
> >  
> >  	for (i = 0; i < entries; i++) {
> >  		tdata->objs[i].value = (tdata->id << 16) | i;
> > +insert_retry:
> >  		cond_resched();
> >  		err = rhashtable_insert_fast(&ht, &tdata->objs[i].node,
> >  					     test_rht_params);
> >  		if (err == -ENOMEM || err == -EBUSY) {
> > -			tdata->objs[i].value = TEST_INSERT_FAIL;
> > -			insert_fails++;
> > +			retries++;
> > +			goto insert_retry;
> 
> Is it safe to retry indefinitely on ENOMEM? Retrying on EBUSY is
> definitely an improvement and we should do the same in the non
> threaded test as well.

Oh yes, that is definitely a bug. I will respin and add the same for the
normal test, too.

Thanks, Phil


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-08-28 Thread Phil Sutter
On Fri, Aug 28, 2015 at 01:13:20PM +0200, Phil Sutter wrote:
> On Fri, Aug 28, 2015 at 01:09:29PM +0200, Thomas Graf wrote:
> > On 08/28/15 at 12:28pm, Phil Sutter wrote:
> > > After adding cond_resched() calls to threadfunc(), a surprisingly high
> > > rate of insert failures occurred probably due to table resizes getting a
> > > better chance to run in background. To not soften up the remaining
> > > tests, retry inserts until they either succeed or fail permanently.
> > > 
> > > Signed-off-by: Phil Sutter p...@nwl.cc
> > > ---
> > >  lib/test_rhashtable.c | 13 +++++++------
> > >  1 file changed, 7 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
> > > index 63654e3..093cf84 100644
> > > --- a/lib/test_rhashtable.c
> > > +++ b/lib/test_rhashtable.c
> > > @@ -244,7 +244,7 @@ static int thread_lookup_test(struct thread_data *tdata)
> > >  
> > >  static int threadfunc(void *data)
> > >  {
> > > -	int i, step, err = 0, insert_fails = 0;
> > > +	int i, step, err = 0, retries = 0;
> > >  	struct thread_data *tdata = data;
> > >  
> > >  	up(&prestart_sem);
> > > @@ -253,21 +253,22 @@ static int threadfunc(void *data)
> > >  
> > >  	for (i = 0; i < entries; i++) {
> > >  		tdata->objs[i].value = (tdata->id << 16) | i;
> > > +insert_retry:
> > >  		cond_resched();
> > >  		err = rhashtable_insert_fast(&ht, &tdata->objs[i].node,
> > >  					     test_rht_params);
> > >  		if (err == -ENOMEM || err == -EBUSY) {
> > > -			tdata->objs[i].value = TEST_INSERT_FAIL;
> > > -			insert_fails++;
> > > +			retries++;
> > > +			goto insert_retry;
> > 
> > Is it safe to retry indefinitely on ENOMEM? Retrying on EBUSY is
> > definitely an improvement and we should do the same in the non
> > threaded test as well.
> 
> Oh yes, that is definitely a bug. I will respin and add the same for the
> normal test, too.

Quite ugly, IMHO: rhashtable_insert_fast() may return -ENOMEM as
non-permanent error, if allocation in GFP_ATOMIC failed. In this case,
allocation in GFP_KERNEL is retried by rht_deferred_worker(). Sadly,
there is no way to determine if that has already been tried and failed.

The thread test triggers GFP_ATOMIC allocation failure quite easily, so
I can't really just ignore this issue. :)
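
A bounded retry would be one way out, roughly along these lines (a sketch
only, not the respin itself; the bound of 5 and the enomem_retries counter
are made up for illustration):

		int enomem_retries = 0;

		do {
			cond_resched();
			err = rhashtable_insert_fast(&ht, &tdata->objs[i].node,
						     test_rht_params);
			/* retry -EBUSY indefinitely, give up on repeated -ENOMEM */
			if (err == -ENOMEM && ++enomem_retries > 5)
				break;
		} while (err == -EBUSY || err == -ENOMEM);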

Cheers, Phil


Re: [rhashtable-test] EIP is at lock_is_held

2015-08-26 Thread Phil Sutter
Hi,

(Full-quoting here due to added mailing lists.)

Looks like this is a problem of slow systems. I will try to reproduce
and come up with a similar fix as in commit 685a015 ("rhashtable: Allow
other tasks to be scheduled in large lookup loops").

Thanks for reporting,

Phil

On Mon, Aug 24, 2015 at 12:40:43PM +0800, Fengguang Wu wrote:
> Greetings,
> 
> 0day kernel testing robot got the below dmesg and the first bad commit is
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git master
> 
> commit f4a3e90ba5739cfd761b6befadae9728bd3641ed
> Author: Phil Sutter 
> AuthorDate: Sat Aug 15 00:37:15 2015 +0200
> Commit: David S. Miller 
> CommitDate: Mon Aug 17 14:33:47 2015 -0700
> 
> rhashtable-test: extend to test concurrency
> 
> After having tested insertion, lookup, table walk and removal, spawn a
> number of threads running operations on the same rhashtable. Each of
> them will:
> 
> 1) insert it's own set of objects,
> 2) lookup every successfully inserted object and finally
> 3) remove objects in several rounds until all of them have been removed,
>making sure the remaining ones are still found after each round.
> 
> This should put a good amount of load onto the system and due to
> synchronising thread startup via two semaphores also extensive
> concurrent table access.
> 
> The default number of ten threads returned within half a second on my
> local VM with two cores. Running 200 threads took about four seconds. If
> slow systems suffer too much from this though, the default could be
> lowered or even set to zero so this extended test does not run at all by
> default.
> 
> Signed-off-by: Phil Sutter 
> Acked-by: Thomas Graf 
> Signed-off-by: David S. Miller 
> 
> |  | c1f066d4ee | f4a3e90ba5 | 6967aa466b |
> | boot_successes | 1060 | 808 | 106 |
> | boot_failures | 1 | 102 | 21 |
> | INFO:possible_circular_locking_dependency_detected | 1 |  |  |
> | backtrace:vfs_readv | 1 |  |  |
> | backtrace:SyS_readv | 1 |  |  |
> | backtrace:blk_mq_sysfs_unregister | 1 |  |  |
> | backtrace:blk_mq_queue_reinit_notify | 1 |  |  |
> | backtrace:debug_hotplug_cpu | 1 |  |  |
> | backtrace:kernel_init_freeable | 1 |  |  |
> | EIP_is_at_lock_is_held | 0 | 63 | 16 |
> | Kernel_panic-not_syncing:softlockup:hung_tasks | 0 | 102 | 19 |
> | backtrace:threadfunc | 0 | 101 | 19 |
> | EIP_is_at_rcu_read_lock_held | 0 | 10 | 5 |
> | EIP_is_at_rcu_lockdep_current_cpu_online | 0 | 9 |  |
> | EIP_is_at_thread_lookup_test | 0 | 11 | 1 |
> | EIP_is_at_lock_release | 0 | 3 |  |
> | EIP_is_at_lockdep_rht_bucket_is_held | 0 | 3 |  |
> | EIP_is_at_rcu_is_watching | 0 | 5 |  |
> | backtrace:apic_timer_interrupt | 0 | 3 |  |
> | EIP_is_at_lock_acquire | 0 | 12 | 2 |
> | EIP_is_at_jhash | 0 | 11 |  |
> | EIP_is_at_raw_spin_lock_bh | 0 | 1 |  |
> | EIP_is_at_debug_lockdep_rcu_enabled | 0 | 13 |  |
> | EIP_is_at_lockdep_rht_mutex_is_held | 0 | 3 | 2 |
> | EIP_is_at_raw_spin_unlock_bh | 0 | 1 |  |
> | EIP_is_at_do_raw_spin_lock | 0 | 1 |  |
> | EIP_is_at__local_bh_enable_ip | 0 | 1 |  |
> | EIP_is_at_threadfunc | 0 | 1 |  |
> | IP-Config:Auto-configuration_of_network_failed | 0 | 0 | 2 |


Re: [PATCH] rhashtable-test: extend to test concurrency

2015-08-16 Thread Phil Sutter
On Sun, Aug 16, 2015 at 08:12:35PM +0200, Florian Westphal wrote:
> Phil Sutter  wrote:
> > After having tested insertion, lookup, table walk and removal, spawn a
> > number of threads running operations on the same rhashtable. Each of
> > them will:
> 
> [..]
> 
> > +	if (down_interruptible(&startup_sem))
> > +   pr_err("  thread[%d]: down_interruptible failed\n", tdata->id);
> 
> Why _interruptible?
> 
> Seems this should use down() instead.

According to the comment in kernel/locking/semaphore.c, down() is
deprecated and one should use down_interruptible() or down_killable()
instead. Apart from that, I don't see any problem with using down()
here. If the call fails, the code is pointless if not even broken
anyway.

Cheers, Phil




[PATCH] rhashtable-test: extend to test concurrency

2015-08-14 Thread Phil Sutter
After having tested insertion, lookup, table walk and removal, spawn a
number of threads running operations on the same rhashtable. Each of
them will:

1) insert its own set of objects,
2) lookup every successfully inserted object and finally
3) remove objects in several rounds until all of them have been removed,
   making sure the remaining ones are still found after each round.

This should put a good amount of load onto the system and due to
synchronising thread startup via two semaphores also extensive
concurrent table access.
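
Condensed, the intended handshake looks like this (a sketch only, using
the same semaphore names as the patch below; prestart_sem starts at
1 - tcount so the module's down() is meant to complete only once every
thread has checked in, and startup_sem is then released tcount times to
start them all at once):

	/* module init */
	sema_init(&prestart_sem, 1 - tcount);
	/* ... spawn tcount threads via kthread_run() ... */
	down_interruptible(&prestart_sem);	/* intended: all threads checked in */
	for (i = 0; i < tcount; i++)
		up(&startup_sem);		/* give every thread the go signal */

	/* each thread, before touching the table */
	up(&prestart_sem);			/* check in */
	down_interruptible(&startup_sem);	/* block until released above */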

The default number of ten threads returned within half a second on my
local VM with two cores. Running 200 threads took about four seconds. If
slow systems suffer too much from this though, the default could be
lowered or even set to zero so this extended test does not run at all by
default.

Signed-off-by: Phil Sutter 
---
 lib/test_rhashtable.c | 155 +-
 1 file changed, 154 insertions(+), 1 deletion(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 9af7cef..a26d76f 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -16,9 +16,11 @@
 #include <linux/init.h>
 #include <linux/jhash.h>
 #include <linux/kernel.h>
+#include <linux/kthread.h>
 #include <linux/module.h>
 #include <linux/rcupdate.h>
 #include <linux/rhashtable.h>
+#include <linux/semaphore.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
 
@@ -45,11 +47,21 @@ static int size = 8;
 module_param(size, int, 0);
 MODULE_PARM_DESC(size, "Initial size hint of table (default: 8)");
 
+static int tcount = 10;
+module_param(tcount, int, 0);
+MODULE_PARM_DESC(tcount, "Number of threads to spawn (default: 10)");
+
 struct test_obj {
int value;
struct rhash_head   node;
 };
 
+struct thread_data {
+   int id;
+   struct task_struct *task;
+   struct test_obj *objs;
+};
+
 static struct test_obj array[MAX_ENTRIES];
 
 static struct rhashtable_params test_rht_params = {
@@ -60,6 +72,9 @@ static struct rhashtable_params test_rht_params = {
.nulls_base = (3U << RHT_BASE_SHIFT),
 };
 
+static struct semaphore prestart_sem;
+static struct semaphore startup_sem = __SEMAPHORE_INITIALIZER(startup_sem, 0);
+
 static int __init test_rht_lookup(struct rhashtable *ht)
 {
unsigned int i;
@@ -200,10 +215,97 @@ static s64 __init test_rhashtable(struct rhashtable *ht)
 
 static struct rhashtable ht;
 
+static int thread_lookup_test(struct thread_data *tdata)
+{
+   int i, err = 0;
+
+   for (i = 0; i < entries; i++) {
+   struct test_obj *obj;
+   int key = (tdata->id << 16) | i;
+
+		obj = rhashtable_lookup_fast(&ht, &key, test_rht_params);
+   if (obj && (tdata->objs[i].value == TEST_INSERT_FAIL)) {
+   pr_err("  found unexpected object %d\n", key);
+   err++;
+   } else if (!obj && (tdata->objs[i].value != TEST_INSERT_FAIL)) {
+   pr_err("  object %d not found!\n", key);
+   err++;
+   } else if (obj && (obj->value != key)) {
+   pr_err("  wrong object returned (got %d, expected 
%d)\n",
+  obj->value, key);
+   err++;
+   }
+   }
+   return err;
+}
+
+static int threadfunc(void *data)
+{
+   int i, step, err = 0, insert_fails = 0;
+   struct thread_data *tdata = data;
+
+	up(&prestart_sem);
+	if (down_interruptible(&startup_sem))
+   pr_err("  thread[%d]: down_interruptible failed\n", tdata->id);
+
+   for (i = 0; i < entries; i++) {
+   tdata->objs[i].value = (tdata->id << 16) | i;
+		err = rhashtable_insert_fast(&ht, &tdata->objs[i].node,
+test_rht_params);
+   if (err == -ENOMEM || err == -EBUSY) {
+   tdata->objs[i].value = TEST_INSERT_FAIL;
+   insert_fails++;
+   } else if (err) {
+   pr_err("  thread[%d]: rhashtable_insert_fast failed\n",
+  tdata->id);
+   goto out;
+   }
+   }
+   if (insert_fails)
+   pr_info("  thread[%d]: %d insert failures\n",
+   tdata->id, insert_fails);
+
+   err = thread_lookup_test(tdata);
+   if (err) {
+   pr_err("  thread[%d]: rhashtable_lookup_test failed\n",
+  tdata->id);
+   goto out;
+   }
+
+   for (step = 10; step > 0; step--) {
+   for (i = 0; i < entries; i += step) {
+   if (tdata->objs[i].value == TEST_INSERT_FAIL)
+   continue;
+			err = rhashtable_remove_fast(&ht, &tdata->objs[i].node,
+test_rht_params);
+   if

Re: kthreads: sporadic NULL pointer dereference in exit_creds()

2015-08-14 Thread Phil Sutter
Hi,

I found the problem, it was a bug in my own code. For details see below:

On Wed, Aug 12, 2015 at 05:09:31PM +0200, Phil Sutter wrote:
[...]
> Here is the reproducer code (kthread_test.c) I used:
> 
> ---8<---
> #include 
> #include 
> #include 
> #include 
> 
> static int tcount = 100;
> module_param(tcount, int, 0);
> MODULE_PARM_DESC(tcount, "Number of threads to spawn (default: 100)");
> 
> static struct semaphore prestart_sem;
> static struct semaphore startup_sem = __SEMAPHORE_INITIALIZER(startup_sem, 0);
> 
> static int threadfunc(void *unused)
> {
>   up(&prestart_sem);
>   if (down_interruptible(&startup_sem))
>   pr_warn("thread: down_interruptible failed!\n");
>   printk(KERN_INFO "thread: running\n");
>   return 0;
> }

This function exits without waiting for kthread_should_stop() to return
true, which allows for do_fork() to do the cleanup (inside
wait_for_vfork_done()).

> static int __init init_kthread_test(void)
> {
>   struct task_struct **tsk;
>   int i, err;
> 
>   tsk = kzalloc(tcount * sizeof(struct task_struct *), GFP_KERNEL);
>   sema_init(&prestart_sem, 1 - tcount);
> 
>   printk(KERN_INFO "%s: starting test run\n", __func__);
> 
>   for (i = 0; i < tcount; i++) {
>   tsk[i] = kthread_run(threadfunc, NULL, "thread[%d]", i);
>   if (IS_ERR(tsk[i]))
>   pr_warn("%s: kthread_run failed for thread %d\n", 
> __func__, i);
>   else
>   printk(KERN_INFO "%s: started thread at %p\n", 
> __func__, tsk[i]);
>   }
> 
>   if (down_interruptible(&prestart_sem))
>   pr_warn("%s: down_interruptible failed!\n", __func__);
>   for (i = 0; i < tcount; i++)
>   up(&startup_sem);
> 
>   for (i = 0; i < tcount; i++) {
>   if (IS_ERR(tsk[i]))
>   continue;
>   if ((err = kthread_stop(tsk[i])))
>   pr_warn("%s: kthread_stop failed for thread %d: %d\n", 
> __func__, i, err);

Calling kthread_stop() for a thread that has already returned by itself
then leads to the problem of calling exit_creds() twice.

I wonder a bit if this should be covered for or not, as the call to
__put_task_struct is protected by a usage counter in struct task_struct.

I fixed the issue by looping over schedule() at the end of threadfunc()
until kthread_should_stop() returns true.
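
For reference, the fixed tail of threadfunc() then looks roughly like this
(a sketch; the set_current_state() calls are the usual idiom to avoid
busy-looping, not necessarily the literal change):

	set_current_state(TASK_INTERRUPTIBLE);
	while (!kthread_should_stop()) {
		schedule();
		set_current_state(TASK_INTERRUPTIBLE);
	}
	__set_current_state(TASK_RUNNING);
	return 0;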

Cheers, Phil




kthreads: sporadic NULL pointer dereference in exit_creds()

2015-08-12 Thread Phil Sutter
[please keep me on Cc: since I am not subscribed to this list]

Hi,

While enhancing lib/test_rhashtable.c by a few threads to provoke
concurrency issues, I encountered a bug in the kernel's cleanup routines
for kthreads. Upon calling kthread_stop(), it would occasionally call
exit_creds() for the same task_struct pointer twice, thereby crashing
the kernel in the second invocation due to dereferencing tsk->cred and
tsk->real_cred being NULL.

I managed to isolate the bug trigger into a small kernel module which:
1) creates a number of kernel threads
2) uses two semaphores to synchronise the thread startup and make the
   module init code wait for them to actually run before continuing
3) calls kthread_stop() for all started threads.

I could successfully verify the problem using Linus's tree at commit
58ccab91342c1cc1fe08da9b198ac5d763706c2e (4.2.0-rc6) on a qemu machine
with only a single virtual CPU. In order to analyze the problem, I've
added a printk() to exit_creds():

---8<---
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -152,6 +152,7 @@ void exit_creds(struct task_struct *tsk)
 {
struct cred *cred;
 
+   printk(KERN_INFO "%s: called for tsk at %p\n", __func__, tsk);
kdebug("exit_creds(%u,%p,%p,{%d,%d})", tsk->pid, tsk->real_cred, 
tsk->cred,
 	       atomic_read(&tsk->cred->usage),
   read_cred_subscribers(tsk->cred));
---8<---

Here is the kernel log after calling 'insmod ./kthread_test.ko tcount=20':

---8<---
[   49.566682] kthread_test: module verification failed: signature and/or 
required key missing - tainting kernel
[   49.570457] init_kthread_test: starting test run
[   49.576535] init_kthread_test: started thread at 88003ae18000
[   49.577552] init_kthread_test: started thread at 88003ae1a300
[   49.578585] init_kthread_test: started thread at 88003ae1c600
[   49.579580] init_kthread_test: started thread at 88003ae1d780
[   49.580565] init_kthread_test: started thread at 88003ae1e900
[   49.581553] init_kthread_test: started thread at 88003ac96900
[   49.582604] init_kthread_test: started thread at 88003ac95780
[   49.583607] init_kthread_test: started thread at 88003ac94600
[   49.584766] init_kthread_test: started thread at 88003ac93480
[   49.585783] init_kthread_test: started thread at 88003ac92300
[   49.586775] init_kthread_test: started thread at 88003ac91180
[   49.587772] init_kthread_test: started thread at 88003ac9
[   49.588763] init_kthread_test: started thread at 88003a9f
[   49.589760] init_kthread_test: started thread at 88003a9f1180
[   49.590755] init_kthread_test: started thread at 88003a9f2300
[   49.591747] init_kthread_test: started thread at 88003a9f3480
[   49.592742] init_kthread_test: started thread at 88003a9f4600
[   49.593731] init_kthread_test: started thread at 88003a9f5780
[   49.594722] init_kthread_test: started thread at 88003a9f6900
[   49.595719] init_kthread_test: started thread at 88003aa18000
[   49.596711] thread: running
[   49.597329] thread: running
[   49.597918] thread: running
[   49.598553] thread: running
[   49.599166] thread: running
[   49.599755] thread: running
[   49.600356] thread: running
[   49.601161] thread: running
[   49.602028] thread: running
[   49.613245] thread: running
[   49.613850] thread: running
[   49.614459] thread: running
[   49.615063] thread: running
[   49.615647] thread: running
[   49.616248] thread: running
[   49.616834] thread: running
[   49.617438] thread: running
[   49.618156] thread: running
[   49.618737] thread: running
[   49.619877] thread: running
[   49.620562] exit_creds: called for tsk at 88003ae18000
[   49.621438] exit_creds: called for tsk at 88003ae1a300
[   49.622304] exit_creds: called for tsk at 88003ae1c600
[   49.623152] exit_creds: called for tsk at 88003ae1d780
[   49.623991] exit_creds: called for tsk at 88003ae1e900
[   49.624838] exit_creds: called for tsk at 88003ac96900
[   49.625698] exit_creds: called for tsk at 88003ac95780
[   49.626557] exit_creds: called for tsk at 88003ac94600
[   49.627412] exit_creds: called for tsk at 88003ac92300
[   49.628256] exit_creds: called for tsk at 88003ac91180
[   49.629118] exit_creds: called for tsk at 88003ac9
[   49.629957] exit_creds: called for tsk at 88003a9f
[   49.630804] exit_creds: called for tsk at 88003a9f1180
[   49.631687] exit_creds: called for tsk at 88003a9f2300
[   49.632539] exit_creds: called for tsk at 88003a9f3480
[   49.633399] exit_creds: called for tsk at 88003a9f4600
[   49.635447] exit_creds: called for tsk at 88003a9f5780
[   49.636686] exit_creds: called for tsk at 88003a9f6900
[   49.638128] exit_creds: called for tsk at 


Re: 4.1 regression in resizable hashtable tests

2015-07-17 Thread Phil Sutter
On Fri, Jul 17, 2015 at 12:26:36PM +0200, Phil Sutter wrote:
> On Fri, Jul 17, 2015 at 10:04:56AM +0200, Thomas Graf wrote:
> > On 07/02/15 at 10:09pm, Meelis Roos wrote:
> > > [   33.425061] Running rhashtable test nelem=8, max_size=65536, 
> > > shrinking=0
> > > [   33.425154] Test 00:
> > > [   33.534470]   Adding 50000 keys
> > > [   34.743553] Info: encountered resize
> > > [   34.743698] Info: encountered resize
> > > [   34.743838] Info: encountered resize
> > > [   34.744057] Info: encountered resize
> > > [   34.744430] Info: encountered resize
> > > [   34.745139] Info: encountered resize
> > > [   34.746441] Info: encountered resize
> > > [   34.749055] Info: encountered resize
> > > [   34.754469] Info: encountered resize
> > > [   34.764836] Info: encountered resize
> > > [   34.785696] Info: encountered resize
> > > [   34.827448] Info: encountered resize
> > > [   34.896936]   Traversal complete: counted=49993, nelems=50000, entries=50000, table-jumps=12
> > > [   34.897056] Test failed: Total count mismatch ^^^
> > 
> > I do see count mismatches as well due to the design of the walker
> > which restarts and thus sees certain entries multiple times.
> > 
> > Do you have this commit as well?
> > 
> > Author: Phil Sutter 
> > Date:   Mon Jul 6 15:51:20 2015 +0200
> > 
> > rhashtable: fix for resize events during table walk
> 
> Thomas, this should be resolved already. Meelis replied[1] to my patch,
> stating it fixes that problem for him. Though he's still waiting for
> your proposed patch to add a schedule() call so the kernel won't
> complain on his slow UltraSparc. :)

Ah, nevermind. You sent it already with him in Cc.


Re: 4.1 regression in resizable hashtable tests

2015-07-17 Thread Phil Sutter
On Fri, Jul 17, 2015 at 10:04:56AM +0200, Thomas Graf wrote:
> On 07/02/15 at 10:09pm, Meelis Roos wrote:
> > [   33.425061] Running rhashtable test nelem=8, max_size=65536, shrinking=0
> > [   33.425154] Test 00:
> > [   33.534470]   Adding 50000 keys
> > [   34.743553] Info: encountered resize
> > [   34.743698] Info: encountered resize
> > [   34.743838] Info: encountered resize
> > [   34.744057] Info: encountered resize
> > [   34.744430] Info: encountered resize
> > [   34.745139] Info: encountered resize
> > [   34.746441] Info: encountered resize
> > [   34.749055] Info: encountered resize
> > [   34.754469] Info: encountered resize
> > [   34.764836] Info: encountered resize
> > [   34.785696] Info: encountered resize
> > [   34.827448] Info: encountered resize
> > [   34.896936]   Traversal complete: counted=49993, nelems=50000, entries=50000, table-jumps=12
> > [   34.897056] Test failed: Total count mismatch ^^^
> 
> I do see count mismatches as well due to the design of the walker
> which restarts and thus sees certain entries multiple times.
> 
> Do you have this commit as well?
> 
> Author: Phil Sutter 
> Date:   Mon Jul 6 15:51:20 2015 +0200
> 
> rhashtable: fix for resize events during table walk

Thomas, this should be resolved already. Meelis replied[1] to my patch,
stating it fixes that problem for him. Though he's still waiting for
your proposed patch to add a schedule() call so the kernel won't
complain on his slow UltraSparc. :)

Cheers, Phil

[1]: http://www.spinics.net/lists/netdev/msg335767.html




Re: [PATCH] rhashtable: fix for resize events during table walk

2015-07-06 Thread Phil Sutter
On Mon, Jul 06, 2015 at 09:30:40PM +0800, Herbert Xu wrote:
> On Mon, Jul 06, 2015 at 02:01:42PM +0200, Phil Sutter wrote:
> > diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> > index a60a6d3..e36b94b 100644
> > --- a/lib/rhashtable.c
> > +++ b/lib/rhashtable.c
> > @@ -585,6 +585,7 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
> > struct bucket_table *tbl = iter->walker->tbl;
> > struct rhashtable *ht = iter->ht;
> > struct rhash_head *p = iter->p;
> > +   void *rc = NULL;
> >  
> > if (p) {
> > p = rht_dereference_bucket_rcu(p->next, tbl, iter->slot);
> > @@ -617,12 +618,12 @@ next:
> > if (iter->walker->tbl) {
> > iter->slot = 0;
> > iter->skip = 0;
> > -   return ERR_PTR(-EAGAIN);
> > +   rc = ERR_PTR(-EAGAIN);
> > }
> >  
> > iter->p = NULL;
> 
> I think a simpler fix would be to move "iter->p = NULL" before
> the if statement.

Done. Thanks for the review!

Cheers, Phil


[PATCH v2] rhashtable: fix for resize events during table walk

2015-07-06 Thread Phil Sutter
If rhashtable_walk_next detects a resize operation in progress, it jumps
to the new table and continues walking that one. But it fails to drop
the reference to its current item, leading it to continue traversing
the new table's bucket into which the current item is sorted, and
after reaching that bucket's end continues traversing the new table's
second bucket instead of the first one, thereby potentially missing
items.

This fixes the rhashtable runtime test for me. Bug probably introduced
by Herbert Xu's patch eddee5ba ("rhashtable: Fix walker behaviour during
rehash") although not explicitly tested.

Fixes: eddee5ba ("rhashtable: Fix walker behaviour during rehash")
Signed-off-by: Phil Sutter 
---
Changes since v1:
- Use simplified solution suggested by Herbert Xu.
---
 lib/rhashtable.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index a60a6d3..cc0c697 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -610,6 +610,8 @@ next:
iter->skip = 0;
}
 
+   iter->p = NULL;
+
/* Ensure we see any new tables. */
smp_rmb();
 
@@ -620,8 +622,6 @@ next:
return ERR_PTR(-EAGAIN);
}
 
-   iter->p = NULL;
-
return NULL;
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
-- 
2.1.2



[PATCH] rhashtable: fix for resize events during table walk

2015-07-06 Thread Phil Sutter
If rhashtable_walk_next detects a resize operation in progress, it jumps
to the new table and continues walking that one. But it fails to drop
the reference to its current item, leading it to continue traversing
the new table's bucket into which the current item is sorted, and
after reaching that bucket's end continues traversing the new table's
second bucket instead of the first one, thereby potentially missing
items.

This fixes the rhashtable runtime test for me. Bug probably introduced
by Herbert Xu's patch eddee5ba ("rhashtable: Fix walker behaviour during
rehash") although not explicitly tested.

Fixes: eddee5ba ("rhashtable: Fix walker behaviour during rehash")
Signed-off-by: Phil Sutter 
---
 lib/rhashtable.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index a60a6d3..e36b94b 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -585,6 +585,7 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
struct bucket_table *tbl = iter->walker->tbl;
struct rhashtable *ht = iter->ht;
struct rhash_head *p = iter->p;
+   void *rc = NULL;
 
if (p) {
p = rht_dereference_bucket_rcu(p->next, tbl, iter->slot);
@@ -617,12 +618,12 @@ next:
if (iter->walker->tbl) {
iter->slot = 0;
iter->skip = 0;
-   return ERR_PTR(-EAGAIN);
+   rc = ERR_PTR(-EAGAIN);
}
 
iter->p = NULL;
 
-   return NULL;
+   return rc;
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
 
-- 
2.1.2



[PATCH] rhashtable: fix for resize events during table walk

2015-07-06 Thread Phil Sutter
If rhashtable_walk_next detects a resize operation in progress, it jumps
to the new table and continues walking that one. But it misses to drop
the reference to it's current item, leading it to continue traversing
the new table's bucket in which the current item is sorted into, and
after reaching that bucket's end continues traversing the new table's
second bucket instead of the first one, thereby potentially missing
items.

This fixes the rhashtable runtime test for me. Bug probably introduced
by Herbert Xu's patch eddee5ba (rhashtable: Fix walker behaviour during
rehash) although not explicitly tested.

Fixes: eddee5ba (rhashtable: Fix walker behaviour during rehash)
Signed-off-by: Phil Sutter p...@nwl.cc
---
 lib/rhashtable.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index a60a6d3..e36b94b 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -585,6 +585,7 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
struct bucket_table *tbl = iter-walker-tbl;
struct rhashtable *ht = iter-ht;
struct rhash_head *p = iter-p;
+   void *rc = NULL;
 
if (p) {
p = rht_dereference_bucket_rcu(p-next, tbl, iter-slot);
@@ -617,12 +618,12 @@ next:
if (iter-walker-tbl) {
iter-slot = 0;
iter-skip = 0;
-   return ERR_PTR(-EAGAIN);
+   rc = ERR_PTR(-EAGAIN);
}
 
iter-p = NULL;
 
-   return NULL;
+   return rc;
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
 
-- 
2.1.2

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] rhashtable: fix for resize events during table walk

2015-07-06 Thread Phil Sutter
If rhashtable_walk_next detects a resize operation in progress, it jumps
to the new table and continues walking that one. But it misses to drop
the reference to it's current item, leading it to continue traversing
the new table's bucket in which the current item is sorted into, and
after reaching that bucket's end continues traversing the new table's
second bucket instead of the first one, thereby potentially missing
items.

This fixes the rhashtable runtime test for me. Bug probably introduced
by Herbert Xu's patch eddee5ba (rhashtable: Fix walker behaviour during
rehash) although not explicitly tested.

Fixes: eddee5ba (rhashtable: Fix walker behaviour during rehash)
Signed-off-by: Phil Sutter p...@nwl.cc
---
Changes since v1:
- Use simplified solution suggested by Herbert Xu.
---
 lib/rhashtable.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index a60a6d3..cc0c697 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -610,6 +610,8 @@ next:
iter-skip = 0;
}
 
+   iter-p = NULL;
+
/* Ensure we see any new tables. */
smp_rmb();
 
@@ -620,8 +622,6 @@ next:
return ERR_PTR(-EAGAIN);
}
 
-   iter-p = NULL;
-
return NULL;
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
-- 
2.1.2

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rhashtable: fix for resize events during table walk

2015-07-06 Thread Phil Sutter
On Mon, Jul 06, 2015 at 09:30:40PM +0800, Herbert Xu wrote:
> On Mon, Jul 06, 2015 at 02:01:42PM +0200, Phil Sutter wrote:
> > diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> > index a60a6d3..e36b94b 100644
> > --- a/lib/rhashtable.c
> > +++ b/lib/rhashtable.c
> > @@ -585,6 +585,7 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
> > struct bucket_table *tbl = iter->walker->tbl;
> > struct rhashtable *ht = iter->ht;
> > struct rhash_head *p = iter->p;
> > +   void *rc = NULL;
> >
> > if (p) {
> > p = rht_dereference_bucket_rcu(p->next, tbl, iter->slot);
> > @@ -617,12 +618,12 @@ next:
> > if (iter->walker->tbl) {
> > iter->slot = 0;
> > iter->skip = 0;
> > -   return ERR_PTR(-EAGAIN);
> > +   rc = ERR_PTR(-EAGAIN);
> > }
> >
> > iter->p = NULL;
>
> I think a simpler fix would be to move iter->p = NULL before
> the if statement.

Done. Thanks for the review!

Cheers, Phil


Re: [PATCH] tun: orphan an skb on tx

2015-02-02 Thread Phil Sutter
Hi,

On Mon, Feb 02, 2015 at 07:27:10AM +, David Woodhouse wrote:
> On Sun, 2015-02-01 at 21:07 -0800, David Miller wrote:
> > From: David Woodhouse 
> > Date: Sun, 01 Feb 2015 21:29:43 +
> > 
> > > I really was looking for some way to push down something like an XFRM
> > > state into the tun device and just say "shove them out here until I tell
> > > you otherwise".
> > 
> > People decided to use TUN and push VPN stuff back into userspace,
> > and there are repercussions for that decision.
> > 
> > I'm not saying this to be mean or whatever, but I was very
> > disappointed when userland IPSEC solutions using TUN started showing
> > up.
> 
> Yeah. That's a valid criticism of vpnc, certainly. I never did
> understand why it reimplemented the IPSec stack.
> 
> For my OpenConnect client it's somewhat more justified though — the
> initial data transport there is over TLS, which the kernel doesn't
> support. And if we *can* establish UDP communication, that's over DTLS
> which the kernel doesn't support either. It's not even the *standard*
> version of DTLS because Cisco are still using a pre-RFC4347 version of
> the protocol. And we also need to probe the UDP connectivity and do
> keepalives and manage the fallback to using the TCP data transport.
> 
> It's not like vpnc where it really is just a case of setting up the ESP
> context and letting it run.
> 
> It's only now I've added Juniper support, which uses ESP-in-UDP for the
> data transport, that I'm doing something that the kernel supports at
> all. And now I'm looking at how to make use of that.
> 
> > We might as well have not have implemented the IPSEC stack at all,
> > because as a result of the userland VPN stuff our IPSEC stack is
> > largely unused except by a very narrow group of users.
> 
> Well, I'd love to make better use of it if I can. I do suspect it makes
> most sense for userspace to continue to manage the probing of UDP
> connectivity, and the fallback to TCP mode — and I suspect it also makes
> sense to continue to use tun for passing packets up to the VPN client
> when it's using the TCP transport.
> 
> So the question would be how we handle redirecting the packet flow to
> the optional UDP transport, when the VPN client determines that it's
> available. For the sake of the user setting up firewall and routing
> rules, I do think it's important that it continues to appear to
> userspace as the *same* device for the entire lifetime of the session,
> regardless of which transport the packets happen to be using at a given
> moment in time. It doesn't *have* to be tun, though. 

Since you want to provide connectivity over HTTPS, which is not possible
in kernel space, you are stuck with keeping the tun device. So the
packet flow in that case is identical to how e.g. OpenVPN does it:

- tunX holds default route
- OpenConnect then:
  - receives packets on /dev/tun
  - holds TCP socket to VPN concentrator
  - does encapsulation into TLS
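
Just for illustration, a minimal, untested sketch of the userspace side
of that flow (error paths, the TLS session and the main read/write loop
are left out, and "tun0" is merely an example name):

	#include <fcntl.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/if.h>
	#include <linux/if_tun.h>

	static int tun_open(const char *name)
	{
		struct ifreq ifr;
		int fd = open("/dev/net/tun", O_RDWR);

		if (fd < 0)
			return -1;
		memset(&ifr, 0, sizeof(ifr));
		ifr.ifr_flags = IFF_TUN | IFF_NO_PI;
		strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
		if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
			close(fd);
			return -1;
		}
		return fd;
	}

read() on that fd then hands the client raw IP packets which were routed
via tunX, to be wrapped into the TLS connection, and write() injects
decrypted packets from the concentrator back into the stack.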

Speaking of optimisation, the interesting part is the alternative flow
via IPsec in UDP. AFAICT, it should be possible to set up an ESP-in-UDP
tunnel using XFRM (see ip-xfrm(8) for reference), although I didn't try
that myself. The funny thing with XFRM is that it applies before the
routing decision does: if my IPsec policy matches, the packet goes that
way no matter what the routing table says about the original
destination. This can be used to override the default route provided via
tun0 in the above case.
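
Roughly along these lines (untested; all addresses, keys and the SPI are
made up here, and the reverse-direction state plus the dir in/fwd
policies are omitted):

	# client 192.0.2.10, concentrator 198.51.100.1, inner address 10.8.0.2
	ip xfrm state add src 192.0.2.10 dst 198.51.100.1 \
		proto esp spi 0x1000 mode tunnel \
		enc 'cbc(aes)' 0x00112233445566778899aabbccddeeff \
		auth 'hmac(sha256)' 0x000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f \
		encap espinudp 4500 4500 0.0.0.0
	ip xfrm policy add src 10.8.0.2/32 dst 0.0.0.0/0 dir out \
		tmpl src 192.0.2.10 dst 198.51.100.1 proto esp mode tunnel

The dir out policy is what makes packets for arbitrary destinations take
the ESP-in-UDP path regardless of the default route pointing at tunX.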

Of course, OpenConnect has to manage all the XFRM/policy stuff on its
own, since switching from ESP in UDP back to TLS would mean tearing down
the XFRM tunnel. OpenConnect would have to set up (a limited) XFRM and
send test traffic to decide whether to set it up fully (if limited) or
tear it down again (if unlimited) so traffic arrives at tunX again.

In my opinion, this might work. The whole setup is probably about as
intuitive as the fact that kernel IPsec tunnel mode does not naturally
provide an interface of its own. Firewall setup on top of that might
become a matter of trial and error. Maybe having a VTI interface and
merely moving the default route, instead of fiddling with policies all
the time, would make things a little easier to comprehend, but it surely
adds some performance overhead.
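
As a hedged illustration only (same made-up addresses as above; the XFRM
states and policies would additionally need marks matching the key so
VTI picks them up):

	ip link add name vti0 type vti local 192.0.2.10 remote 198.51.100.1 key 42
	ip link set vti0 up
	ip route replace default dev vti0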

Cheers, Phil


Re: linux-next: Tree for Aug 7

2013-08-09 Thread Phil Sutter
On Wed, Aug 07, 2013 at 04:36:21PM -0700, David Miller wrote:
> From: David Miller 
> Date: Wed, 07 Aug 2013 16:27:48 -0700 (PDT)
> 
> > Look, I'm going to fix this myself, because I'm pretty tired of
> > waiting for the obvious fix.
> 
> Someone please test this:
> 
> diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
> index c623861..afc02a6 100644
> --- a/include/linux/etherdevice.h
> +++ b/include/linux/etherdevice.h
> @@ -29,6 +29,7 @@
>  
>  #ifdef __KERNEL__
>  extern __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev);
> +extern __be16 __eth_type_trans(struct sk_buff *skb, struct net_device *dev);
>  extern const struct header_ops eth_header_ops;
>  
>  extern int eth_header(struct sk_buff *skb, struct net_device *dev,
> diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
> index be1f64d..35dc1be 100644
> --- a/net/ethernet/eth.c
> +++ b/net/ethernet/eth.c
> @@ -146,6 +146,45 @@ int eth_rebuild_header(struct sk_buff *skb)
>  EXPORT_SYMBOL(eth_rebuild_header);
>  
>  /**
> + * __eth_type_trans - only determine the packet's protocol ID.
> + * @skb: packet
> + * @dev: device
> + */
> +__be16 __eth_type_trans(struct sk_buff *skb, struct net_device *dev)
> +{
> + struct ethhdr *eth = (struct ethhdr *) skb->data;
> +
> + /*
> +  * Some variants of DSA tagging don't have an ethertype field
> +  * at all, so we check here whether one of those tagging
> +  * variants has been configured on the receiving interface,
> +  * and if so, set skb->protocol without looking at the packet.
> +  */
> + if (netdev_uses_dsa_tags(dev))
> + return htons(ETH_P_DSA);
> + if (netdev_uses_trailer_tags(dev))
> + return htons(ETH_P_TRAILER);
> +
> + if (ntohs(eth->h_proto) >= ETH_P_802_3_MIN)
> + return eth->h_proto;
> +
> + /*
> +  *  This is a magic hack to spot IPX packets. Older Novell breaks
> +  *  the protocol design and runs IPX over 802.3 without an 802.2 LLC
> +  *  layer. We look for FFFF which isn't a used 802.2 SSAP/DSAP. This
> +  *  won't work for fault tolerant netware but does for the rest.
> +  */
> + if (skb->len >= 2 && *(unsigned short *)(skb->data) == 0xFFFF)
> + return htons(ETH_P_802_3);
> +
> + /*
> +  *  Real 802.2 LLC
> +  */
> + return htons(ETH_P_802_2);
> +}
> +EXPORT_SYMBOL(__eth_type_trans);
> +
> +/**
>   * eth_type_trans - determine the packet's protocol ID.
>   * @skb: received socket data
>   * @dev: receiving network device
> @@ -184,33 +223,7 @@ __be16 eth_type_trans(struct sk_buff *skb, struct 
> net_device *dev)
>   skb->pkt_type = PACKET_OTHERHOST;
>   }
>  
> - /*
> -  * Some variants of DSA tagging don't have an ethertype field
> -  * at all, so we check here whether one of those tagging
> -  * variants has been configured on the receiving interface,
> -  * and if so, set skb->protocol without looking at the packet.
> -  */
> - if (netdev_uses_dsa_tags(dev))
> - return htons(ETH_P_DSA);
> - if (netdev_uses_trailer_tags(dev))
> - return htons(ETH_P_TRAILER);
> -
> - if (ntohs(eth->h_proto) >= ETH_P_802_3_MIN)
> - return eth->h_proto;
> -
> - /*
> -  *  This is a magic hack to spot IPX packets. Older Novell breaks
> -  *  the protocol design and runs IPX over 802.3 without an 802.2 LLC
> -  *  layer. We look for FFFF which isn't a used 802.2 SSAP/DSAP. This
> -  *  won't work for fault tolerant netware but does for the rest.
> -  */
> - if (skb->len >= 2 && *(unsigned short *)(skb->data) == 0xFFFF)
> - return htons(ETH_P_802_3);
> -
> - /*
> -  *  Real 802.2 LLC
> -  */
> - return htons(ETH_P_802_2);
> + return __eth_type_trans(skb, dev);
>  }
>  EXPORT_SYMBOL(eth_type_trans);
>  
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 0c0f6c9..ec8e1c3 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2003,7 +2003,7 @@ static int tpacket_fill_skb(struct packet_sock *po, 
> struct sk_buff *skb,
>   return err;
>  
>   if (dev->type == ARPHRD_ETHER)
> - skb->protocol = eth_type_trans(skb, dev);
> + skb->protocol = __eth_type_trans(skb, dev);
>  
>   data += dev->hard_header_len;
>   to_write -= dev->hard_header_len;
> @@ -2332,13 +2332,13 @@ static int packet_snd(struct socket *sock,
>   sock_tx_timestamp(sk, &skb_shinfo(skb)->tx_flags);
>  
>   if (dev->type == ARPHRD_ETHER) {
> - skb->protocol = eth_type_trans(skb, dev);
> + skb->protocol = __eth_type_trans(skb, dev);
>   if (skb->protocol == htons(ETH_P_8021Q))
>   reserve += VLAN_HLEN;
>   } else {
>   skb->protocol = proto;
> - skb->dev = dev;
>   }
> + skb->dev = dev;
>  
>   if (!gso_type && (len > dev->mtu + reserve + extra_len)) {
>   err = 


Re: linux-next: Tree for Aug 7

2013-08-07 Thread Phil Sutter
On Wed, Aug 07, 2013 at 10:47:13AM -0700, David Miller wrote:
> From: Eric Dumazet 
> Date: Wed, 07 Aug 2013 09:40:09 -0700
> 
> > On Wed, 2013-08-07 at 18:22 +0200, Johannes Berg wrote:
> > 
> >> Maybe. I haven't tested it, but I'm thinking that skb->data doesn't
> >> point to the start of the data frame in this case, since we now call
> >> eth_type_trans() which pulls the ethernet header. So if the device just
> >> transmits skb->len starting from skb->data, it'll be wrong, no? That
> >> seems a basic assumption though.
> > 
> > Yes, it seems calling eth_type_trans() is not right here, and even could
> > crash.
> > 
> > Sorry, for being vague, I am a bit busy this morning.
> 
> Yes, this is absolutely the core problem, you absolute cannot
> call eth_type_trans() on the output path, it pulls off the
> ethernet header from the packet.  That can't possibly work.
> 
> I want a real fix submitted formally for this problem immediately,
> or else I'm reverting all of these changes this afternoon.

One could simply call skb_push(skb, ETH_HLEN) right after calling
eth_type_trans(skb, dev) in order to undo the 'data' and 'len'
adjustment. Not sure whether this kind of hack is the right way to go
here, or whether the whole "af_packet parses the ethernet header"
discussion should be reopened instead.
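
Purely as a sketch of what I mean (untested):

	if (dev->type == ARPHRD_ETHER) {
		skb->protocol = eth_type_trans(skb, dev);
		/* eth_type_trans() pulled the ethernet header; undo that so
		 * skb->data and skb->len still cover it on the tx path. */
		skb_push(skb, ETH_HLEN);
	}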

Best wishes, Phil


Re: linux-next: Tree for Aug 7

2013-08-07 Thread Phil Sutter
On Wed, Aug 07, 2013 at 10:29:18AM +0200, Sedat Dilek wrote:
> On Wed, Aug 7, 2013 at 7:54 AM, Stephen Rothwell  
> wrote:
> > Hi all,
> >
> > Changes since 20130806:
> >
> > The ext4 tree lost its build failure.
> >
> > The mvebu tree gained a build failure so I used the version from
> > next-20130806.
> >
> > The akpm tree gained conflicts against the ext4 tree.
> >
> > 
> >
> 
> [ CC some netdev and wireless folks ]
> 
> Yesterday, I discovered an issue with net-next.
> The patch in [1] fixed the problems in my network/wifi environment.
> Hannes confirmed that virtio_net are solved, too.
> Today's next-20130807 still needs it for affected people.
> 
> - Sedat -
> 
> [1] http://marc.info/?l=linux-netdev&m=137582524017840&w=2
> [2] http://marc.info/?l=linux-netdev&m=137583048219416&w=2
> [3] http://marc.info/?t=13757971288&r=1&w=2

Could you please try the attached patch? It limits parsing the ethernet
header (by calling eth_type_trans()) to cases where the configured
protocol is ETH_P_ALL, so at least for 802.1X this should fix the
problem.

The idea behind this patch is that users setting the protocol to
something else probably do know better and so should be left alone.

Best wishes, Phil
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index bbe1ece..66bc79c 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1932,8 +1932,6 @@ static int tpacket_fill_skb(struct packet_sock *po, 
struct sk_buff *skb,
 
ph.raw = frame;
 
-   skb->protocol = proto;
-   skb->dev = dev;
skb->priority = po->sk.sk_priority;
skb->mark = po->sk.sk_mark;
sock_tx_timestamp(&po->sk, &skb_shinfo(skb)->tx_flags);
@@ -2002,13 +2000,18 @@ static int tpacket_fill_skb(struct packet_sock *po, 
struct sk_buff *skb,
if (unlikely(err))
return err;
 
-   if (dev->type == ARPHRD_ETHER)
-   skb->protocol = eth_type_trans(skb, dev);
-
data += dev->hard_header_len;
to_write -= dev->hard_header_len;
}
 
+   if (dev->type == ARPHRD_ETHER &&
+   proto == htons(ETH_P_ALL)) {
+   skb->protocol = eth_type_trans(skb, dev);
+   } else {
+   skb->protocol = proto;
+   skb->dev = dev;
+   }
+
max_frame_len = dev->mtu + dev->hard_header_len;
if (skb->protocol == htons(ETH_P_8021Q))
max_frame_len += VLAN_HLEN;
@@ -2331,15 +2334,17 @@ static int packet_snd(struct socket *sock,
 
sock_tx_timestamp(sk, &skb_shinfo(skb)->tx_flags);
 
-   if (dev->type == ARPHRD_ETHER) {
+   if (dev->type == ARPHRD_ETHER &&
+   proto == htons(ETH_P_ALL)) {
skb->protocol = eth_type_trans(skb, dev);
-   if (skb->protocol == htons(ETH_P_8021Q))
-   reserve += VLAN_HLEN;
} else {
skb->protocol = proto;
skb->dev = dev;
}
 
+   if (skb->protocol == htons(ETH_P_8021Q))
+   reserve += VLAN_HLEN;
+
if (!gso_type && (len > dev->mtu + reserve + extra_len)) {
err = -EMSGSIZE;
goto out_free;

