Re: [PATCH 0/3] Fix some seq_file users that were recently broken

2021-02-07 Thread NeilBrown
On Fri, Feb 05 2021, Andrew Morton wrote:

> On Fri, 05 Feb 2021 11:36:30 +1100 NeilBrown  wrote:
>
>> A recent change to seq_file broke some users which were using seq_file
>> in a non-"standard" way ...  though the "standard" isn't documented, so
>> they can be excused.  The result is a possible leak - of memory in one
>> case, of references to a 'transport' in the other.
>> 
>> These three patches:
>>  1/ document and explain the problem
>>  2/ fix the problem user in x86
>>  3/ fix the problem user in net/sctp
>> 
>
> 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and
> interface") was August 2018, so I don't think "recent" applies here?

I must be getting old :-(

>
> I didn't look closely, but it appears that the sctp procfs file is
> world-readable.  So we gave unprivileged userspace the ability to leak
> kernel memory?

Not quite that bad.  The x86 problem allows arbitrary memory to be
leaked, but that is in debugfs (as I'm sure you saw) so is root-only.
The sctp one only allows an sctp_transport to be pinned.  That is not
good and would, e.g., prevent the module from being unloaded.  But
unlike a simple memory leak it won't affect anything outside of sctp.

>
> So I'm thinking that we aim for 5.12-rc1 on all three patches with a 
> cc:stable?

Thanks for looking after these!

NeilBrown




[PATCH 3/3] net: fix iteration for sctp transport seq_files

2021-02-04 Thread NeilBrown
The sctp transport seq_file iterators take a reference to the transport
in the ->start and ->next functions and release the reference in the
->show function.  The preferred handling for such resources is to
release them in the subsequent ->next or ->stop function call.
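
In outline, the preferred pattern looks something like this.  This is
an illustrative sketch only - demo_*, get_next_entry() and put_entry()
are made-up names, not the sctp code; the real change is in the diff
below:

	static void *demo_seq_next(struct seq_file *seq, void *v, loff_t *pos)
	{
		if (v)
			put_entry(v);		/* release the previous entry here ... */
		++*pos;
		return get_next_entry(seq);	/* NULL at end of table */
	}

	static void demo_seq_stop(struct seq_file *seq, void *v)
	{
		if (v)
			put_entry(v);		/* ... and here, never in ->show */
	}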

Since Commit 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration
code and interface") there is no guarantee that ->show will be called
after ->next, so this function can now leak references.

So move the sctp_transport_put() call from ->show to ->next and ->stop.

Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface")
Reported-by: Xin Long 
Signed-off-by: NeilBrown 
---
 net/sctp/proc.c |   16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/net/sctp/proc.c b/net/sctp/proc.c
index f7da88ae20a5..982a87b3e11f 100644
--- a/net/sctp/proc.c
+++ b/net/sctp/proc.c
@@ -215,6 +215,12 @@ static void sctp_transport_seq_stop(struct seq_file *seq, void *v)
 {
struct sctp_ht_iter *iter = seq->private;
 
+   if (v && v != SEQ_START_TOKEN) {
+   struct sctp_transport *transport = v;
+
+   sctp_transport_put(transport);
+   }
+
sctp_transport_walk_stop(&iter->hti);
 }
 
@@ -222,6 +228,12 @@ static void *sctp_transport_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 {
struct sctp_ht_iter *iter = seq->private;
 
+   if (v && v != SEQ_START_TOKEN) {
+   struct sctp_transport *transport = v;
+
+   sctp_transport_put(transport);
+   }
+
++*pos;
 
return sctp_transport_get_next(seq_file_net(seq), &iter->hti);
@@ -277,8 +289,6 @@ static int sctp_assocs_seq_show(struct seq_file *seq, void *v)
sk->sk_rcvbuf);
seq_printf(seq, "\n");
 
-   sctp_transport_put(transport);
-
return 0;
 }
 
@@ -354,8 +364,6 @@ static int sctp_remaddr_seq_show(struct seq_file *seq, void *v)
seq_printf(seq, "\n");
}
 
-   sctp_transport_put(transport);
-
return 0;
 }
 




[PATCH 0/3] Fix some seq_file users that were recently broken

2021-02-04 Thread NeilBrown
A recent change to seq_file broke some users which were using seq_file
in a non-"standard" way ...  though the "standard" isn't documented, so
they can be excused.  The result is a possible leak - of memory in one
case, of references to a 'transport' in the other.

These three patches:
 1/ document and explain the problem
 2/ fix the problem user in x86
 3/ fix the problem user in net/sctp

I suspect the patches should each go through the relevant subsystems,
but I'm including akpm as the original went through him.

Thanks,
NeilBrown

---

NeilBrown (3):
  seq_file: document how per-entry resources are managed.
  x86: fix seq_file iteration for pat/memtype.c
  net: fix iteration for sctp transport seq_files

 Documentation/filesystems/seq_file.rst |  6 ++
 arch/x86/mm/pat/memtype.c  |  4 ++--
 net/sctp/proc.c| 16 
 3 files changed, 20 insertions(+), 6 deletions(-)

--
Signature



Re: [PATCH AUTOSEL 5.4 26/35] SUNRPC: defer slow parts of rpc_free_client() to a workqueue.

2020-05-07 Thread NeilBrown
On Thu, May 07 2020, Sasha Levin wrote:

> From: NeilBrown 
>
> [ Upstream commit 7c4310ff56422ea43418305d22bbc5fe19150ec4 ]

This one is buggy - it introduces a use-after-free.  Best delay it for
now.

NeilBrown

>
> The rpciod workqueue is on the write-out path for freeing dirty memory,
> so it is important that it never block waiting for memory to be
> allocated - this can lead to a deadlock.
>
> rpc_execute() - which is often called by an rpciod work item - calls
> rcp_task_release_client() which can lead to rpc_free_client().
>
> rpc_free_client() makes two calls which could potentially block waiting
> for memory allocation.
>
> rpc_clnt_debugfs_unregister() calls into debugfs and will block while
> any of the debugfs files are being accessed.  In particular it can block
> while any of the 'open' methods are being called and all of these use
> malloc for one thing or another.  So this can deadlock if the memory
> allocation waits for NFS to complete some writes via rpciod.
>
> rpc_clnt_remove_pipedir() can take the inode_lock() and while it isn't
> obvious that memory allocations can happen while the lock is held, it is
> safer to assume they might and to not let rpciod call
> rpc_clnt_remove_pipedir().
>
> So this patch moves these two calls (together with the final kfree() and
> rpciod_down()) into a work-item to be run from the system work-queue.
> rpciod can continue its important work, and the final stages of the free
> can happen whenever they happen.
>
> I have seen this deadlock on a 4.12 based kernel where debugfs used
> synchronize_srcu() when removing objects.  synchronize_srcu() requires a
> workqueue and there were no free worker threads and none could be
> allocated.  While debugfs no longer uses SRCU, I believe the deadlock
> is still possible.
>
> Signed-off-by: NeilBrown 
> Signed-off-by: Trond Myklebust 
> Signed-off-by: Sasha Levin 
> ---
>  include/linux/sunrpc/clnt.h |  8 +++-
>  net/sunrpc/clnt.c   | 21 +
>  2 files changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
> index abc63bd1be2b5..d99d39d45a494 100644
> --- a/include/linux/sunrpc/clnt.h
> +++ b/include/linux/sunrpc/clnt.h
> @@ -71,7 +71,13 @@ struct rpc_clnt {
>  #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
>   struct dentry   *cl_debugfs;/* debugfs directory */
>  #endif
> - struct rpc_xprt_itercl_xpi;
> + /* cl_work is only needed after cl_xpi is no longer used,
> +  * and they are of similar size
> +  */
> + union {
> + struct rpc_xprt_itercl_xpi;
> + struct work_struct  cl_work;
> + };
>   const struct cred   *cl_cred;
>  };
>  
> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> index f7f78566be463..a7430b66c7389 100644
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -877,6 +877,20 @@ EXPORT_SYMBOL_GPL(rpc_shutdown_client);
>  /*
>   * Free an RPC client
>   */
> +static void rpc_free_client_work(struct work_struct *work)
> +{
> + struct rpc_clnt *clnt = container_of(work, struct rpc_clnt, cl_work);
> +
> + /* These might block on processes that might allocate memory,
> +  * so they cannot be called in rpciod, so they are handled separately
> +  * here.
> +  */
> + rpc_clnt_debugfs_unregister(clnt);
> + rpc_clnt_remove_pipedir(clnt);
> +
> + kfree(clnt);
> + rpciod_down();
> +}
>  static struct rpc_clnt *
>  rpc_free_client(struct rpc_clnt *clnt)
>  {
> @@ -887,17 +901,16 @@ rpc_free_client(struct rpc_clnt *clnt)
>   rcu_dereference(clnt->cl_xprt)->servername);
>   if (clnt->cl_parent != clnt)
>   parent = clnt->cl_parent;
> - rpc_clnt_debugfs_unregister(clnt);
> - rpc_clnt_remove_pipedir(clnt);
>   rpc_unregister_client(clnt);
>   rpc_free_iostats(clnt->cl_metrics);
>   clnt->cl_metrics = NULL;
>   xprt_put(rcu_dereference_raw(clnt->cl_xprt));
>   xprt_iter_destroy(&clnt->cl_xpi);
> - rpciod_down();
>   put_cred(clnt->cl_cred);
>   rpc_free_clid(clnt);
> - kfree(clnt);
> +
> + INIT_WORK(&clnt->cl_work, rpc_free_client_work);
> + schedule_work(&clnt->cl_work);
>   return parent;
>  }
>  
> -- 
> 2.20.1




Re: [PATCH net] rhashtable: fix sparse RCU warnings on bit lock in bucket pointer

2019-05-15 Thread NeilBrown
On Wed, May 15 2019, Jakub Kicinski wrote:

> Since the bit_spin_lock() operations don't actually dereference
> the pointer, it's fine to forcefully drop the RCU annotation.
> This fixes 7 sparse warnings per include site.
>
> Fixes: 8f0db018006a ("rhashtable: use bit_spin_locks to protect hash bucket.")
> Signed-off-by: Jakub Kicinski 
> Reviewed-by: Simon Horman 

Hi, sorry for not responding to your initial post, but I'm otherwise
engaged this week and cannot give it any real time.  I don't object to
this patch, but I'll try to have a proper look next week, if only to
find out how I didn't get the warnings, as I was testing with sparse.

Thanks,
NeilBrown


> ---
>  include/linux/rhashtable.h | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
> index f7714d3b46bd..bea1e0440ab4 100644
> --- a/include/linux/rhashtable.h
> +++ b/include/linux/rhashtable.h
> @@ -325,27 +325,27 @@ static inline struct rhash_lock_head __rcu **rht_bucket_insert(
>   */
>  
>  static inline void rht_lock(struct bucket_table *tbl,
> - struct rhash_lock_head **bkt)
> + struct rhash_lock_head __rcu **bkt)
>  {
>   local_bh_disable();
> - bit_spin_lock(0, (unsigned long *)bkt);
> + bit_spin_lock(0, (unsigned long __force *)bkt);
>   lock_map_acquire(&tbl->dep_map);
>  }
>  
>  static inline void rht_lock_nested(struct bucket_table *tbl,
> -struct rhash_lock_head **bucket,
> +struct rhash_lock_head __rcu **bkt,
>  unsigned int subclass)
>  {
>   local_bh_disable();
> - bit_spin_lock(0, (unsigned long *)bucket);
> + bit_spin_lock(0, (unsigned long __force *)bkt);
>   lock_acquire_exclusive(&tbl->dep_map, subclass, 0, NULL, _THIS_IP_);
>  }
>  
>  static inline void rht_unlock(struct bucket_table *tbl,
> -   struct rhash_lock_head **bkt)
> +   struct rhash_lock_head __rcu **bkt)
>  {
>   lock_map_release(&tbl->dep_map);
> - bit_spin_unlock(0, (unsigned long *)bkt);
> + bit_spin_unlock(0, (unsigned long __force *)bkt);
>   local_bh_enable();
>  }
>  
> -- 
> 2.21.0




RE: [PATCH 3/4] rhashtable: use bit_spin_locks to protect hash bucket.

2019-04-03 Thread NeilBrown
On Wed, Apr 03 2019, David Laight wrote:

> From: NeilBrown
>> Sent: 02 April 2019 22:11
>> 
>> On Tue, Apr 02 2019, David Laight wrote:
>> 
>> > From: NeilBrown
>> >> Sent: 02 April 2019 00:08
>> >>
>> >> This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
>> >> bucket pointer to lock the hash chain for that bucket.
>> > ...
>> >> To enhance type checking, a new struct is introduced to represent the
>> >>   pointer plus lock-bit
>> >> that is stored in the bucket-table.  This is "struct rhash_lock_head"
>> >> and is empty.  A pointer to this needs to be cast to either an
>> >> unsigned lock, or a "struct rhash_head *" to be useful.
>> >> Variables of this type are most often called "bkt".
>> >
>> > Did you try using a union of the pointer and an 'unsigned long' ?
>> > Should remove a lot of the casts.
>> 
>> It might, but I'm not sure it is what we want.
>> The value is not an unsigned long OR a pointer, it is both blended
>> together.  So it really isn't a union.
>> We *want* it to require casts to access, so that it is clear that
>> something unusual is happening, and care is needed.
>
> Right, but you also want to make it hard to forget to do it properly.
> Using a union to access the memory as either a pointer or a long
> is perfectly valid (and is valid with 'strict aliasing' enabled).
> (Rather than the other use of a union to just save space.)

Agreed, but I personally think that a union makes it easy to forget.

>
> An interesting thought
> Are the only valid actions 'lock and read', and 'unlock with optional update'?

No, there is also "read without locking" - use for lookups with RCU
protection.  But yes: the set of valid actions is quite limited.

> If so there are only 2 bits of code that need to understand the encoding.
> If you make the bit number(s) and polarity properties of the architecture
> you should be able to make the stored value an invalid pointer (locked
> and unlocked) on at least some architectures.

I'd rather avoid writing arch-specific code if I can avoid it.  It isn't
clear that the value of your proposal justifies the cost.
Over-loading the low-order bits of a pointer is (I think) a well
understood pattern.  I'd rather stick with such patterns.
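
For anyone following along, the pattern is roughly this - a sketch with
made-up names (node, bucket_head(), bucket_lock()), not the actual
rhashtable code.  It relies on nodes being at least 2-byte aligned, so
bit 0 of a valid pointer is always 0 and can carry the lock:

	#include <linux/bit_spinlock.h>

	struct node {
		struct node *next;
		int key;
	};

	/* readers: recover the real pointer by masking off the lock bit */
	static inline struct node *bucket_head(const unsigned long *bkt)
	{
		return (struct node *)(*bkt & ~1UL);
	}

	/* writers: take/drop the per-bucket lock stored in the pointer word */
	static inline void bucket_lock(unsigned long *bkt)
	{
		bit_spin_lock(0, bkt);
	}

	static inline void bucket_unlock(unsigned long *bkt)
	{
		bit_spin_unlock(0, bkt);
	}

So lookups can still traverse the chain under RCU without taking the
lock, while insert/remove serialise on the bucket by setting bit 0.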

Thanks,
NeilBrown


>
>   David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> 1PT, UK
> Registration No: 1397386 (Wales)




Re: [PATCH 0/3]: net: dsa: mt7530: support MT7530 in the MT7621 SoC

2018-12-16 Thread NeilBrown
On Sun, Dec 16 2018, Florian Fainelli wrote:

> On December 16, 2018 3:19:22 PM PST, NeilBrown  wrote:
>>On Sun, Dec 16 2018, David Miller wrote:
>>
>>> From: NeilBrown 
>>> Date: Mon, 17 Dec 2018 09:08:54 +1100
>>>
>>>> In my 4.4 kernel, the build_skb() call in (the equivalent of)
>>>> mtk_poll_rx() takes about 1.2usec and the call to napi_gro_receive()
>>>> takes about 3usec.
>>>> 
>>>> In my 4.20 kernel, these calls take about 30 and 24 usec
>>respectively.
>>>> This easily explains the slowdown.
>>>
>>> That's a huge difference.
>>>
>>> Nothing jumps out as a possible cause except perhaps retpoline or
>>> something like that.
>>
>>I'll keep that in mind - thanks.
>>
>>My guess was CPU-cache invalidation.
>>I just checked and the other CPU core (there are two - each
>>hyper-threaded - "other" meaning not the one that handles ethernet
>>interrupts) gets several thousand "IPI resched" interrupts while
>>running a 10 second (226MByte) iperf3 receive test.
>>About 17KB transferred per IPI.
>>I cannot see where build_skb() would do cache invalidation though.
>
> It doesn't; the driver is responsible for that. How is coherency maintained
> between cores?

I suspect so - yes.  Coherency only needs explicit management when DMA
is used.  This wasn't the problem.

Further investigation showed that the problem was that I had
CONFIG_SLUB_DEBUG set.  That was probably useful in some earlier
debugging exercise, but it clearly isn't useful when performance-testing
the network.
I removed that and I have much nicer numbers - not quite the consistent
900+ that I saw with 4.4, but a lot closer.

Thanks for the encouragement, and sorry for the noise.

NeilBrown


>
> The IPI could be due to receive packet steering; is the MAC multi-queue aware
> on the RX path?
> -- 
> Florian




Re: [PATCH 0/3]: net: dsa: mt7530: support MT7530 in the MT7621 SoC

2018-12-16 Thread NeilBrown
On Sun, Dec 16 2018, David Miller wrote:

> From: NeilBrown 
> Date: Mon, 17 Dec 2018 09:08:54 +1100
>
>> In my 4.4 kernel, the build_skb() call in (the equivalent of)
>> mtk_poll_rx() takes about 1.2usec and the call to napi_gro_receive()
>> takes about 3usec.
>> 
>> In my 4.20 kernel, these calls take about 30 and 24 usec respectively.
>> This easily explains the slowdown.
>
> That's a huge difference.
>
> Nothing jumps out as a possible cause except perhaps retpoline or
> something like that.

I'll keep that in mind - thanks.

My guess was CPU-cache invalidation.
I just checked and the other CPU core (there are two - each
hyper-threaded - "other" meaning not the one that handles ethernet
interrupts) gets several thousand "IPI resched" interrupts while
running a 10 second (226MByte) iperf3 receive test.
About 17KB transferred per IPI.
I cannot see where build_skb() would do cache invalidation though.

Thanks,
NeilBrown




Re: [PATCH 0/3]: net: dsa: mt7530: support MT7530 in the MT7621 SoC

2018-12-16 Thread NeilBrown
On Tue, Dec 11 2018, NeilBrown wrote:

>
> I got your patch working on 4.20-rc5 and did a performance comparison.
> With the staging driver (using iperf3) I get
>   220 MBit/sec in
>   680 MBit/sec out
>
> with the patched mainline driver I get
>   190 MBit/sec in
>93 MBit/sec out
>
> (numbers are a bit rubbery, but within 10%)
>
> I haven't looked into why this might be, but thought I would mention it.
>
> Strangely when I test with scp, I get about 10MB/sec in both directions
> with both drivers.  Maybe the CPU limits encryption speed.
>
> I have a 4.4-based kernel where I get 940MBit/sec both ways - using a
> precursor of the current staging driver.

Just FYI, I've been looking further into this, and I don't think the
problem is (entirely) related to the driver.

In my 4.4 kernel, the build_skb() call in (the equivalent of)
mtk_poll_rx() takes about 1.2usec and the call to napi_gro_receive()
takes about 3usec.

In my 4.20 kernel, these calls take about 30 and 24 usec respectively.
This easily explains the slowdown.
I don't yet know why, and won't have time to look for a few days.
I haven't looked into how this affects the
drivers/net/ethernet/mediatek driver in 4.20.

If anyone has ideas about why these might be so slow, I'd love to hear
them.

Thanks,
NeilBrown




Re: [PATCH net-next] rhashtable: further improve stability of rhashtable_walk

2018-12-12 Thread NeilBrown
On Thu, Dec 13 2018, Herbert Xu wrote:

> On Wed, Dec 12, 2018 at 07:49:08PM +1100, NeilBrown wrote:
>> On Wed, Dec 12 2018, Herbert Xu wrote:
>> 
>> > On Wed, Dec 12, 2018 at 05:41:29PM +1100, NeilBrown wrote:
>> >>
>> >> So you would substantially slow down the rhashtable_walk_start() step.
>> >
>> > This whole thing is meant for uses such as /proc and netlink
>> > enumeration.  Speed is definitely not a prerogative of these
>> > iterators.
>> 
>> An RCU grace period is typically 10s of milliseconds.  It doesn't take
>> very many of those to start being noticed.
>
> Why would you even need an RCU grace period for deleting and adding
> the walker object to the bucket list? You just allocate a new one
> each time and free the old one after an RCU grace period.  I don't
> see where the latency is coming from.

Yes, you could rcu_free the old one and allocate a new one.  Then you
would have to be ready to deal with memory allocation failure which
complicates usage (I already don't like that rhashtable_insert() can
report -ENOMEM!).

Another problem with inserting a marker object on rhashtable_walk_stop()
is that it might not be clear where to insert it.
You would have to take the spinlock, and then you might find that the
location that you want to mark has been deleted from the list.  I think
you can probably still manage a reliable placement, but it could get
messy.

>
>> > For that matter, if speed was an issue then surely you would
>> > not drop out of the RCU read lock at all while iterating.
>> 
>> tipc_nl_sk_walk() drops out of RCU for every object.
>> I don't know what it is used for, but I doubt that making it thousands
>> of times slower would be a good thing.
>
> Now you're conflating two different things.  Dropping the RCU
> isn't necessarily slow.  We were talking about waiting for an
> RCU grace period which would only come into play if you were
> suspending the walk indefinitely.  Actually as I said above even
> there you don't really need to wait.

How would rhashtable_walk_stop() know if it was indefinite or not?

Yes, if you allocate a new marker on each rhashtable_walk_start()
(passing in a gfp_t and coping with errors) then you don't need to wait
a grace period.  If you reuse one to avoid the errors, you do.

>
>> > It sounds to me like you're trying to use this interface for
>> > something that it's simply not designed to do.
>> 
>> I'm just trying to make sure we have a reliable data structure which I
>> can use without having to be worried that I might accidentally hit some
>> unsupported behaviour.
>
> Now that is something I agree with.
>
>> I don't really see why you think my patch is all that complex.
>> The only conceptual change is that hash chains are now ordered, and
>> ordered chains are easy to understand.
>> The biggest part of the code change is just moving some code from
>> rhashtable_walk_start() to rhashtable_walk_next().
>> 
>> Why don't you like it?
>
> My main beef is the fact that it doesn't work with rhlist.  So down
> the track either we'll have to add more complexity for rhlist or
> we will have to rip this out again do something completely different.

You have previously said that functionality which isn't needed by current
code should not be a concern.  No current code ever stops and restarts
an rhlist walk, so this shouldn't be relevant.  Solve that problem
when/if we come to it.

If this is really the main beef, then let's turn the question around.
Suppose this patch had landed before rhltables had landed.  How would we
implement rhltables without breaking this functionality?

Keeping all the objects with the same key in a linked list is clearly a
good idea - make this a circular list as there is no obvious "first".

*Not* keeping them all in the hash chain is ideal, but not essential.
I see three costs with this.
One is that we would compare the same key multiple times for lookup.
How much of a problem is that? A failing compare is usually quite quick,
and most rhltable uses have inline memcmp for comparison (admittedly not
all).

The second cost is tracking the chain length against elasticity.
We could flag one object with each key as a 'master' (low bit of the
'next' pointer) and only count the masters.  When lookup raced with
remove this might get a slightly incorrect count, but I don't think that
hurts.

Finally, there is more pointer chasing as the chains are longer.
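
To be concrete about the 'master' flag above, I'm imagining something
like this (made-up names, and obviously not real rhashtable/rhlist
code) - the low bit of ->next marks the one entry per key that counts
towards elasticity:

	struct entry {
		struct entry *next;	/* low bit set on the per-key 'master' */
	};

	static inline struct entry *entry_next(const struct entry *e)
	{
		return (struct entry *)((unsigned long)e->next & ~1UL);
	}

	static inline bool entry_is_master(const struct entry *e)
	{
		return (unsigned long)e->next & 1UL;
	}

	/* chain length as seen by the elasticity check: count masters only */
	static unsigned int chain_len(const struct entry *head)
	{
		const struct entry *e;
		unsigned int n = 0;

		for (e = head; e; e = entry_next(e))
			if (entry_is_master(e))
				n++;
		return n;
	}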

Was performance part of the reason for adding rhltables?  It isn't
mentioned in the commit message, but it is easy to miss things.

Thanks,
NeilBrown



>
> Cheers,
> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt




Re: [PATCH net-next] rhashtable: further improve stability of rhashtable_walk

2018-12-12 Thread NeilBrown
On Wed, Dec 12 2018, Herbert Xu wrote:

> On Wed, Dec 12, 2018 at 05:41:29PM +1100, NeilBrown wrote:
>>
>> So you would substantially slow down the rhashtable_walk_start() step.
>
> This whole thing is meant for uses such as /proc and netlink
> enumeration.  Speed is definitely not a prerogative of these
> iterators.

An RCU grace period is typically 10s of milliseconds.  It doesn't take
very many of those to start being noticed.

>
> For that matter, if speed was an issue then surely you would
> not drop out of the RCU read lock at all while iterating.

tipc_nl_sk_walk() drops out of RCU for every object.
I don't know what it is used for, but I doubt that making it thousands
of times slower would be a good thing.

>
> It sounds to me like you're trying to use this interface for
> something that it's simply not designed to do.

I'm just trying to make sure we have a reliable data structure which I
can use without having to be worried that I might accidentally hit some
unsupported behaviour.

I don't really see why you think my patch is all that complex.
The only conceptual change is that hash chains are now ordered, and
ordered chains are easy to understand.
The biggest part of the code change is just moving some code from
rhashtable_walk_start() to rhashtable_walk_next().

Why don't you like it?

Thanks,
NeilBrown


>
> Cheers,
> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt




Re: [PATCH net-next] rhashtable: further improve stability of rhashtable_walk

2018-12-11 Thread NeilBrown
On Wed, Dec 12 2018, Herbert Xu wrote:

> On Wed, Dec 12, 2018 at 11:02:35AM +1100, NeilBrown wrote:
>> 
>> So I think this is a real bug - it is quite unlikely to hit, but
>> possibly.
>> You need a chain with at least 2 objects, you need
>> rhashtable_walk_stop() to be called after visiting an object other than
>> the last object, and you need some thread (this or some other) to remove
>> that object from the table.
>> 
>> The patch that I posted aims to fix that bug, and only that bug.
>> The only alternative that I can think of is to document that this can
>> happen and advise that a reference should be held to the last visited
>> object when stop/start is called, or in some other way ensure that it
>> doesn't get removed.
>
> Thanks for reminding me of the issue you were trying to fix.
>
> So going back into the email archives, I suggested at the very
> start that we could just insert the walker objects into the actual
> hash table.  That would solve the issue for both rhashtable and
> rhlist.
>
> Could we do that rather than using this ordered list that only
> works for rhashtable?

No. that doesn't work.
When you remove the walker object from the hash chain, you need to wait
for the RCU grace period to expire before you can safely insert back
into the chain. Inserting into a different chain isn't quite so bad now
that the nulls-marker stuff is working: a lookup thread will notice the
move and retry the lookup.

So you would substantially slow down the rhashtable_walk_start() step.
I've tried to think of various ways to over come this problem, such as
walking backwards through each chain - it is fairly safe to move and
object earlier in the chain - but all the approaches I have tried make
the code much more complex.

Thanks,
NeilBrown


>
> Cheers,
> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt




Re: [PATCH net-next] rhashtable: further improve stability of rhashtable_walk

2018-12-11 Thread NeilBrown
On Tue, Dec 11 2018, Herbert Xu wrote:

> Hi Neil:
>
> On Mon, Dec 10, 2018 at 09:50:43AM +1100, NeilBrown wrote:
>>  I think it was agreed that I would not pursue features that were only
>>  of use to out-of-tree code, but I don't think that applies here.  This
>>  is not a feature, this is a quality-of-implementation improvement.
>>  There are users in the kernel today which use
>>rhashtable_walk_stop()/rhashtable_walk_start()
>>  to drop out of RCU protection for periods during the walk.
>>  Any such user might miss seeing an object that has been in the table
>>  for a while - sure that is less than optimal, and should be fixed if
>>  the cost is small.
>> 
>>  There are currently no rhlist users which use stop/start to drop out
>>  of RCU, so there is no clear value in fixing that case, or cost in not
>>  fixing it.
>
> I think the complexities of this patch outweighs any benefit.
>
> Walking an rhashtable is inherently unreliable, due to rehashing.
> If you need an 100% reliable walking mechanism, then add your own
> linked list alongside the hash table like xfrm does.
>
> Having said that, if the current code is missing items that existed
> prior to the start of the walk (in the absence of a rehash) then
> that would be a bug that we should fix.  Anything else like
> duplicates just needs to be accepted by the caller.
>
> So please explain how can an object be missed then we can work on
> fixing that and that only.

The commit comment gives the context:

  If the sequence:
 obj = rhashtable_walk_next(iter);
 rhashtable_walk_stop(iter);
 rhashtable_remove_fast(ht, &obj->head, params);
 rhashtable_walk_start(iter);

   races with another thread inserting or removing
   an object on the same hash chain, a subsequent
   rhashtable_walk_next() is not guaranteed to get the "next"
   object. It is possible that an object could be
   repeated, or missed.

The rhashtable_walk_start() at the end of this sequence will find that
iter->p is not null (it will be precisely &obj->head) and will look for
it in the chain, but will not find it (because of the remove_fast).  So
it sets iter->p to NULL.  This causes rhashtable_walk_next() to fall
through to __rhashtable_walk_find_next() which looks for the entry
number item->skip in the chain for item->slot.
->skip will be the index for the entry just beyond obj->head.  Since
that has been removed, it is actually the index for the object one
further on.  So if obj->head was not the last entry in the chain, the
object after it will be missed.

The code in tipc_nl_sk_walk() is fairly similar to this pattern in that
the sock_put() call could potentially result in a call to
rhashtable_remove_fast().
It avoids the problem by ensuring that the rhashtable_remove_fast() is
*after* the rhashtable_walk_start().

If the rhashtable_remove_fast() happened from some other thread due to a
race, the walk could still miss an object.
Currently, the only way to protect against this is to hold a reference
to an object across rhashtable_walk_stop()/rhashtable_walk_start().
Sometimes that is easy to do, other times - not so much.
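
In code, the safe shape looks roughly like this (a sketch; obj_hold(),
obj_put() and process() are made up - tipc uses sock_hold()/sock_put()):

	rhashtable_walk_start(&iter);
	while ((obj = rhashtable_walk_next(&iter)) != NULL) {
		if (IS_ERR(obj))
			continue;		/* -EAGAIN after a rehash */
		if (!obj_hold(obj))		/* pin the current object */
			continue;
		rhashtable_walk_stop(&iter);	/* leave RCU; may sleep now */

		process(obj);

		rhashtable_walk_start(&iter);	/* restart the walk ... */
		obj_put(obj);			/* ... before dropping the pin */
	}
	rhashtable_walk_stop(&iter);

Any removal triggered by that obj_put() can then only happen after
rhashtable_walk_start(), which is the ordering tipc relies on.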

So I think this is a real bug - it is quite unlikely to hit, but
possibly.
You need a chain with at least 2 objects, you need
rhashtable_walk_stop() to be called after visiting an object other than
the last object, and you need some thread (this or some other) to remove
that object from the table.

The patch that I posted aims to fix that bug, and only that bug.
The only alternative that I can think of is to document that this can
happen and advise that a reference should be held to the last visited
object when stop/start is called, or in some other way ensure that it
doesn't get removed.

Thanks,
NeilBrown


>
> Thanks,
> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt




Re: [PATCH 0/3]: net: dsa: mt7530: support MT7530 in the MT7621 SoC

2018-12-10 Thread NeilBrown
On Fri, Nov 30 2018, Bjørn Mork wrote:

> g...@kernel.org writes:
>
>> I have been working towards supporting the MT7530 switch as used in the
>> MediaTek MT7621 SoC. Unlike the MediaTek MT7623 the MT7621 is built around
>> a dual core MIPS CPU architecture. But underneath it is what appears to
>> be the same 7530 switch.
>
> Great!  Good to see someone pushing this idea forward.
>
>> The following 3 patches are more of an RFC than anything. They allow
>> use of the mt7530 dsa driver on this device - though with some issues
>> still to resolve. The primary change required is to not use the 7623
>> specific clock and regulator setup - none of that applies when using
>> the 7621 (and maybe other devices?). The other change required is to
>> set the 7530 MFC register CPU port number and enable bit.
>>
>> The unresolved issues I still have appear to be more related to the
>> MT7621 ethernet driver (drivers/staging/mt7621-eth/*). I am hoping
>> someone might have some ideas on these. I don't really have any good
>> documentation on the ethernet devices on the 7621, so I am kind of
>> working in the dark here.
>
> No offense, but the mt7621-eth driver in staging is horrible.  What both
> René and I have had some success with is adapting the mtk_eth_soc driver
> already in drivers/net/ethernet/mediatek/.  Yes, I know this is supposed
> to be for other SoCs, but the basic design is obviously the same.
>
> I have had some success with a first hackish attempt based on OpenWrt.
> You can find the early tree here, but note that my focus was basically
> getting one specific MT7621 board up and running:
> https://github.com/bmork/LEDE/tree/mt7621-with-mainline-eth-driver
>
> This patch has most of the necessary changes to enable that driver for
> MT7621:
> https://github.com/bmork/LEDE/commit/3293bc63f5461ca1eb0bbc4ed90145335e7e3404
>
> Not a big deal, as you can see.  There is of course a reason I didn't
> submit this here yet: It is by no means finished...  But it works. And I
> have both GMACs working with this driver, which was my primary goal.

Thanks a lot for posting this.  I'd be very happy to see the staging
driver disappear once this is ready and in mainline.

I got your patch working on 4.20-rc5 and did a performance comparison.
With the staging driver (using iperf3) I get
  220 MBit/sec in
  680 MBit/sec out

with the patched mainline driver I get
  190 MBit/sec in
   93 MBit/sec out

(numbers are a bit rubbery, but within 10%)

I haven't looked into why this might be, but thought I would mention it.

Strangely when I test with scp, I get about 10MB/sec in both directions
with both drivers.  Maybe the CPU limits encryption speed.

I have a 4.4-based kernel where I get 940MBit/sec both ways - using a
precursor of the current staging driver.

Thanks,
NeilBrown


>
>> 1. TX packets are not getting an IP header checksum via the normal
>>off-loaded checksumming when in DSA mode. I have to switch off
>>NETIF_F_IP_CSUM, so the software stack generates the checksum.
>>That checksum offloading works ok when not using the 7530 DSA driver.
>
> Hmm.  How do I test this?
>
>> 2. Maximal sized RX packets get silently dropped. So receive side packets
>that are large (perfect case is the all-but-last packets in a fragmented
>>larger packet) appear to be dropped at the mt7621 ethernet MAC level.
>>The 7530 MIB switch register counters show receive packets at the physical
>>switch port side and at the CPU switch port - but I get no packets
>>received or errors in the 7621 ethernet MAC. If I set the mtu of the
>>server at the other end a little smaller (a few bytes is enough) then
>>I get all the packets through. It seems like the DSA/VLAN tag bytes
>>are causing a too large packet to get silently dropped somewhere.
>
> Are you referring to the configured MTU size or some other maximal size?
> If MTU, then I don't seem to have this issue with the driver from
> drivers/net/ethernet/mediatek/.
>
>
>
> Bjørn




Re: [PATCH net-next] rhashtable: further improve stability of rhashtable_walk

2018-12-09 Thread NeilBrown
On Fri, Dec 07 2018, Herbert Xu wrote:

> On Wed, Dec 05, 2018 at 02:51:02PM +1100, NeilBrown wrote:
>> 
>> If the sequence:
>>obj = rhashtable_walk_next(iter);
>>rhashtable_walk_stop(iter);
>>rhashtable_remove_fast(ht, &obj->head, params);
>>rhashtable_walk_start(iter);
>> 
>>  races with another thread inserting or removing
>>  an object on the same hash chain, a subsequent
>>  rhashtable_walk_next() is not guaranteed to get the "next"
>>  object. It is possible that an object could be
>>  repeated, or missed.
>> 
>>  This can be made more reliable by keeping the objects in a hash chain
>>  sorted by memory address.  A subsequent rhashtable_walk_next()
>>  call can reliably find the correct position in the list, and thus
>>  find the 'next' object.
>> 
>>  It is not possible to take this approach with an rhltable as keeping
>>  the hash chain in order is not so easy.  When the first object with a
>>  given key is removed, it is replaced in the chain with the next
>>  object with the same key, and the address of that object may not be
>>  correctly ordered.
>>  I have not yet found any way to achieve the same stability
>>  with rhltables, that doesn't have a major impact on lookup
>>  or insert.  No code currently in Linux would benefit from
>>  such extra stability.
>> 
>>  With this patch:
>>  - a new object is always inserted after the last object with a
>>smaller address, or at the start.
>>  - when rhashtable_walk_start() is called, it records that 'p' is not
>>'safe', meaning that it cannot be dereferenced.  The revalidation
>>that was previously done here is moved to rhashtable_walk_next()
>>  - when rhashtable_walk_next() is called while p is not NULL and not
>>safe, it walks the chain looking for the first object with an
>>address greater than p and returns that.  If there is none, it moves
>>to the next hash chain.
>> 
>> Signed-off-by: NeilBrown 
>> ---
>> 
>> This is a resend of a patch that I sent back in July.  I couldn't
>> applied then because it assumed another rhashtable patch which hadn't
>> landed yet - it now has.
>
> I thought we had agreed to drop this because nobody needs it
> currently and it doesn't handle rhlist?

Hi Herbert,
 I think it was agreed that I would not pursue features that were only
 of use to out-of-tree code, but I don't think that applies here.  This
 is not a feature, this is a quality-of-implementation improvement.
 There are users in the kernel today which use
   rhashtable_walk_stop()/rhashtable_walk_start()
 to drop out of RCU protection for periods during the walk.
 Any such user might miss seeing an object that has been in the table
 for a while - sure that is less than optimal, and should be fixed if
 the cost is small.

 There are currently no rhlist users which use stop/start to drop out
 of RCU, so there is no clear value in fixing that case, or cost in not
 fixing it.

Thanks,
NeilBrown

>
> Cheers,
> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt




[PATCH net-next] rhashtable: further improve stability of rhashtable_walk

2018-12-04 Thread NeilBrown

If the sequence:
   obj = rhashtable_walk_next(iter);
   rhashtable_walk_stop(iter);
   rhashtable_remove_fast(ht, &obj->head, params);
   rhashtable_walk_start(iter);

 races with another thread inserting or removing
 an object on the same hash chain, a subsequent
 rhashtable_walk_next() is not guaranteed to get the "next"
 object. It is possible that an object could be
 repeated, or missed.

 This can be made more reliable by keeping the objects in a hash chain
 sorted by memory address.  A subsequent rhashtable_walk_next()
 call can reliably find the correct position in the list, and thus
 find the 'next' object.
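
 In rough pseudo-C (a sketch of the idea, not the exact code), resuming
 a walk then just means scanning the ordered chain for the first entry
 whose address is greater than the last pointer that was handed out:

	rht_for_each_rcu(he, tbl, iter->slot) {
		if (he > iter->p)
			return he;	/* this is the 'next' object */
	}
	/* nothing larger in this chain - move on to the next bucket */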

 It is not possible to take this approach with an rhltable as keeping
 the hash chain in order is not so easy.  When the first object with a
 given key is removed, it is replaced in the chain with the next
 object with the same key, and the address of that object may not be
 correctly ordered.
 I have not yet found any way to achieve the same stability
 with rhltables, that doesn't have a major impact on lookup
 or insert.  No code currently in Linux would benefit from
 such extra stability.

 With this patch:
 - a new object is always inserted after the last object with a
   smaller address, or at the start.
 - when rhashtable_walk_start() is called, it records that 'p' is not
   'safe', meaning that it cannot be dereferenced.  The revalidation
   that was previously done here is moved to rhashtable_walk_next()
 - when rhashtable_walk_next() is called while p is not NULL and not
   safe, it walks the chain looking for the first object with an
   address greater than p and returns that.  If there is none, it moves
   to the next hash chain.

Signed-off-by: NeilBrown 
---

This is a resend of a patch that I sent back in July.  I couldn't
applied then because it assumed another rhashtable patch which hadn't
landed yet - it now has.

NeilBrown

 include/linux/rhashtable-types.h |  1 +
 include/linux/rhashtable.h   | 10 -
 lib/rhashtable.c | 82 ++--
 3 files changed, 62 insertions(+), 31 deletions(-)

diff --git a/include/linux/rhashtable-types.h b/include/linux/rhashtable-types.h
index 763d613ce2c2..bc3e84547ba7 100644
--- a/include/linux/rhashtable-types.h
+++ b/include/linux/rhashtable-types.h
@@ -126,6 +126,7 @@ struct rhashtable_iter {
struct rhashtable_walker walker;
unsigned int slot;
unsigned int skip;
+   bool p_is_unsafe;
bool end_of_table;
 };
 
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 20f9c6af7473..4a8056b66bfb 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -635,7 +635,12 @@ static inline void *__rhashtable_insert_fast(
(params.obj_cmpfn ?
 params.obj_cmpfn(&arg, rht_obj(ht, head)) :
 rhashtable_compare(&arg, rht_obj(ht, head {
-   pprev = &head->next;
+   if (rhlist) {
+   pprev = &head->next;
+   } else {
+   if (head < obj)
+   pprev = &head->next;
+   }
continue;
}
 
@@ -1131,7 +1136,8 @@ static inline int rhashtable_walk_init(struct rhashtable *ht,
  * Note that if you restart a walk after rhashtable_walk_stop you
  * may see the same object twice.  Also, you may miss objects if
  * there are removals in between rhashtable_walk_stop and the next
- * call to rhashtable_walk_start.
+ * call to rhashtable_walk_start.  Note that this is different to
+ * rhashtable_walk_enter() which never misses objects.
  *
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 852ffa5160f1..d4154b9a29a1 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -225,6 +225,7 @@ static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)
struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
struct bucket_table *new_tbl = rhashtable_last_table(ht, old_tbl);
struct rhash_head __rcu **pprev = rht_bucket_var(old_tbl, old_hash);
+   struct rhash_head __rcu **inspos;
int err = -EAGAIN;
struct rhash_head *head, *next, *entry;
spinlock_t *new_bucket_lock;
@@ -253,12 +254,15 @@ static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)
new_bucket_lock = rht_bucket_lock(new_tbl, new_hash);
 
spin_lock_nested(new_bucket_lock, SINGLE_DEPTH_NESTING);
-   head = rht_dereference_bucket(new_tbl->buckets[new_hash],
- new_tbl, new_hash);
-
+   inspos = &new_tbl->buckets[new_hash];
+   head = rht_dereference_bucket(

[PATCH] rhashtable: detect when object movement between tables might have invalidated a lookup

2018-12-03 Thread NeilBrown

Some users of rhashtables might need to move an object from one table
to another -  this appears to be the reason for the incomplete usage
of NULLS markers.

To support these, we store a unique NULLS_MARKER at the end of
each chain, and when a search fails to find a match, we check
if the NULLS marker found was the expected one.  If not, the search
may not have examined all objects in the target bucket, so it is
repeated.

The unique NULLS_MARKER is derived from the address of the
head of the chain.  As this cannot be derived at load-time the
static rhnull in rht_bucket_nested() needs to be initialised
at run time.

Any caller of a lookup function must still be prepared for the
possibility that the object returned is in a different table - it
might have been there for some time.

Note that this does NOT provide support for other uses of
NULLS_MARKERs such as allocating with SLAB_TYPESAFE_BY_RCU or changing
the key of an object and re-inserting it in the same table.
These could only be done safely if new objects were inserted
at the *start* of a hash chain, and that is not currently the case.

Acked-by: Herbert Xu 
Signed-off-by: NeilBrown 
---

Thanks Herbert,
 here is the patch complete with the Acked-by incase that makes it easer
 to apply.

Thanks,
NeilBrown


 include/linux/rhashtable.h | 34 ++
 lib/rhashtable.c   |  8 +---
 2 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index eb7111039247..20f9c6af7473 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -75,8 +75,19 @@ struct bucket_table {
struct rhash_head __rcu *buckets[] cacheline_aligned_in_smp;
 };
 
+/*
+ * NULLS_MARKER() expects a hash value with the low
+ * bits most likely to be significant, and it discards
+ * the msb.
+ * We give it an address, in which the bottom 2 bits are
+ * always 0, and the msb might be significant.
+ * So we shift the address down one bit to align with
+ * expectations and avoid losing a significant bit.
+ */
+#define RHT_NULLS_MARKER(ptr)   \
+   ((void *)NULLS_MARKER(((unsigned long) (ptr)) >> 1))
 #define INIT_RHT_NULLS_HEAD(ptr)   \
-   ((ptr) = (typeof(ptr)) NULLS_MARKER(0))
+   ((ptr) = RHT_NULLS_MARKER(&(ptr)))
 
 static inline bool rht_is_a_nulls(const struct rhash_head *ptr)
 {
@@ -471,6 +482,7 @@ static inline struct rhash_head *__rhashtable_lookup(
.ht = ht,
.key = key,
};
+   struct rhash_head __rcu * const *head;
struct bucket_table *tbl;
struct rhash_head *he;
unsigned int hash;
@@ -478,13 +490,19 @@ static inline struct rhash_head *__rhashtable_lookup(
tbl = rht_dereference_rcu(ht->tbl, ht);
 restart:
hash = rht_key_hashfn(ht, tbl, key, params);
-   rht_for_each_rcu(he, tbl, hash) {
-   if (params.obj_cmpfn ?
-   params.obj_cmpfn(&arg, rht_obj(ht, he)) :
-   rhashtable_compare(&arg, rht_obj(ht, he)))
-   continue;
-   return he;
-   }
+   head = rht_bucket(tbl, hash);
+   do {
+   rht_for_each_rcu_continue(he, *head, tbl, hash) {
+   if (params.obj_cmpfn ?
+   params.obj_cmpfn(&arg, rht_obj(ht, he)) :
+   rhashtable_compare(&arg, rht_obj(ht, he)))
+   continue;
+   return he;
+   }
+   /* An object might have been moved to a different hash chain,
+* while we walk along it - better check and retry.
+*/
+   } while (he != RHT_NULLS_MARKER(head));
 
/* Ensure we see any new tables. */
smp_rmb();
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 30526afa8343..852ffa5160f1 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -1179,8 +1179,7 @@ struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
unsigned int hash)
 {
const unsigned int shift = PAGE_SHIFT - ilog2(sizeof(void *));
-   static struct rhash_head __rcu *rhnull =
-   (struct rhash_head __rcu *)NULLS_MARKER(0);
+   static struct rhash_head __rcu *rhnull;
unsigned int index = hash & ((1 << tbl->nest) - 1);
unsigned int size = tbl->size >> tbl->nest;
unsigned int subhash = hash;
@@ -1198,8 +1197,11 @@ struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
subhash >>= shift;
}
 
-   if (!ntbl)
+   if (!ntbl) {
+   if (!rhnull)
+   INIT_RHT_NULLS_HEAD(rhnull);
return &rhnull;
+   }
 
return &ntbl[subhash].bucket;
 
-- 
2.14.0.rc0.dirty





Re: [PATCH v3] rhashtable: detect when object movement between tables might have invalidated a lookup

2018-12-02 Thread NeilBrown
On Sat, Dec 01 2018, Herbert Xu wrote:

> On Fri, Nov 30, 2018 at 10:26:50AM +1100, NeilBrown wrote:
>>
>> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
>> index 30526afa8343..852ffa5160f1 100644
>> --- a/lib/rhashtable.c
>> +++ b/lib/rhashtable.c
>> @@ -1179,8 +1179,7 @@ struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
>>  unsigned int hash)
>>  {
>>  const unsigned int shift = PAGE_SHIFT - ilog2(sizeof(void *));
>> -static struct rhash_head __rcu *rhnull =
>> -(struct rhash_head __rcu *)NULLS_MARKER(0);
>> +static struct rhash_head __rcu *rhnull;
>>  unsigned int index = hash & ((1 << tbl->nest) - 1);
>>  unsigned int size = tbl->size >> tbl->nest;
>>  unsigned int subhash = hash;
>> @@ -1198,8 +1197,11 @@ struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
>>  subhash >>= shift;
>>  }
>>  
>> -if (!ntbl)
>> +if (!ntbl) {
>> +if (!rhnull)
>> +INIT_RHT_NULLS_HEAD(rhnull);
>>  return &rhnull;
>> +}
>
> I think you missed my earlier reply beause of bouncing emails.

Yeah, sorry about that.  I should have looked through an lkml archive
once I realized that was happening - I have now.

>
> I think this is unnecessary because
>
>   RHT_NULLS_MARKER(RHT_NULLS_MARKER(0)) = RHT_NULLS_MARKER(0)
>

I don't understand how this is relevant.

I think you are saying that when rht_bucket_nested() finds that the
target page hasn't been allocated, it should return a pointer to a
static variable which contains RHT_NULLS_MARKER(0)

 static struct rhash_head *rhnull = RHT_NULLS_MARKER(0);

Then in __rhashtable_lookup(),
head = rht_bucket(tbl, hash);

would result in 'head' having the value '&rhnull'.

Then
rht_for_each_rcu_continue(he, *head, tbl, hash) {

would result in 'he' having the value RHT_NULLS_MARKER(0)

Then
} while (he != RHT_NULLS_MARKER(head));

will compare RHT_NULLS_MARKER(0) with RHT_NULLS_MARKER(&rhnull)
and they will be different, so it will loop indefinitely.

With respect to the shifting, you wrote:

> The top-bit is most likely to be fixed and offer no real value.

While this might be likely, it is not certain, so not relevant.
On a 32bit machine with more than 2GB of physical memory, some memory
addresses will have 0 in the msb, some will have 1.
It is possible (however unlikely) that two hash buckets in different
tables will have the same address except for the msb.  If we ignore the
msb, we might incorrectly determine that we have reached the end of the
chain from the first bucket, whereas we actually reached the end of the
chain from the second bucket.

Thanks,
NeilBrown




[PATCH v3] rhashtable: detect when object movement between tables might have invalidated a lookup

2018-11-29 Thread NeilBrown

Some users of rhashtables might need to move an object from one table
to another -  this appears to be the reason for the incomplete usage
of NULLS markers.

To support these, we store a unique NULLS_MARKER at the end of
each chain, and when a search fails to find a match, we check
if the NULLS marker found was the expected one.  If not, the search
may not have examined all objects in the target bucket, so it is
repeated.

The unique NULLS_MARKER is derived from the address of the
head of the chain.  As this cannot be derived at load-time the
static rhnull in rht_bucket_nested() needs to be initialised
at run time.

Any caller of a lookup function must still be prepared for the
possibility that the object returned is in a different table - it
might have been there for some time.

Note that this does NOT provide support for other uses of
NULLS_MARKERs such as allocating with SLAB_TYPESAFE_BY_RCU or changing
the key of an object and re-inserting it in the same table.
These could only be done safely if new objects were inserted
at the *start* of a hash chain, and that is not currently the case.

Signed-off-by: NeilBrown 
---

This version has an added comment to explain the ">>1" in
RHT_NULLS_MARKER().

Thanks,
NeilBrown

 include/linux/rhashtable.h | 34 ++
 lib/rhashtable.c   |  8 +---
 2 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index eb7111039247..ae507af54800 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -75,8 +75,19 @@ struct bucket_table {
struct rhash_head __rcu *buckets[] cacheline_aligned_in_smp;
 };
 
+/*
+ * NULLS_MARKER() expects a hash value with the low
+ * bits most likely to be significant, and it discards
+ * the msb.
+ * We give it an address, in which the bottom 2 bits are
+ * always 0, and the msb might be significant.
+ * So we shift the address down one bit to align with
+ * expectations and avoid losing a significant bit.
+ */
+#define RHT_NULLS_MARKER(ptr)   \
+   ((void *)NULLS_MARKER(((unsigned long) (ptr)) >> 1))
 #define INIT_RHT_NULLS_HEAD(ptr)   \
-   ((ptr) = (typeof(ptr)) NULLS_MARKER(0))
+   ((ptr) = RHT_NULLS_MARKER(&(ptr)))
 
 static inline bool rht_is_a_nulls(const struct rhash_head *ptr)
 {
@@ -471,6 +482,7 @@ static inline struct rhash_head *__rhashtable_lookup(
.ht = ht,
.key = key,
};
+   struct rhash_head __rcu * const *head;
struct bucket_table *tbl;
struct rhash_head *he;
unsigned int hash;
@@ -478,13 +490,19 @@ static inline struct rhash_head *__rhashtable_lookup(
tbl = rht_dereference_rcu(ht->tbl, ht);
 restart:
hash = rht_key_hashfn(ht, tbl, key, params);
-   rht_for_each_rcu(he, tbl, hash) {
-   if (params.obj_cmpfn ?
-   params.obj_cmpfn(&arg, rht_obj(ht, he)) :
-   rhashtable_compare(&arg, rht_obj(ht, he)))
-   continue;
-   return he;
-   }
+   head = rht_bucket(tbl, hash);
+   do {
+   rht_for_each_rcu_continue(he, *head, tbl, hash) {
+   if (params.obj_cmpfn ?
+   params.obj_cmpfn(&arg, rht_obj(ht, he)) :
+   rhashtable_compare(&arg, rht_obj(ht, he)))
+   continue;
+   return he;
+   }
+   /* An object might have been moved to a different hash chain,
+* while we walk along it - better check and retry.
+*/
+   } while (he != RHT_NULLS_MARKER(head));
 
/* Ensure we see any new tables. */
smp_rmb();
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 30526afa8343..852ffa5160f1 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -1179,8 +1179,7 @@ struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
unsigned int hash)
 {
const unsigned int shift = PAGE_SHIFT - ilog2(sizeof(void *));
-   static struct rhash_head __rcu *rhnull =
-   (struct rhash_head __rcu *)NULLS_MARKER(0);
+   static struct rhash_head __rcu *rhnull;
unsigned int index = hash & ((1 << tbl->nest) - 1);
unsigned int size = tbl->size >> tbl->nest;
unsigned int subhash = hash;
@@ -1198,8 +1197,11 @@ struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
subhash >>= shift;
}
 
-   if (!ntbl)
+   if (!ntbl) {
+   if (!rhnull)
+   INIT_RHT_NULLS_HEAD(rhnull);
return &rhnull;
+   }
 
return &ntbl[subhash].bucket;
 
-- 
2.14.0.rc0.dirty





Re: [PATCH 2/5] rhashtable: don't hold lock on first table throughout insertion.

2018-07-24 Thread NeilBrown
On Tue, Jul 24 2018, Paul E. McKenney wrote:

> On Tue, Jul 24, 2018 at 07:52:03AM +1000, NeilBrown wrote:
>> On Mon, Jul 23 2018, Paul E. McKenney wrote:
>> 
>> > On Mon, Jul 23, 2018 at 09:13:43AM +1000, NeilBrown wrote:
>> >> On Sun, Jul 22 2018, Paul E. McKenney wrote:
>> >> >
>> >> > One issue is that the ->func pointer can legitimately be NULL while on
>> >> > RCU's callback lists.  This happens when someone invokes kfree_rcu()
>> >> > with the rcu_head structure at the beginning of the enclosing structure.
>> >> > I could add an offset to avoid this, or perhaps the kmalloc() folks
>> >> > could be persuaded Rao Shoaib's patch moving kfree_rcu() handling to
>> >> > the slab allocators, so that RCU only ever sees function pointers in
>> >> > the ->func field.
>> >> >
>> >> > Either way, this should be hidden behind an API to allow adjustments
>> >> > to be made if needed.  Maybe something like is_after_call_rcu()?
>> >> > This would (for example) allow debug-object checks to be used to catch
>> >> > check-after-free bugs.
>> >> >
>> >> > Would something of that sort work for you?
>> >> 
>> >> Yes, if you could provide an is_after_call_rcu() API, that would
>> >> perfectly suit my use-case.
>> >
>> > After beating my head against the object-debug code a bit, I have to ask
>> > if it would be OK for you if the is_after_call_rcu() API also takes the
>> > function that was passed to RCU.
>> 
>> Sure.  It feels a bit clumsy, but I can see it could be easier to make
>> robust.
>> So yes: I'm fine with pass the same function and rcu_head to both
>> call_rcu() and is_after_call_rcu().  Actually, when I say it like that,
>> it seems less clumsy :-)
>
> How about like this?  (It needs refinements, like lockdep, but should
> get the gist.)
>

Looks good ... except ... naming is hard.

 is_after_call_rcu_init()  asserts where in the lifecycle we are,
 is_after_call_rcu() tests where in the lifecycle we are.

 The names are similar but the purpose is quite different.
 Maybe s/is_after_call_rcu_init/call_rcu_init/ ??
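
To check my understanding, this is roughly how I would expect it to be
used - a sketch only, with made-up names (struct foo, foo_free_rcu(),
foo_alloc(), foo_usable()) and the API names taken from the patch below:

	struct foo {
		struct rcu_head rcu;
		/* payload ... */
	};

	static void foo_free_rcu(struct rcu_head *rhp)
	{
		kfree(container_of(rhp, struct foo, rcu));
	}

	static struct foo *foo_alloc(void)
	{
		struct foo *f = kzalloc(sizeof(*f), GFP_KERNEL);

		if (f)
			is_after_call_rcu_init(&f->rcu); /* before publishing */
		return f;
	}

	/* Under rcu_read_lock(), on an object found in a lockless structure.
	 * Returns false if call_rcu(&f->rcu, foo_free_rcu) has already been
	 * invoked, i.e. the object is already on its way to being freed.
	 */
	static bool foo_usable(struct foo *f)
	{
		return !is_after_call_rcu(&f->rcu, foo_free_rcu);
	}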

Thanks,
NeilBrown


>   Thanx, Paul
>
> 
>
> commit 5aa0ebf4799b8bddbbd0124db1c008526e99fc7c
> Author: Paul E. McKenney 
> Date:   Tue Jul 24 15:28:09 2018 -0700
>
> rcu: Provide functions for determining if call_rcu() has been invoked
> 
> This commit adds is_after_call_rcu() and is_after_call_rcu_init()
> functions to help RCU users detect when another CPU has passed
> the specified rcu_head structure and function to call_rcu().
> The is_after_call_rcu_init() should be invoked before making the
> structure visible to RCU readers, and then the is_after_call_rcu() may
> be invoked from within an RCU read-side critical section on an rcu_head
> structure that was obtained during a traversal of the data structure
> in question.  The is_after_call_rcu() function will return true if the
> rcu_head structure has already been passed (with the specified function)
> to call_rcu(), otherwise it will return false.
> 
> If is_after_call_rcu_init() has not been invoked on the rcu_head
> structure or if the rcu_head (AKA callback) has already been invoked,
> then is_after_call_rcu() will do WARN_ON_ONCE().
> 
> Reported-by: NeilBrown 
> Signed-off-by: Paul E. McKenney 
>
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index e4f821165d0b..82e5a91539b5 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -857,6 +857,45 @@ static inline notrace void 
> rcu_read_unlock_sched_notrace(void)
>  #endif /* #else #ifdef CONFIG_ARCH_WEAK_RELEASE_ACQUIRE */
>  
>  
> +/* Has the specified rcu_head structure been handed to call_rcu()? */
> +
> +/*
> + * is_after_call_rcu_init - Initialize rcu_head for is_after_call_rcu()
> + * @rhp: The rcu_head structure to initialize.
> + *
> + * If you intend to invoke is_after_call_rcu() to test whether a given
> + * rcu_head structure has already been passed to call_rcu(), then you must
> + * also invoke this is_after_call_rcu_init() function on it just after
> + * allocating that structure.  Calls to this function must not race with
> + * calls to call_rcu(), is_after_call_rcu(), or callback invocation.
> + */
> +static inline void is_after_call_rcu_init(struct rcu_head *rhp)
> +{
> + rhp-&

Re: [PATCH - revised] rhashtable: detect when object movement might have invalidated a lookup

2018-07-15 Thread NeilBrown
On Mon, Jul 16 2018, Herbert Xu wrote:

> On Mon, Jul 16, 2018 at 09:57:11AM +1000, NeilBrown wrote:
>> 
>> Some users of rhashtable might need to change the key
>> of an object and move it to a different location in the table.
>> Other users might want to allocate objects using
>> SLAB_TYPESAFE_BY_RCU which can result in the same memory allocation
>> being used for a different (type-compatible) purpose and similarly
>> end up in a different hash-chain.
>> 
>> To support these, we store a unique NULLS_MARKER at the end of
>> each chain, and when a search fails to find a match, we check
>> if the NULLS marker found was the expected one.  If not,
>> the search is repeated.
>> 
>> The unique NULLS_MARKER is derived from the address of the
>> head of the chain.
>> 
>> If an object is removed and re-added to the same hash chain, we won't
>> notice by looking at the NULLS marker.  In this case we must be sure
>> that it was not re-added *after* its original location, or a lookup may
>> incorrectly fail.  The easiest solution is to ensure it is inserted at
>> the start of the chain.  insert_slow() already does that,
>> insert_fast() does not.  So this patch changes insert_fast to always
>> insert at the head of the chain.
>> 
>> Note that such a user must do their own double-checking of
>> the object found by rhashtable_lookup_fast() after ensuring
>> mutual exclusion with anything that might change the key, such as
>> successfully taking a new reference.
>> 
>> Signed-off-by: NeilBrown 
>
> I still don't understand why we need this feature.  The only
> existing user of this (which currently doesn't use rhashtable)
> does not readd the reused entry to the same table.  IOW the flow
> is always from table A to table B.  After which the entry will
> be properly freed rather than reused.
>
> So who is going to use this?

I want it so I can use SLAB_TYPESAFE_BY_RCU slab caches in lustre.
lustre isn't in mainline any more so I cannot point at the code but the
concept is simple.
Without this, you need to use an RCU-deferred free (call_rcu() or similar)
for any object added to an rhashtable.
When kmalloc is used, kfree_rcu() can be used, which is fairly painless.
When a kmem_cache is used, you need to provide your own rcu free
function, which is clumsy.
With SLAB_TYPESAFE_BY_RCU, that isn't needed.  The object can be freed
immediately, providing you can cope with it being re-used (as the same
type) immediately.  Part of handling that is coping with the possibility
that it might be inserted into the same hash table, and possibly the
same chain, immediately after it is freed.
lustre has 6 different resizable hashtables which I want to convert
to use rhashtable.
I currently need call_rcu handlers for 3 of these.  I want those
3 to use SLAB_TYPESAFE_BY_RCU instead so they can use
kmem_cache_free() directly.  For this, I need rhashtable to be safe if
an object is deleted and immediately re-inserted into the same hash
chain.
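
The pattern I am after is simply this (a sketch, not actual lustre code -
'struct obj', obj_cache and obj_params are hypothetical names):

   static struct kmem_cache *obj_cache;

   static int obj_cache_init(void)
   {
           /* Memory freed to this cache can be reused immediately, but
            * only as another 'struct obj', so RCU readers never see it
            * change type.
            */
           obj_cache = kmem_cache_create("obj_cache", sizeof(struct obj),
                                         0, SLAB_TYPESAFE_BY_RCU, NULL);
           return obj_cache ? 0 : -ENOMEM;
   }

   static void obj_remove(struct rhashtable *ht, struct obj *obj)
   {
           rhashtable_remove_fast(ht, &obj->node, obj_params);
           /* No call_rcu()/kfree_rcu() needed - free immediately.
            * Concurrent lookups must re-check the key (and take a
            * reference) before trusting what they found.
            */
           kmem_cache_free(obj_cache, obj);
   }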

Thanks,
NeilBrown




Re: [PATCH 2/8] rhashtable: remove nulls_base and related code.

2018-05-07 Thread NeilBrown
On Mon, May 07 2018, Herbert Xu wrote:

> On Sun, May 06, 2018 at 07:37:49AM +1000, NeilBrown wrote:
>> 
>> I can see no evidence that this is required for anything, as it isn't
>> used and I'm fairly sure that in its current form - it cannot be used.
>
> Search for nulls in net/ipv4.  This is definitely used throughout
> the network stack.  As the aim is to convert as many existing hash
> tables to rhashtable as possible, we want to keep this.

Yes, I'm sure that nulls lists are very useful and I can see them used
in net/ipv4 (and I would like to use them myself).  I don't want to
remove the nulls list concept from rhashtable, and my patch doesn't do
that.

It just removes nulls_base (as the subject says) and stops setting any
particular value in the nulls pointer, because the value is currently
never tested.

The point of nulls lists is, as I'm sure you know, to detect if an
object was moved to a different chain and to restart the search in that
case.
This means that on an unsuccessful search, we need to test
get_nulls_value() of the tail pointer.
Several times in net/ipv4/inet_hashtables.c (for example) we see code
like:

/*
 * if the nulls value we got at the end of this lookup is
 * not the expected one, we must restart lookup.
 * We probably met an item that was moved to another chain.
 */
if (get_nulls_value(node) != slot)
goto begin;

which does exactly that.  This test-and-restart cannot be done by a
caller to rhashtable_lookup_fast() as the caller doesn't even get to
the tail nulls.
It would need to be done in rhashtable.[ch].
If you like, I can make this functionality work rather than just
removing the useless parts.

I would change INIT_RHT_NULLS_HEAD() to be
#define INIT_RHT_NULLS_HEAD(ptr, ht, hash) \
((ptr) = (void *)NULLS_MARKER(((unsigned long)&(ptr)) >> 1))

Then when __rhashtable_lookup() comes to the end of the chain
without finding anything it would do something like
  if (get_nulls_value(he)<<1 == (unsigned long)head)
goto restart;

where 'head' is the address of the bucket (the head of the list) that we
would have saved at the start.

i.e. instead of using a nulls_base and limiting the size of the hash, we
store the address of the bucket in the nulls.
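
Putting those pieces together, the unsuccessful-lookup path would look
roughly like this (a sketch of the idea only, assuming the marker encodes
the address of the bucket head as above; obj_matches_key() stands in for
the existing compare code):

   struct rhash_head __rcu **head;
   struct rhash_head *he;

restart:
   head = &tbl->buckets[rht_bucket_index(tbl, hash)];
   rht_for_each_rcu(he, tbl, hash) {
           if (obj_matches_key(ht, he, key))
                   return rht_obj(ht, he);
   }
   /* Unsuccessful search.  If the nulls marker we ended on is not the
    * one for the bucket we started from, an object we stepped through
    * was moved to a different chain and may have cut the search short,
    * so repeat the search.
    */
   if (get_nulls_value(he) << 1 != (unsigned long)head)
           goto restart;
   return NULL;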

In my mind that should be a separate patch.  First remove the unused,
harmful code.  Then add code to provide useful functionality.  But
I can put the two together in one patch if you would prefer that.

Thanks,
NeilBrown




Re: [PATCH 6/8] rhashtable: further improve stability of rhashtable_walk

2018-05-07 Thread NeilBrown
On Mon, May 07 2018, Herbert Xu wrote:

> On Sun, May 06, 2018 at 07:50:54AM +1000, NeilBrown wrote:
>>
>> Do we?  How could we fix it for both rhashtable and rhltable?
>
> Well I suggested earlier to insert the walker object into the
> hash table, which would be applicable regardless of whether it
> is a normal rhashtable of a rhltable.

Oh yes, that idea.  I had considered it and discarded it.
Moving a walker object around in an rcu-list is far from
straightforward.  A lockless iteration of the list would need to be very careful
not to be tripped up by the walker-object.  I don't think we want to
make the search code more complex.

I explored the idea a bit more and I think that any solution that we
came up with along those lines would look quite different for regular
rhash_tables than for rhl_tables.  So it is probably best to keep them
quite separate.
i.e. use the scheme I proposed for rhashtables and use something else
entirely for rhltables.

My current thinking for a stable walk for rhl tables goes like this:

- we embed
 struct rhl_cursor {
 struct rhl_cursor *next;
 int unseen;
 }
  in struct rhashtable_iter.

- on rhashtable_walk_stop(), we:
  + lock the current hash chain
  + find the next object that is still in the table (after ->p).
  + count the number of objects from there to the end of the
    rhlist_head.next list.
  + store that count in rhl_cursor.unseen
  + change the ->next pointer at the end of the list to point to the
rhl_cursor, but with the lsb set.  If there was already an
rhl_cursor there, they are chained together on ->next.

- on rhashtable_walk_start(), we lock the hash chain and search the hash
  chain and sub-chains for  ->p or the rhl_cursor.
  If ->p is still there, that can be used as a reliable anchor.
  If ->p isn't found but the rhl_cursor is, then the 'unseen' counter
  tells us where to start in that rhl_list.
  If neither are found, then we start at the beginning of the hash chain.
  If the rhl_cursor is found, it is removed.

- lookup and insert can almost completely ignore the cursor.
  rhl_for_each_rcu() needs to detect the end-of-list by looking for
  lsb set, but that is the only change.
  As _insert() adds new objects to the head of the rhlist, that
  won't disturb the accuracy of the 'unseen' count.  The newly inserted
  object won't be seen by the walk, but that is expected.

- delete needs some extra handling.
  + When an object is deleted from an rhlist and the list does not become
empty, then it counts the number of objects to the end of the list,
and possibly decrements the 'unseen' counter on any rhl_cursors that
are at the end of the list.
  + when an object is deleted resulting in an empty rhlist, any
cursors at the end of the list needs to be moved to the next
list in the hash chain, and their 'unseen' count needs to be set to
the length of the list.
If there is no next object in the hash chain, then the iter.slot is
incremented and the rhl_cursor is left unconnected.

- rehash needs to remove any cursors from the end of an rhlist before
  moving them to the new table.  The cursors are left disconnected.
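
In code, the additions would be roughly (a sketch only, not a patch):

   struct rhl_cursor {
           struct rhl_cursor *next;  /* other cursors parked at the end
                                      * of the same rhlist */
           int unseen;               /* objects in that rhlist not yet
                                      * returned to this walker */
   };

   struct rhashtable_iter {
           /* ... existing fields (ht, p, list, walker, slot, skip) ... */
           struct rhl_cursor cursor; /* linked, with the lsb set, where
                                      * the rhlist's tail pointer would
                                      * normally hold a nulls marker */
   };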

I'm happy to implement this if you think the approach is workable and
the functionality is necessary.  However there are no current users of
rhltable_walk_enter() that would benefit from this.


Thanks,
NeilBrown




Re: [PATCH 8/8] rhashtable: don't hold lock on first table throughout insertion.

2018-05-07 Thread NeilBrown
On Mon, May 07 2018, Herbert Xu wrote:

> On Mon, May 07, 2018 at 08:24:41AM +1000, NeilBrown wrote:
>>
>> This is true, but I don't see how it is relevant.
>> At some point, each thread will find that the table they have just
>> locked for their search key, has a NULL 'future_tbl' pointer.
>> At that point, the thread can know that the key is not in any table,
>> and that no other thread can add the key until the lock is released.
>
> The updating of future_tbl is not synchronised with insert threads.
> Therefore it is entirely possible that two inserters end up on
> different tables as their "latest" table.  This must not be allowed
> to occur.

I disagree.
Certainly the update of future_tbl is not synchronised with insert
threads.
However there is only a single update to any given future_tbl (from NULL
to non-NULL) and two insert threads for the same key will see that
update in a synchronized way as they look at it while holding the bucket
lock for that key.
It is certainly true that two inserters can end up on different
tables as their "latest" table, but that doesn't result in a problem.
e.g. T1 might see A as the latest table, and T2 might see B where
  A.future_tbl == B
In that case, T2 must have examined A (under the lock) *after* T1
examined A, and so will have seen if T1 inserted anything.
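
i.e. each inserter effectively does the following (a sketch of the logic
only, not the actual code):

   struct bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht);
   struct bucket_table *new_tbl;
   unsigned int hash;

   for (;;) {
           hash = rht_head_hashfn(ht, tbl, obj, ht->p);
           spin_lock_bh(rht_bucket_lock(tbl, hash));

           /* ... search this table's bucket for the key ... */

           new_tbl = rht_dereference_rcu(tbl->future_tbl, ht);
           if (!new_tbl)
                   /* No future_tbl while we hold the bucket lock: this
                    * is the last table, and nobody else can insert this
                    * key anywhere until we drop the lock.
                    */
                   break;

           spin_unlock_bh(rht_bucket_lock(tbl, hash));
           tbl = new_tbl;
   }
   /* still holding the bucket lock on the last table: safe to insert */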

Thanks,
NeilBrown




Re: [PATCH 8/8] rhashtable: don't hold lock on first table throughout insertion.

2018-05-06 Thread NeilBrown
On Sun, May 06 2018, Herbert Xu wrote:

> On Sun, May 06, 2018 at 08:00:49AM +1000, NeilBrown wrote:
>>
>> The insert function must (and does) take the lock on the bucket before
>> testing if there is a "next" table.
>> If one inserter finds that it has locked the "last" table (because there
>> is no next) and successfully inserts, then the other inserter cannot
>> have locked that table yet, else it would have inserted.  When it does,
>> it will find what the first inserter inserted. 
>
> If you release the lock to the first table then it may be deleted
> by the resize thread.  Hence the other inserter may not have even
> started from the same place.

This is true, but I don't see how it is relevant.
At some point, each thread will find that the table they have just
locked for their search key, has a NULL 'future_tbl' pointer.
At that point, the thread can know that the key is not in any table,
and that no other thread can add the key until the lock is released.

Thanks,
NeilBrown




Re: [PATCH 7/8] rhashtable: add rhashtable_walk_prev()

2018-05-06 Thread NeilBrown
On Sat, May 05 2018, Tom Herbert wrote:

> On Sat, May 5, 2018 at 2:43 AM, Herbert Xu  
> wrote:
>> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>>> rhashtable_walk_prev() returns the object returned by
>>> the previous rhashtable_walk_next(), providing it is still in the
>>> table (or was during this grace period).
>>> This works even if rhashtable_walk_stop() and rhashtable_walk_start()
>>> have been called since the last rhashtable_walk_next().
>>>
>>> If there have been no calls to rhashtable_walk_next(), or if the
>>> object is gone from the table, then NULL is returned.
>>>
>>> This can usefully be used in a seq_file ->start() function.
>>> If the pos is the same as was returned by the last ->next() call,
>>> then rhashtable_walk_prev() can be used to re-establish the
>>> current location in the table.  If it returns NULL, then
>>> rhashtable_walk_next() should be used.
>>>
>>> Signed-off-by: NeilBrown 
>>
>> I will ack this if Tom is OK with replacing peek with it.
>>
> I'm not following why this is any better than peek. The point of
> having rhashtable_walk_peek is to allow the caller to get the
> current element, not the next one. This is needed when the table is read in
> multiple parts and we need to pick up with what was returned from the
> last call to rhashtable_walk_next (which apparently is what this patch
> is also trying to do).
>
> There is one significant difference in that peek will return the
> element in the table regardless of where the iterator is at (this is
> why peek can move the iterator) and only returns NULL at end of table.
> As mentioned above rhashtable_walk_prev can return NULL, so the caller
> then needs to fall back to rhashtable_walk_next ("should be used") to get
> the previous element. Doing a peek is a lot cleaner and more straightforward API in
> this regard.

Thanks for the review.  I agree with a lot of what you say about the
behavior of the different implementations.
One important difference is the documentation.  The current
documentation for rhashtable_walk_peek() is wrong.   For example it says
that the function doesn't change the iterator, but sometimes it does.
The first rhashtable patch I submitted tried to fix this up, but it is
hard to document the function clearly because it really is doing one of
two different things.  It returns the previous element if it still
exists, or it returns the next one.  With my rhashtable_walk_prev(),
that can be done with
  rhashtable_walk_prev() ?: rhashtable_walk_next();

Both of these functions can easily be documented clearly.
We could combine the two as you have done, but "peek" does seem like a
good name.  "prev_or_next" is more honest, but somewhat clumsy.
Whether that is a good thing or not is partly a matter of taste, and we
seem to be on opposite sides of that fence.
There is a practical aspect to it though.

Lustre has a debugfs seq_file which shows all the cached pages of all
the cached objects.  The objects are in a hashtable (which I want to
change to an rhashtable).  So the seq_file iterator has both an
rhashtable iterator an offset in the object.

When we restart a walk, we might be in the middle of some object - but
that object might have been removed from the cache, so we would need to
go on to the first page of the next object.
Using my interface I can do

 obj = rhashtable_walk_prev(&iter.rhash_iter);
 offset = iter.offset;
 if (!obj) {
obj = rhashtable_walk_next(&iter.rhash_iter)
offset = 0;
 }

I could achieve something similar with your interface if I kept an extra
copy of the previous object and compared with the value returned by
rhashtable_walk_peek(), but (to me) that seems like double handling.
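
i.e. something like this ('last_obj' being a hypothetical extra field I
would have to add to my iterator):

   obj = rhashtable_walk_peek(&iter.rhash_iter);
   if (obj && obj == iter.last_obj)
           offset = iter.offset;   /* same object - resume mid-object */
   else
           offset = 0;             /* previous object gone (or end of
                                    * table) - start from the first page */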

Thanks,
NeilBrown




Re: [PATCH 4/8] rhashtable: fix race in nested_table_alloc()

2018-05-06 Thread NeilBrown
On Sun, May 06 2018, Herbert Xu wrote:

> On Sun, May 06, 2018 at 07:48:20AM +1000, NeilBrown wrote:
>>
>> The spinlock protects 2 or more buckets.  The nested table contains at
>> least 512 buckets, maybe more.
>> It is quite possible for two insertions into 2 different buckets to both
>> get their spinlock and both try to instantiate the same nested table.
>
> I think you missed the fact that when we use nested tables the spin
> lock table is limited to just a single page and hence corresponds
> to the first level in the nested table.  Therefore it's always safe.

Yes I had missed that - thanks for pointing it out.
In fact the lock table is limited to the number of nested_tables
in the second level.
And it is the same low-order bits that choose both the lock
and the set of nested tables.
So there isn't a bug here.  So we don't need this patch. (I still like
it though - it seems more obviously correct).

Thanks,
NeilBrown




Re: [PATCH 8/8] rhashtable: don't hold lock on first table throughout insertion.

2018-05-05 Thread NeilBrown
On Sat, May 05 2018, Herbert Xu wrote:

> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>> rhashtable_try_insert() currently holds a lock on the bucket in
>> the first table, while also locking buckets in subsequent tables.
>> This is unnecessary and looks like a hold-over from some earlier
>> version of the implementation.
>> 
>> As insert and remove always lock a bucket in each table in turn, and
>> as insert only inserts in the final table, there cannot be any races
>> that are not covered by simply locking a bucket in each table in turn.
>> 
>> When an insert call reaches that last table it can be sure that there
>> is no match entry in any other table as it has searched them all, and
>> insertion never happens anywhere but in the last table.  The fact that
>> code tests for the existence of future_tbl while holding a lock on
>> the relevant bucket ensures that two threads inserting the same key
>> will make compatible decisions about which is the "last" table.
>> 
>> This simplifies the code and allows the ->rehash field to be
>> discarded.
>> We still need a way to ensure that a dead bucket_table is never
>> re-linked by rhashtable_walk_stop().  This can be achieved by
>> setting the ->size to 1.  This still allows lookup code to work (and
>> correctly not find anything) but can never happen on an active bucket
>> table (as the minimum size is 4).
>> 
>> Signed-off-by: NeilBrown 
>
> I'm not convinced this is safe.  There can be multiple tables
> in existence.  Unless you hold the lock on the first table, what
> is to stop two parallel inserters each from inserting into their
> "last" tables which may not be the same table?

The insert function must (and does) take the lock on the bucket before
testing if there is a "next" table.
If one inserter finds that it has locked the "last" table (because there
is no next) and successfully inserts, then the other inserter cannot
have locked that table yet, else it would have inserted.  When it does,
it will find what the first inserter inserted. 

Thanks,
NeilBrown

>
> Cheers,
> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt




Re: [PATCH 6/8] rhashtable: further improve stability of rhashtable_walk

2018-05-05 Thread NeilBrown
On Sat, May 05 2018, Herbert Xu wrote:

> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>> If the sequence:
>>obj = rhashtable_walk_next(iter);
>>rhashtable_walk_stop(iter);
>>rhashtable_remove_fast(ht, &obj->head, params);
>>rhashtable_walk_start(iter);
>> 
>>  races with another thread inserting or removing
>>  an object on the same hash chain, a subsequent
>>  rhashtable_walk_next() is not guaranteed to get the "next"
>>  object. It is possible that an object could be
>>  repeated, or missed.
>> 
>>  This can be made more reliable by keeping the objects in a hash chain
>>  sorted by memory address.  A subsequent rhashtable_walk_next()
>>  call can reliably find the correct position in the list, and thus
>>  find the 'next' object.
>> 
>>  It is not possible (certainly not so easy) to achieve this with an
>>  rhltable as keeping the hash chain in order is not so easy.  When the
>>  first object with a given key is removed, it is replaced in the chain
>>  with the next object with the same key, and the address of that
>>  object may not be correctly ordered.
>>  No current user of rhltable_walk_enter() calls
>>  rhashtable_walk_start() more than once, so no current code
>>  could benefit from a more reliable walk of rhltables.
>> 
>>  This patch only attempts to improve walks for rhashtables.
>>  - a new object is always inserted after the last object with a
>>smaller address, or at the start
>>  - when rhashtable_walk_start() is called, it records that 'p' is not
>>'safe', meaning that it cannot be dereferenced.  The revalidation
>>that was previously done here is moved to rhashtable_walk_next()
>>  - when rhashtable_walk_next() is called while p is not NULL and not
>>safe, it walks the chain looking for the first object with an
>>address greater than p and returns that.  If there is none, it moves
>>to the next hash chain.
>> 
>> Signed-off-by: NeilBrown 
>
> I'm a bit torn on this.  On the hand this is definitely an improvement
> over the status quo.  On the other this does not work on rhltable and
> we do have a way of fixing it for both rhashtable and rhltable.

Do we?  How could we fix it for both rhashtable and rhltable?

Thanks,
NeilBrown





Re: [PATCH 1/8] rhashtable: silence RCU warning in rhashtable_test.

2018-05-05 Thread NeilBrown
On Sat, May 05 2018, Herbert Xu wrote:

> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>> print_ht in rhashtable_test calls rht_dereference() with neither
>> RCU protection or the mutex.  This triggers an RCU warning.
>> So take the mutex to silence the warning.
>> 
>> Signed-off-by: NeilBrown 
>
> I don't think the mutex is actually needed but since we don't
> have rht_dereference_raw and this is just test code:

I agree there is no functional need for the mutex.  It is just needed to
silence the warning, so that other warnings are easier to see.

>
> Acked-by: Herbert Xu 

thanks,
NeilBrown

> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt




Re: [PATCH 4/8] rhashtable: fix race in nested_table_alloc()

2018-05-05 Thread NeilBrown
On Sat, May 05 2018, Herbert Xu wrote:

> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>> If two threads run nested_table_alloc() at the same time
>> they could both allocate a new table.
>> Best case is that one of them will never be freed, leaking memory.
>> Worst case is that entries get stored there before it leaks,
>> and they are lost from the table.
>> 
>> So use cmpxchg to detect the race and free the unused table.
>> 
>> Fixes: da20420f83ea ("rhashtable: Add nested tables")
>> Cc: sta...@vger.kernel.org # 4.11+
>> Signed-off-by: NeilBrown 
>
> What about the spinlock that's meant to be held around this
> operation?

The spinlock protects 2 or more buckets.  The nested table contains at
least 512 buckets, maybe more.
It is quite possible for two insertions into 2 different buckets to both
get their spinlock and both try to instantiate the same nested table.

Thanks,
NeilBrown




Re: [PATCH 3/8] rhashtable: use cmpxchg() to protect ->future_tbl.

2018-05-05 Thread NeilBrown
On Sat, May 05 2018, Herbert Xu wrote:

> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>> Rather than borrowing one of the bucket locks to
>> protect ->future_tbl updates, use cmpxchg().
>> This gives more freedom to change how bucket locking
>> is implemented.
>> 
>> Signed-off-by: NeilBrown 
>
> This looks nice.
>
>> -spin_unlock_bh(old_tbl->locks);
>> +rcu_assign_pointer(tmp, new_tbl);
>
> Do we need this barrier since cmpxchg is supposed to provide memory
> barrier semantics?

It's hard to find documentation even for what cmpxchg() is meant to do, let
alone what barriers it provides, but there does seem to be something
hidden in Documentation/core-api/atomic_ops.rst which suggests full
barrier semantics if the comparison succeeds.  I'll replace the
rcu_assign_pointer with a comment saying why it isn't needed.
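
i.e. the attach path would end up as just (roughly - the exact comment
wording still to be decided):

   /* cmpxchg() provides the required ordering when it succeeds, so no
    * explicit rcu_assign_pointer()/barrier is needed before publishing
    * new_tbl.
    */
   if (cmpxchg(&old_tbl->future_tbl, NULL, new_tbl) != NULL)
           return -EEXIST;

   return 0;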

Thanks,
NeilBrown

>
>> +if (cmpxchg(&old_tbl->future_tbl, NULL, tmp) != NULL)
>> +return -EEXIST;
>
> Thanks,
> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt




Re: [PATCH 2/8] rhashtable: remove nulls_base and related code.

2018-05-05 Thread NeilBrown
On Sat, May 05 2018, Herbert Xu wrote:

> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>> This "feature" is unused, undocumented, and untested and so
>> doesn't really belong.  If a use for the nulls marker
>> is found, all this code would need to be reviewed to
>> ensure it works as required.  It would be just as easy to
>> just add the code if/when it is needed instead.
>> 
>> This patch actually fixes a bug too.  The table resizing allows a
>> table to grow to 2^31 buckets, but the hash is truncated to 27 bits -
>> any growth beyond 2^27 is wasteful and ineffective.
>> 
>> This patch results in NULLS_MARKER(0) being used for all chains,
>> and leaves the use of rht_is_a_nulls() to test for it.
>> 
>> Signed-off-by: NeilBrown 
>
> I disagree.  This is a fundamental requirement for the use of
> rhashtable in certain networking systems such as TCP/UDP.  So
> we know that there will be a use for this.

I can see no evidence that this is required for anything, as it isn't
used and I'm fairly sure that in its current form - it cannot be used.

Based on my best guess about how you might intend to use it, I suspect
it would be simpler to store the address of the bucket head in the nulls
rather than the hash and a magic number.  This would make it just as
easy to detect when a search reaches the end of the wrong chain, which I
presume is the purpose.
I would find that useful myself - if the search would repeat when that
happened - as I could then use SLAB_TYPESAFE_BY_RCU.

Were we to take this approach, all the code I've removed here would
still need to be removed.

>
> As to the bug fix, please separate it out of the patch and resubmit.

I don't know how to do that.  I don't know what is safe to change
without "breaking" the nulls_base code because that code is undocumented and
unused, so unmaintainable.
In general the kernel has, I believe, a policy against keeping unused
interfaces.  While that isn't firm and universal, it seems to apply
particularly well to unusable interfaces.

Thanks,
NeilBrown


>
> Thanks,
> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt




[PATCH 0/8] Assorted rhashtable fixes and cleanups

2018-05-03 Thread NeilBrown
This series contains some bugfixes, mostly minor though one
is worthy of a stable backport I think - tagged with Fixes and Cc: stable.

Then there are improvements to walking, which have been discussed
to some degree already.
Finally a code simplification which I think is correct...

Thanks,
NeilBrown

---

NeilBrown (8):
  rhashtable: silence RCU warning in rhashtable_test.
  rhashtable: remove nulls_base and related code.
  rhashtable: use cmpxchg() to protect ->future_tbl.
  rhashtable: fix race in nested_table_alloc()
  rhashtable: remove rhashtable_walk_peek()
  rhashtable: further improve stability of rhashtable_walk
  rhashtable: add rhashtable_walk_prev()
  rhashtable: don't hold lock on first table throughout insertion.


 include/linux/rhashtable.h |   61 +++--
 lib/rhashtable.c   |  202 +---
 lib/test_rhashtable.c  |8 +-
 3 files changed, 113 insertions(+), 158 deletions(-)

--
Signature



[PATCH 1/8] rhashtable: silence RCU warning in rhashtable_test.

2018-05-03 Thread NeilBrown
print_ht in rhashtable_test calls rht_dereference() with neither
RCU protection or the mutex.  This triggers an RCU warning.
So take the mutex to silence the warning.

Signed-off-by: NeilBrown 
---
 lib/test_rhashtable.c |3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index f4000c137dbe..bf92b7aa2a49 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -499,6 +499,8 @@ static unsigned int __init print_ht(struct rhltable *rhlt)
unsigned int i, cnt = 0;
 
ht = &rhlt->ht;
+   /* Take the mutex to avoid RCU warning */
+   mutex_lock(&ht->mutex);
tbl = rht_dereference(ht->tbl, ht);
for (i = 0; i < tbl->size; i++) {
struct rhash_head *pos, *next;
@@ -532,6 +534,7 @@ static unsigned int __init print_ht(struct rhltable *rhlt)
}
}
printk(KERN_ERR "\n ht: %s\n-\n", buff);
+   mutex_unlock(&ht->mutex);
 
return cnt;
 }




[PATCH 2/8] rhashtable: remove nulls_base and related code.

2018-05-03 Thread NeilBrown
This "feature" is unused, undocumented, and untested and so
doesn't really belong.  If a use for the nulls marker
is found, all this code would need to be reviewed to
ensure it works as required.  It would be just as easy to
just add the code if/when it is needed instead.

This patch actually fixes a bug too.  The table resizing allows a
table to grow to 2^31 buckets, but the hash is truncated to 27 bits -
any growth beyond 2^27 is wasteful and ineffective.

This patch results in NULLS_MARKER(0) being used for all chains,
and leaves the use of rht_is_a_nulls() to test for it.

Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |   35 +++
 lib/rhashtable.c   |8 
 lib/test_rhashtable.c  |5 +
 3 files changed, 4 insertions(+), 44 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 4e1f535c2034..8822924dd05a 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -29,25 +29,8 @@
 
 /*
  * The end of the chain is marked with a special nulls marks which has
- * the following format:
- *
- * +---+-+-+
- * | Base  |  Hash   |1|
- * +---+-+-+
- *
- * Base (4 bits) : Reserved to distinguish between multiple tables.
- * Specified via &struct rhashtable_params.nulls_base.
- * Hash (27 bits): Full hash (unmasked) of first element added to bucket
- * 1 (1 bit) : Nulls marker (always set)
- *
- * The remaining bits of the next pointer remain unused for now.
+ * the least significant bit set.
  */
-#define RHT_BASE_BITS  4
-#define RHT_HASH_BITS  27
-#define RHT_BASE_SHIFT RHT_HASH_BITS
-
-/* Base bits plus 1 bit for nulls marker */
-#define RHT_HASH_RESERVED_SPACE(RHT_BASE_BITS + 1)
 
 /* Maximum chain length before rehash
  *
@@ -129,7 +112,6 @@ struct rhashtable;
  * @min_size: Minimum size while shrinking
  * @locks_mul: Number of bucket locks to allocate per cpu (default: 32)
  * @automatic_shrinking: Enable automatic shrinking of tables
- * @nulls_base: Base value to generate nulls marker
  * @hashfn: Hash function (default: jhash2 if !(key_len % 4), or jhash)
  * @obj_hashfn: Function to hash object
  * @obj_cmpfn: Function to compare key with object
@@ -143,7 +125,6 @@ struct rhashtable_params {
u16 min_size;
boolautomatic_shrinking;
u8  locks_mul;
-   u32 nulls_base;
rht_hashfn_thashfn;
rht_obj_hashfn_tobj_hashfn;
rht_obj_cmpfn_t obj_cmpfn;
@@ -210,24 +191,14 @@ struct rhashtable_iter {
bool end_of_table;
 };
 
-static inline unsigned long rht_marker(const struct rhashtable *ht, u32 hash)
-{
-   return NULLS_MARKER(ht->p.nulls_base + hash);
-}
-
 #define INIT_RHT_NULLS_HEAD(ptr, ht, hash) \
-   ((ptr) = (typeof(ptr)) rht_marker(ht, hash))
+   ((ptr) = (typeof(ptr)) NULLS_MARKER(0))
 
 static inline bool rht_is_a_nulls(const struct rhash_head *ptr)
 {
return ((unsigned long) ptr & 1);
 }
 
-static inline unsigned long rht_get_nulls_value(const struct rhash_head *ptr)
-{
-   return ((unsigned long) ptr) >> 1;
-}
-
 static inline void *rht_obj(const struct rhashtable *ht,
const struct rhash_head *he)
 {
@@ -237,7 +208,7 @@ static inline void *rht_obj(const struct rhashtable *ht,
 static inline unsigned int rht_bucket_index(const struct bucket_table *tbl,
unsigned int hash)
 {
-   return (hash >> RHT_HASH_RESERVED_SPACE) & (tbl->size - 1);
+   return hash & (tbl->size - 1);
 }
 
 static inline unsigned int rht_key_get_hash(struct rhashtable *ht,
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 9427b5766134..4a3f94e8e8a6 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -994,7 +994,6 @@ static u32 rhashtable_jhash2(const void *key, u32 length, 
u32 seed)
  * .key_offset = offsetof(struct test_obj, key),
  * .key_len = sizeof(int),
  * .hashfn = jhash,
- * .nulls_base = (1U << RHT_BASE_SHIFT),
  * };
  *
  * Configuration Example 2: Variable length keys
@@ -1028,9 +1027,6 @@ int rhashtable_init(struct rhashtable *ht,
(params->obj_hashfn && !params->obj_cmpfn))
return -EINVAL;
 
-   if (params->nulls_base && params->nulls_base < (1U << RHT_BASE_SHIFT))
-   return -EINVAL;
-
memset(ht, 0, sizeof(*ht));
mutex_init(&ht->mutex);
spin_lock_init(&ht->lock);
@@ -1095,10 +1091,6 @@ int rhltable_init(struct rhltable *hlt, const struct 
rhashtable_params *params)
 {
int err;
 
-   /* No rhlist NULLs marking for now. */
-  

[PATCH 3/8] rhashtable: use cmpxchg() to protect ->future_tbl.

2018-05-03 Thread NeilBrown
Rather than borrowing one of the bucket locks to
protect ->future_tbl updates, use cmpxchg().
This gives more freedom to change how bucket locking
is implemented.

Signed-off-by: NeilBrown 
---
 lib/rhashtable.c |   17 ++---
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 4a3f94e8e8a6..b73afe1dec7e 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -298,21 +298,16 @@ static int rhashtable_rehash_attach(struct rhashtable *ht,
struct bucket_table *old_tbl,
struct bucket_table *new_tbl)
 {
-   /* Protect future_tbl using the first bucket lock. */
-   spin_lock_bh(old_tbl->locks);
-
-   /* Did somebody beat us to it? */
-   if (rcu_access_pointer(old_tbl->future_tbl)) {
-   spin_unlock_bh(old_tbl->locks);
-   return -EEXIST;
-   }
-
/* Make insertions go into the new, empty table right away. Deletions
 * and lookups will be attempted in both tables until we synchronize.
+* The use of 'tmp' is simply to ensure we get the required memory
+* barriers before the cmpxchg().
 */
-   rcu_assign_pointer(old_tbl->future_tbl, new_tbl);
+   struct bucket_table *tmp;
 
-   spin_unlock_bh(old_tbl->locks);
+   rcu_assign_pointer(tmp, new_tbl);
+   if (cmpxchg(&old_tbl->future_tbl, NULL, tmp) != NULL)
+   return -EEXIST;
 
return 0;
 }




[PATCH 6/8] rhashtable: further improve stability of rhashtable_walk

2018-05-03 Thread NeilBrown
If the sequence:
   obj = rhashtable_walk_next(iter);
   rhashtable_walk_stop(iter);
   rhashtable_remove_fast(ht, &obj->head, params);
   rhashtable_walk_start(iter);

 races with another thread inserting or removing
 an object on the same hash chain, a subsequent
 rhashtable_walk_next() is not guaranteed to get the "next"
 object. It is possible that an object could be
 repeated, or missed.

 This can be made more reliable by keeping the objects in a hash chain
 sorted by memory address.  A subsequent rhashtable_walk_next()
 call can reliably find the correct position in the list, and thus
 find the 'next' object.

 It is not possible (certainly not so easy) to achieve this with an
 rhltable as keeping the hash chain in order is not so easy.  When the
 first object with a given key is removed, it is replaced in the chain
 with the next object with the same key, and the address of that
 object may not be correctly ordered.
 No current user of rhltable_walk_enter() calls
 rhashtable_walk_start() more than once, so no current code
 could benefit from a more reliable walk of rhltables.

 This patch only attempts to improve walks for rhashtables.
 - a new object is always inserted after the last object with a
   smaller address, or at the start
 - when rhashtable_walk_start() is called, it records that 'p' is not
   'safe', meaning that it cannot be dereferenced.  The revalidation
   that was previously done here is moved to rhashtable_walk_next()
 - when rhashtable_walk_next() is called while p is not NULL and not
   safe, it walks the chain looking for the first object with an
   address greater than p and returns that.  If there is none, it moves
   to the next hash chain.

Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |   11 +-
 lib/rhashtable.c   |   82 
 2 files changed, 62 insertions(+), 31 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 5091abf975a1..20684a451cb0 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -188,6 +188,7 @@ struct rhashtable_iter {
struct rhashtable_walker walker;
unsigned int slot;
unsigned int skip;
+   bool p_is_unsafe;
bool end_of_table;
 };
 
@@ -737,7 +738,12 @@ static inline void *__rhashtable_insert_fast(
(params.obj_cmpfn ?
 params.obj_cmpfn(&arg, rht_obj(ht, head)) :
 rhashtable_compare(&arg, rht_obj(ht, head {
-   pprev = &head->next;
+   if (rhlist) {
+   pprev = &head->next;
+   } else {
+   if (head < obj)
+   pprev = &head->next;
+   }
continue;
}
 
@@ -1233,7 +1239,8 @@ static inline int rhashtable_walk_init(struct rhashtable 
*ht,
  * Note that if you restart a walk after rhashtable_walk_stop you
  * may see the same object twice.  Also, you may miss objects if
  * there are removals in between rhashtable_walk_stop and the next
- * call to rhashtable_walk_start.
+ * call to rhashtable_walk_start.  Note that this is different to
+ * rhashtable_walk_enter() which misses objects.
  *
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 83f5d1ebf452..038c4156b66a 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -234,6 +234,7 @@ static int rhashtable_rehash_one(struct rhashtable *ht, 
unsigned int old_hash)
struct bucket_table *new_tbl = rhashtable_last_table(ht,
rht_dereference_rcu(old_tbl->future_tbl, ht));
struct rhash_head __rcu **pprev = rht_bucket_var(old_tbl, old_hash);
+   struct rhash_head __rcu **inspos;
int err = -EAGAIN;
struct rhash_head *head, *next, *entry;
spinlock_t *new_bucket_lock;
@@ -262,12 +263,15 @@ static int rhashtable_rehash_one(struct rhashtable *ht, 
unsigned int old_hash)
new_bucket_lock = rht_bucket_lock(new_tbl, new_hash);
 
spin_lock_nested(new_bucket_lock, SINGLE_DEPTH_NESTING);
-   head = rht_dereference_bucket(new_tbl->buckets[new_hash],
- new_tbl, new_hash);
-
+   inspos = &new_tbl->buckets[new_hash];
+   head = rht_dereference_bucket(*inspos, new_tbl, new_hash);
+   while (!rht_is_a_nulls(head) && head < entry) {
+   inspos = &head->next;
+   head = rht_dereference_bucket(*inspos, new_tbl, new_hash);
+   }
RCU_INIT_POINTER(entry->next, head);
 
-   rcu_assign_pointer(new_tbl->buckets[new_hash], entry);
+   rcu_assign_pointer(*inspos, entry);
spin_unlock(new_bucket_lock);
 
rcu_ass

[PATCH 4/8] rhashtable: fix race in nested_table_alloc()

2018-05-03 Thread NeilBrown
If two threads run nested_table_alloc() at the same time
they could both allocate a new table.
Best case is that one of them will never be freed, leaking memory.
Worst case is that entries get stored there before it leaks,
and they are lost from the table.

So use cmpxchg to detect the race and free the unused table.

Fixes: da20420f83ea ("rhashtable: Add nested tables")
Cc: sta...@vger.kernel.org # 4.11+
Signed-off-by: NeilBrown 
---
 lib/rhashtable.c |   10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index b73afe1dec7e..114e6090228a 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -119,6 +119,7 @@ static union nested_table *nested_table_alloc(struct 
rhashtable *ht,
  unsigned int nhash)
 {
union nested_table *ntbl;
+   union nested_table *tmp;
int i;
 
ntbl = rcu_dereference(*prev);
@@ -133,9 +134,12 @@ static union nested_table *nested_table_alloc(struct 
rhashtable *ht,
(i << shifted) | nhash);
}
 
-   rcu_assign_pointer(*prev, ntbl);
-
-   return ntbl;
+   rcu_assign_pointer(tmp, ntbl);
+   if (cmpxchg(prev, NULL, tmp) == NULL)
+   return tmp;
+   /* Raced with another thread. */
+   kfree(ntbl);
+   return rcu_dereference(*prev);
 }
 
 static struct bucket_table *nested_bucket_table_alloc(struct rhashtable *ht,




[PATCH 7/8] rhashtable: add rhashtable_walk_prev()

2018-05-03 Thread NeilBrown
rhashtable_walk_prev() returns the object returned by
the previous rhashtable_walk_next(), providing it is still in the
table (or was during this grace period).
This works even if rhashtable_walk_stop() and rhashtable_walk_start()
have been called since the last rhashtable_walk_next().

If there have been no calls to rhashtable_walk_next(), or if the
object is gone from the table, then NULL is returned.

This can usefully be used in a seq_file ->start() function.
If the pos is the same as was returned by the last ->next() call,
then rhashtable_walk_prev() can be used to re-establish the
current location in the table.  If it returns NULL, then
rhashtable_walk_next() should be used.
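
For illustration, a ->start() function might then look like this (a sketch
only, not part of this patch; 'struct my_iter' and 'last_pos' are
hypothetical):

   static void *my_seq_start(struct seq_file *seq, loff_t *pos)
           __acquires(RCU)
   {
           struct my_iter *it = seq->private;

           rhashtable_walk_start(&it->hti);
           if (*pos && *pos == it->last_pos) {
                   /* resume exactly where the last ->next() left off */
                   void *obj = rhashtable_walk_prev(&it->hti);

                   if (obj)
                           return obj;
           }
           /* first call, or the old object is gone - move forward */
           return rhashtable_walk_next(&it->hti);
   }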

Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |1 +
 lib/rhashtable.c   |   31 +++
 2 files changed, 32 insertions(+)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 20684a451cb0..82d061ff96d6 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -367,6 +367,7 @@ static inline void rhashtable_walk_start(struct 
rhashtable_iter *iter)
 }
 
 void *rhashtable_walk_next(struct rhashtable_iter *iter);
+void *rhashtable_walk_prev(struct rhashtable_iter *iter);
 void rhashtable_walk_stop(struct rhashtable_iter *iter) __releases(RCU);
 
 void rhashtable_free_and_destroy(struct rhashtable *ht,
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 038c4156b66a..d0267e37e7e1 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -921,6 +921,37 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
 
+/**
+ * rhashtable_walk_prev - Return the previously returned object, if available
+ * @iter:  Hash table iterator
+ *
+ * If rhashtable_walk_next() has previously been called and the object
+ * it returned is still in the hash table, that object is returned again,
+ * otherwise %NULL is returned.
+ *
+ * If the recent rhashtable_walk_next() call was since the most recent
+ * rhashtable_walk_start() call then the returned object may not, strictly
+ * speaking, still be in the table.  It will be safe to dereference.
+ *
+ * Note that the iterator is not changed and in particular it does not
+ * step backwards.
+ */
+void *rhashtable_walk_prev(struct rhashtable_iter *iter)
+{
+   struct rhashtable *ht = iter->ht;
+   struct rhash_head *p = iter->p;
+
+   if (!p)
+   return NULL;
+   if (!iter->p_is_unsafe || ht->rhlist)
+   return p;
+   rht_for_each_rcu(p, iter->walker.tbl, iter->slot)
+   if (p == iter->p)
+   return p;
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(rhashtable_walk_prev);
+
 /**
  * rhashtable_walk_stop - Finish a hash table walk
  * @iter:  Hash table iterator




[PATCH 8/8] rhashtable: don't hold lock on first table throughout insertion.

2018-05-03 Thread NeilBrown
rhashtable_try_insert() currently holds a lock on the bucket in
the first table, while also locking buckets in subsequent tables.
This is unnecessary and looks like a hold-over from some earlier
version of the implementation.

As insert and remove always lock a bucket in each table in turn, and
as insert only inserts in the final table, there cannot be any races
that are not covered by simply locking a bucket in each table in turn.

When an insert call reaches that last table it can be sure that there
is no match entry in any other table as it has searched them all, and
insertion never happens anywhere but in the last table.  The fact that
code tests for the existence of future_tbl while holding a lock on
the relevant bucket ensures that two threads inserting the same key
will make compatible decisions about which is the "last" table.

This simplifies the code and allows the ->rehash field to be
discarded.
We still need a way to ensure that a dead bucket_table is never
re-linked by rhashtable_walk_stop().  This can be achieved by
setting the ->size to 1.  This still allows lookup code to work (and
correctly not find anything) but can never happen on an active bucket
table (as the minimum size is 4).

Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |   13 -
 lib/rhashtable.c   |   42 ++
 2 files changed, 10 insertions(+), 45 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 82d061ff96d6..0529925af41d 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -73,7 +73,6 @@ struct rhlist_head {
 struct bucket_table {
unsigned intsize;
unsigned intnest;
-   unsigned intrehash;
u32 hash_rnd;
unsigned intlocks_mask;
spinlock_t  *locks;
@@ -885,12 +884,6 @@ static inline int rhltable_insert(
  * @obj:   pointer to hash head inside object
  * @params:hash table parameters
  *
- * Locks down the bucket chain in both the old and new table if a resize
- * is in progress to ensure that writers can't remove from the old table
- * and can't insert to the new table during the atomic operation of search
- * and insertion. Searches for duplicates in both the old and new table if
- * a resize is in progress.
- *
  * This lookup function may only be used for fixed key hash table (key_len
  * parameter set). It will BUG() if used inappropriately.
  *
@@ -946,12 +939,6 @@ static inline void *rhashtable_lookup_get_insert_fast(
  * @obj:   pointer to hash head inside object
  * @params:hash table parameters
  *
- * Locks down the bucket chain in both the old and new table if a resize
- * is in progress to ensure that writers can't remove from the old table
- * and can't insert to the new table during the atomic operation of search
- * and insertion. Searches for duplicates in both the old and new table if
- * a resize is in progress.
- *
  * Lookups may occur in parallel with hashtable mutations and resizing.
  *
  * Will trigger an automatic deferred table resizing if residency in the
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index d0267e37e7e1..4f7a7423a675 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -293,10 +293,9 @@ static int rhashtable_rehash_chain(struct rhashtable *ht,
while (!(err = rhashtable_rehash_one(ht, old_hash)))
;
 
-   if (err == -ENOENT) {
-   old_tbl->rehash++;
+   if (err == -ENOENT)
err = 0;
-   }
+
spin_unlock_bh(old_bucket_lock);
 
return err;
@@ -345,6 +344,9 @@ static int rhashtable_rehash_table(struct rhashtable *ht)
spin_lock(&ht->lock);
list_for_each_entry(walker, &old_tbl->walkers, list)
walker->tbl = NULL;
+
+   /* Ensure rhashtable_walk_stop() doesn't relink this table */
+   old_tbl->size = 1;
spin_unlock(&ht->lock);
 
/* Wait for readers. All new readers will see the new
@@ -597,36 +599,14 @@ static void *rhashtable_try_insert(struct rhashtable *ht, 
const void *key,
struct bucket_table *new_tbl;
struct bucket_table *tbl;
unsigned int hash;
-   spinlock_t *lock;
void *data;
 
-   tbl = rcu_dereference(ht->tbl);
-
-   /* All insertions must grab the oldest table containing
-* the hashed bucket that is yet to be rehashed.
-*/
-   for (;;) {
-   hash = rht_head_hashfn(ht, tbl, obj, ht->p);
-   lock = rht_bucket_lock(tbl, hash);
-   spin_lock_bh(lock);
-
-   if (tbl->rehash <= hash)
-   break;
-
-   spin_unlock_bh(lock);
-   tbl = rcu_dereference(tbl->future_tbl);
-   }
-
-   data = rhashtable_lookup_one(ht, tbl, hash, key, obj);
-   

[PATCH 5/8] rhashtable: remove rhashtable_walk_peek()

2018-05-03 Thread NeilBrown
This function has a somewhat confused behavior that is not properly
described by the documentation.
Sometimes it returns the previous object, sometimes it returns the
next one.
Sometimes it changes the iterator, sometimes it doesn't.

This function is not currently used and is not worth keeping, so
remove it.

Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |1 -
 lib/rhashtable.c   |   34 --
 2 files changed, 35 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 8822924dd05a..5091abf975a1 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -366,7 +366,6 @@ static inline void rhashtable_walk_start(struct 
rhashtable_iter *iter)
 }
 
 void *rhashtable_walk_next(struct rhashtable_iter *iter);
-void *rhashtable_walk_peek(struct rhashtable_iter *iter);
 void rhashtable_walk_stop(struct rhashtable_iter *iter) __releases(RCU);
 
 void rhashtable_free_and_destroy(struct rhashtable *ht,
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 114e6090228a..83f5d1ebf452 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -897,40 +897,6 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
 
-/**
- * rhashtable_walk_peek - Return the next object but don't advance the iterator
- * @iter:  Hash table iterator
- *
- * Returns the next object or NULL when the end of the table is reached.
- *
- * Returns -EAGAIN if resize event occurred.  Note that the iterator
- * will rewind back to the beginning and you may continue to use it.
- */
-void *rhashtable_walk_peek(struct rhashtable_iter *iter)
-{
-   struct rhlist_head *list = iter->list;
-   struct rhashtable *ht = iter->ht;
-   struct rhash_head *p = iter->p;
-
-   if (p)
-   return rht_obj(ht, ht->rhlist ? &list->rhead : p);
-
-   /* No object found in current iter, find next one in the table. */
-
-   if (iter->skip) {
-   /* A nonzero skip value points to the next entry in the table
-* beyond that last one that was found. Decrement skip so
-* we find the current value. __rhashtable_walk_find_next
-* will restore the original value of skip assuming that
-* the table hasn't changed.
-*/
-   iter->skip--;
-   }
-
-   return __rhashtable_walk_find_next(iter);
-}
-EXPORT_SYMBOL_GPL(rhashtable_walk_peek);
-
 /**
  * rhashtable_walk_stop - Finish a hash table walk
  * @iter:  Hash table iterator




[PATCH 1/4] rhashtable: remove outdated comments about grow_decision etc

2018-04-23 Thread NeilBrown
grow_decision and shrink_decision no longer exist, so remove
the remaining references to them.

Acked-by: Herbert Xu 
Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |   33 ++---
 1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 1f8ad121eb43..87d443a5b11d 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -836,9 +836,8 @@ static inline void *__rhashtable_insert_fast(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhashtable_insert_fast(
struct rhashtable *ht, struct rhash_head *obj,
@@ -866,9 +865,8 @@ static inline int rhashtable_insert_fast(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhltable_insert_key(
struct rhltable *hlt, const void *key, struct rhlist_head *list,
@@ -890,9 +888,8 @@ static inline int rhltable_insert_key(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhltable_insert(
struct rhltable *hlt, struct rhlist_head *list,
@@ -922,9 +919,8 @@ static inline int rhltable_insert(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhashtable_lookup_insert_fast(
struct rhashtable *ht, struct rhash_head *obj,
@@ -981,9 +977,8 @@ static inline void *rhashtable_lookup_get_insert_fast(
  *
  * Lookups may occur in parallel with hashtable mutations and resizing.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  *
  * Returns zero on success.
  */
@@ -1134,8 +1129,8 @@ static inline int __rhashtable_remove_fast(
  * walk the bucket chain upon removal. The removal operation is thus
  * considerable slow if the hash table is not correctly sized.
  *
- * Will automatically shrink the table via rhashtable_expand() if the
- * shrink_decision function specified at rhashtable_init() returns true.
+ * Will automatically shrink the table if permitted when residency drops
+ * below 30%.
  *
  * Returns zero on success, -ENOENT if the entry could not be found.
  */
@@ -1156,8 +1151,8 @@ static inline int rhashtable_remove_fast(
  * walk the bucket chain upon removal. The removal operation is thus
  * considerable slow if the hash table is not correctly sized.
  *
- * Will automatically shrink the table via rhashtable_expand() if the
- * shrink_decision function specified at rhashtable_init() returns true.
+ * Will automatically shrink the table if permitted when residency drops
+ * below 30%
  *
  * Returns zero on success, -ENOENT if the entry could not be found.
  */




[PATCH 0/4] A few rhashtables cleanups

2018-04-23 Thread NeilBrown
2 patches fix documentation
1 fixes a bug in rhashtable_walk_start()
1 improves rhashtable_walk stability.

All reviewed and Acked.

Thanks,
NeilBrown


---

NeilBrown (4):
  rhashtable: remove outdated comments about grow_decision etc
  rhashtable: Revise incorrect comment on r{hl,hash}table_walk_enter()
  rhashtable: reset iter when rhashtable_walk_start sees new table
  rhashtable: improve rhashtable_walk stability when stop/start used.


 include/linux/rhashtable.h |   38 +++--
 lib/rhashtable.c   |   51 
 2 files changed, 63 insertions(+), 26 deletions(-)

--
Signature



[PATCH 4/4] rhashtable: improve rhashtable_walk stability when stop/start used.

2018-04-23 Thread NeilBrown
When a walk of an rhashtable is interrupted with rhashtable_walk_stop()
and then rhashtable_walk_start(), the location to restart from is based
on a 'skip' count in the current hash chain, and this can be incorrect
if insertions or deletions have happened.  This does not happen when
the walk is not stopped and started as iter->p is a placeholder which
is safe to use while holding the RCU read lock.

In rhashtable_walk_start() we can revalidate that 'p' is still in the
same hash chain.  If it isn't then the current method is still used.

With this patch, if a rhashtable walker ensures that the current
object remains in the table over a stop/start period (possibly by
elevating the reference count if that is sufficient), it can be sure
that a walk will not miss objects that were in the hashtable for the
whole time of the walk.

rhashtable_walk_start() may not find the object even though it is
still in the hashtable if a rehash has moved it to a new table.  In
this case it will (eventually) get -EAGAIN and will need to proceed
through the whole table again to be sure to see everything at least
once.

Acked-by: Herbert Xu 
Signed-off-by: NeilBrown 
---
 lib/rhashtable.c |   44 +---
 1 file changed, 41 insertions(+), 3 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 81edf1ab38ab..9427b5766134 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -727,6 +727,7 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
__acquires(RCU)
 {
struct rhashtable *ht = iter->ht;
+   bool rhlist = ht->rhlist;
 
rcu_read_lock();
 
@@ -735,13 +736,52 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
list_del(&iter->walker.list);
spin_unlock(&ht->lock);
 
-   if (!iter->walker.tbl && !iter->end_of_table) {
+   if (iter->end_of_table)
+   return 0;
+   if (!iter->walker.tbl) {
iter->walker.tbl = rht_dereference_rcu(ht->tbl, ht);
iter->slot = 0;
iter->skip = 0;
return -EAGAIN;
}
 
+   if (iter->p && !rhlist) {
+   /*
+* We need to validate that 'p' is still in the table, and
+* if so, update 'skip'
+*/
+   struct rhash_head *p;
+   int skip = 0;
+   rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
+   skip++;
+   if (p == iter->p) {
+   iter->skip = skip;
+   goto found;
+   }
+   }
+   iter->p = NULL;
+   } else if (iter->p && rhlist) {
+   /* Need to validate that 'list' is still in the table, and
+* if so, update 'skip' and 'p'.
+*/
+   struct rhash_head *p;
+   struct rhlist_head *list;
+   int skip = 0;
+   rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
+   for (list = container_of(p, struct rhlist_head, rhead);
+list;
+list = rcu_dereference(list->next)) {
+   skip++;
+   if (list == iter->list) {
+   iter->p = p;
+   iter->skip = skip;
+   goto found;
+   }
+   }
+   }
+   iter->p = NULL;
+   }
+found:
return 0;
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_start_check);
@@ -917,8 +957,6 @@ void rhashtable_walk_stop(struct rhashtable_iter *iter)
iter->walker.tbl = NULL;
spin_unlock(&ht->lock);
 
-   iter->p = NULL;
-
 out:
rcu_read_unlock();
 }




[PATCH 3/4] rhashtable: reset iter when rhashtable_walk_start sees new table

2018-04-23 Thread NeilBrown
The documentation claims that when rhashtable_walk_start_check()
detects a resize event, it will rewind back to the beginning
of the table.  This is not true.  We need to set ->slot and
->skip to be zero for it to be true.

Acked-by: Herbert Xu 
Signed-off-by: NeilBrown 
---
 lib/rhashtable.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 6d490f51174e..81edf1ab38ab 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -737,6 +737,8 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
 
if (!iter->walker.tbl && !iter->end_of_table) {
iter->walker.tbl = rht_dereference_rcu(ht->tbl, ht);
+   iter->slot = 0;
+   iter->skip = 0;
return -EAGAIN;
}
 




[PATCH 2/4] rhashtable: Revise incorrect comment on r{hl, hash}table_walk_enter()

2018-04-23 Thread NeilBrown
Neither rhashtable_walk_enter() nor rhltable_walk_enter() sleeps, though
they do take a spinlock without irq protection.
So revise the comments to accurately state the contexts in which
these functions can be called.

Acked-by: Herbert Xu 
Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |5 +++--
 lib/rhashtable.c   |5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 87d443a5b11d..4e1f535c2034 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -1268,8 +1268,9 @@ static inline int rhashtable_walk_init(struct rhashtable *ht,
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
  *
- * This function may sleep so you must not call it from interrupt
- * context or with spin locks held.
+ * This function may be called from any process context, including
+ * non-preemptable context, but cannot be called from softirq or
+ * hardirq context.
  *
  * You must call rhashtable_walk_exit after this function returns.
  */
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 2b2b79974b61..6d490f51174e 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -668,8 +668,9 @@ EXPORT_SYMBOL_GPL(rhashtable_insert_slow);
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
  *
- * This function may sleep so you must not call it from interrupt
- * context or with spin locks held.
+ * This function may be called from any process context, including
+ * non-preemptable context, but cannot be called from softirq or
+ * hardirq context.
  *
  * You must call rhashtable_walk_exit after this function returns.
  */




Re: [PATCH 1/6] rhashtable: remove outdated comments about grow_decision etc

2018-04-22 Thread NeilBrown
On Wed, Apr 18 2018, David Miller wrote:

> From: NeilBrown 
> Date: Thu, 19 Apr 2018 09:09:05 +1000
>
>> On Wed, Apr 18 2018, Herbert Xu wrote:
>> 
>>> On Wed, Apr 18, 2018 at 04:47:01PM +1000, NeilBrown wrote:
>>>> grow_decision and shrink_decision no longer exist, so remove
>>>> the remaining references to them.
>>>> 
>>>> Signed-off-by: NeilBrown 
>>>
>>> Acked-by: Herbert Xu 
>> 
>> Thanks.  Is that Ack sufficient for this patch to go upstream, or is
>> there something else that I need to do?
>
> One patch being ACK'd does not release the whole series to be applied
> and the whole series will be treated as a complete unit for that
> purpose.
>
> So if discussion is holding up one patch in the series, it holds up
> the entire series.
>
> So get the entire series in acceptable condition, or submit only one
> change at a time individually and wait for that one to be accepted
> before you submit and ask for feedback on the next one.
>
> I hope that makes things clear for you.

Yes, nice and clear.  Thanks a lot!

NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH 2/6] rhashtable: remove incorrect comment on r{hl, hash}table_walk_enter()

2018-04-22 Thread NeilBrown
On Thu, Apr 19 2018, Herbert Xu wrote:

> On Thu, Apr 19, 2018 at 08:56:28AM +1000, NeilBrown wrote:
>>
>> I don't want to do that - I just want the documentation to be correct
>> (or at least, not be blatantly incorrect).  The function does not sleep,
>> and is safe to call with spin locks held.
>> Do we need to spell out when it can be called?  If so, maybe:
>> 
>>This function may be called from any process context, including
>>non-preemptable context, but cannot be called from interrupts.
>
> Just to make it perfectly clear, how about "cannot be called from
> softirq or hardirq context"? Previously the not able to sleep part
> completely ruled out any ambiguity but the new wording could confuse
> people into thinking that this can be called from softirq context
> where it would be unsafe if mixed with process context usage.
>

Sounds fair.  Could you Ack the following?  Then I'll resend all the
patches that have an ack.
I've realised that the "further improve stability of rhashtable_walk"
patch isn't actually complete, so I'll withdraw that for now.

Thanks,
NeilBrown

From e16c037398b6c057787437f3a0aaa7fd44c3bee3 Mon Sep 17 00:00:00 2001
From: NeilBrown 
Date: Mon, 16 Apr 2018 11:05:39 +1000
Subject: [PATCH] rhashtable: Revise incorrect comment on
 r{hl,hash}table_walk_enter()

Neither rhashtable_walk_enter() nor rhltable_walk_enter() sleeps, though
they do take a spinlock without irq protection.
So revise the comments to accurately state the contexts in which
these functions can be called.

Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h | 5 +++--
 lib/rhashtable.c   | 5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 87d443a5b11d..4e1f535c2034 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -1268,8 +1268,9 @@ static inline int rhashtable_walk_init(struct rhashtable *ht,
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
  *
- * This function may sleep so you must not call it from interrupt
- * context or with spin locks held.
+ * This function may be called from any process context, including
+ * non-preemptable context, but cannot be called from softirq or
+ * hardirq context.
  *
  * You must call rhashtable_walk_exit after this function returns.
  */
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 2b2b79974b61..6d490f51174e 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -668,8 +668,9 @@ EXPORT_SYMBOL_GPL(rhashtable_insert_slow);
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
  *
- * This function may sleep so you must not call it from interrupt
- * context or with spin locks held.
+ * This function may be called from any process context, including
+ * non-preemptable context, but cannot be called from softirq or
+ * hardirq context.
  *
  * You must call rhashtable_walk_exit after this function returns.
  */
-- 
2.14.0.rc0.dirty



signature.asc
Description: PGP signature


Re: [PATCH 1/6] rhashtable: remove outdated comments about grow_decision etc

2018-04-18 Thread NeilBrown
On Wed, Apr 18 2018, Herbert Xu wrote:

> On Wed, Apr 18, 2018 at 04:47:01PM +1000, NeilBrown wrote:
>> grow_decision and shrink_decision no longer exist, so remove
>> the remaining references to them.
>> 
>> Signed-off-by: NeilBrown 
>
> Acked-by: Herbert Xu 

Thanks.  Is that Ack sufficient for this patch to go upstream, or is
there something else that I need to do?

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH 6/6] rhashtable: add rhashtable_walk_prev()

2018-04-18 Thread NeilBrown
On Wed, Apr 18 2018, Herbert Xu wrote:

> On Wed, Apr 18, 2018 at 04:47:02PM +1000, NeilBrown wrote:
>> rhashtable_walk_prev() returns the object returned by
>> the previous rhashtable_walk_next(), providing it is still in the
>> table (or was during this grace period).
>> This works even if rhashtable_walk_stop() and rhashtable_walk_start()
>> have been called since the last rhashtable_walk_next().
>> 
>> If there have been no calls to rhashtable_walk_next(), or if the
>> object is gone from the table, then NULL is returned.
>> 
>> This can usefully be used in a seq_file ->start() function.
>> If the pos is the same as was returned by the last ->next() call,
>> then rhashtable_walk_prev() can be used to re-establish the
>> current location in the table.  If it returns NULL, then
>> rhashtable_walk_next() should be used.
>> 
>> Signed-off-by: NeilBrown 
>
> Can you explain the need for this function and its difference
> from the existing rhashtable_walk_peek?

The need is essentially the same as the need for
rhashtable_walk_peek().  The difference is that the actual behaviour can
be documented briefly (and so understood easily) without enumerating
multiple special cases.
rhashtable_walk_peek() is much the same as
 rhashtable_walk_prev() ?: rhashtable_walk_next()

The need arises when you need to "stop" a walk and then restart at the
same object, not the next one. i.e. the last object returned before the
"stop" wasn't acted on.
This happens with seq_file if the buffer space for ->show() is not
sufficient to format an object.  In that case, seq_file will stop() the
iteration, make more space available (either by flushing or by
reallocating) and will start again at the same location.
If the seq_file client stored the rhashtable_iter in the seq_file
private data, it can use rhashtable_walk_prev() to recover its position
if that object is still in the hash table.  If it isn't still present,
rhashtable_walk_next() can be used to get the next one.  In some cases
it can be useful for the client to know whether it got the previous one
or not.
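
For concreteness, the ->start() I have in mind would look something like
the sketch below.  It is untested, the names (my_seq_start,
struct my_iter_priv, last_pos) are made up for illustration, and it
assumes rhashtable_walk_prev() returns the previously returned object or
NULL as described above.  (Rewinding when *pos drops back to zero is
omitted for brevity.)

#include <linux/err.h>
#include <linux/rhashtable.h>
#include <linux/seq_file.h>

struct my_iter_priv {
	struct rhashtable_iter iter;
	loff_t last_pos;		/* pos of the object last returned */
};

static void *my_seq_start(struct seq_file *seq, loff_t *pos)
{
	struct my_iter_priv *priv = seq->private;
	void *obj;

	rhashtable_walk_start(&priv->iter);

	if (*pos && *pos == priv->last_pos) {
		/* ->show() ran out of buffer space last time: resume at
		 * the object we already returned, if it is still there.
		 */
		obj = rhashtable_walk_prev(&priv->iter);
		if (obj)
			return obj;
	}
	priv->last_pos = *pos;

	/* Object gone, or pos has advanced: continue with the following
	 * object, skipping the ERR_PTR(-EAGAIN) that signals a resize.
	 */
	do {
		obj = rhashtable_walk_next(&priv->iter);
	} while (IS_ERR(obj));
	return obj;
}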

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH 2/6] rhashtable: remove incorrect comment on r{hl, hash}table_walk_enter()

2018-04-18 Thread NeilBrown
On Wed, Apr 18 2018, Herbert Xu wrote:

> On Wed, Apr 18, 2018 at 04:47:01PM +1000, NeilBrown wrote:
>> Neither rhashtable_walk_enter() or rhltable_walk_enter() sleep, so
>> remove the comments which suggest that they do.
>> 
>> Signed-off-by: NeilBrown 
>> ---
>>  include/linux/rhashtable.h |3 ---
>>  lib/rhashtable.c   |3 ---
>>  2 files changed, 6 deletions(-)
>> 
>> diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
>> index 87d443a5b11d..b01d88e196c2 100644
>> --- a/include/linux/rhashtable.h
>> +++ b/include/linux/rhashtable.h
>> @@ -1268,9 +1268,6 @@ static inline int rhashtable_walk_init(struct rhashtable *ht,
>>   * For a completely stable walk you should construct your own data
>>   * structure outside the hash table.
>>   *
>> - * This function may sleep so you must not call it from interrupt
>> - * context or with spin locks held.
>
> It does a naked spin lock so even though we removed the memory
> allocation you still mustn't call it from interrupt context.
>
> Why do you need to do that anyway?

I don't want to do that - I just want the documentation to be correct
(or at least, not be blatantly incorrect).  The function does not sleep,
and is safe to call with spin locks held.
Do we need to spell out when it can be called?  If so, maybe:

   This function may be called from any process context, including
   non-preemptable context, but cannot be called from interrupts.

??

Thanks,
NeilBrown


signature.asc
Description: PGP signature


[PATCH 0/6] Assorted rhashtable improvements. RESEND

2018-04-18 Thread NeilBrown
[[ I mistyped linux-kernel the first time I sent these, so
   resending.  Please reply to this set.  Sorry - neilb ]]


Some of these have been posted before and a couple
received an Ack from Herbert, but haven't appeared in any git tree
yet.
Another (the first) has been sent but received no ack.

I've added the second patch, which removes more incorrect
documentation, and added the last two patches.

One further improves rhashtable_walk stability.
The last adds rhashtable_walk_prev(), as discussed with Herbert,
which should be useful for seq_files.
(Separately I've posted a patch to Al Viro to make seq_file even
easier to use with rhashtables, but this series does not depend
on that patch).

I don't see these patches as particularly urgent, though the third
fixes a bug that currently prevents me from allowing one rhashtable in
lustre to auto-shrink.

I previously suggested it might be good for some of these patches to
go upstream through 'staging' with the lustre patches.  I no longer
think that is necessary.  It is probably best for them to go upstream
through net or net-next.

Thanks,
NeilBrown


---

NeilBrown (6):
  rhashtable: remove outdated comments about grow_decision etc
  rhashtable: remove incorrect comment on r{hl,hash}table_walk_enter()
  rhashtable: reset iter when rhashtable_walk_start sees new table
  rhashtable: improve rhashtable_walk stability when stop/start used.
  rhashtable: further improve stability of rhashtable_walk
  rhashtable: add rhashtable_walk_prev()


 include/linux/rhashtable.h |   49 +-
 lib/rhashtable.c   |  121 +---
 2 files changed, 126 insertions(+), 44 deletions(-)

--
Signature



[PATCH 1/6] rhashtable: remove outdated comments about grow_decision etc

2018-04-18 Thread NeilBrown
grow_decision and shrink_decision no longer exist, so remove
the remaining references to them.

Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |   33 ++---
 1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 1f8ad121eb43..87d443a5b11d 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -836,9 +836,8 @@ static inline void *__rhashtable_insert_fast(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhashtable_insert_fast(
struct rhashtable *ht, struct rhash_head *obj,
@@ -866,9 +865,8 @@ static inline int rhashtable_insert_fast(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhltable_insert_key(
struct rhltable *hlt, const void *key, struct rhlist_head *list,
@@ -890,9 +888,8 @@ static inline int rhltable_insert_key(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhltable_insert(
struct rhltable *hlt, struct rhlist_head *list,
@@ -922,9 +919,8 @@ static inline int rhltable_insert(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhashtable_lookup_insert_fast(
struct rhashtable *ht, struct rhash_head *obj,
@@ -981,9 +977,8 @@ static inline void *rhashtable_lookup_get_insert_fast(
  *
  * Lookups may occur in parallel with hashtable mutations and resizing.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  *
  * Returns zero on success.
  */
@@ -1134,8 +1129,8 @@ static inline int __rhashtable_remove_fast(
  * walk the bucket chain upon removal. The removal operation is thus
  * considerable slow if the hash table is not correctly sized.
  *
- * Will automatically shrink the table via rhashtable_expand() if the
- * shrink_decision function specified at rhashtable_init() returns true.
+ * Will automatically shrink the table if permitted when residency drops
+ * below 30%.
  *
  * Returns zero on success, -ENOENT if the entry could not be found.
  */
@@ -1156,8 +1151,8 @@ static inline int rhashtable_remove_fast(
  * walk the bucket chain upon removal. The removal operation is thus
  * considerable slow if the hash table is not correctly sized.
  *
- * Will automatically shrink the table via rhashtable_expand() if the
- * shrink_decision function specified at rhashtable_init() returns true.
+ * Will automatically shrink the table if permitted when residency drops
+ * below 30%
  *
  * Returns zero on success, -ENOENT if the entry could not be found.
  */




[PATCH 2/6] rhashtable: remove incorrect comment on r{hl, hash}table_walk_enter()

2018-04-18 Thread NeilBrown
Neither rhashtable_walk_enter() nor rhltable_walk_enter() sleeps, so
remove the comments which suggest that they do.

Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |3 ---
 lib/rhashtable.c   |3 ---
 2 files changed, 6 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 87d443a5b11d..b01d88e196c2 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -1268,9 +1268,6 @@ static inline int rhashtable_walk_init(struct rhashtable *ht,
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
  *
- * This function may sleep so you must not call it from interrupt
- * context or with spin locks held.
- *
  * You must call rhashtable_walk_exit after this function returns.
  */
 static inline void rhltable_walk_enter(struct rhltable *hlt,
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 2b2b79974b61..19db8e563c40 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -668,9 +668,6 @@ EXPORT_SYMBOL_GPL(rhashtable_insert_slow);
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
  *
- * This function may sleep so you must not call it from interrupt
- * context or with spin locks held.
- *
  * You must call rhashtable_walk_exit after this function returns.
  */
 void rhashtable_walk_enter(struct rhashtable *ht, struct rhashtable_iter *iter)




[PATCH 4/6] rhashtable: improve rhashtable_walk stability when stop/start used.

2018-04-18 Thread NeilBrown
When a walk of an rhashtable is interrupted with rhashtable_walk_stop()
and then rhashtable_walk_start(), the location to restart from is based
on a 'skip' count in the current hash chain, and this can be incorrect
if insertions or deletions have happened.  This does not happen when
the walk is not stopped and started as iter->p is a placeholder which
is safe to use while holding the RCU read lock.

In rhashtable_walk_start() we can revalidate that 'p' is still in the
same hash chain.  If it isn't then the current method is still used.

With this patch, if a rhashtable walker ensures that the current
object remains in the table over a stop/start period (possibly by
elevating the reference count if that is sufficient), it can be sure
that a walk will not miss objects that were in the hashtable for the
whole time of the walk.

rhashtable_walk_start() may not find the object even though it is
still in the hashtable if a rehash has moved it to a new table.  In
this case it will (eventually) get -EAGAIN and will need to proceed
through the whole table again to be sure to see everything at least
once.

Acked-by: Herbert Xu 
Signed-off-by: NeilBrown 
---
 lib/rhashtable.c |   44 +---
 1 file changed, 41 insertions(+), 3 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 28e1be9f681b..16cde54a553b 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -723,6 +723,7 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
__acquires(RCU)
 {
struct rhashtable *ht = iter->ht;
+   bool rhlist = ht->rhlist;
 
rcu_read_lock();
 
@@ -731,13 +732,52 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
list_del(&iter->walker.list);
spin_unlock(&ht->lock);
 
-   if (!iter->walker.tbl && !iter->end_of_table) {
+   if (iter->end_of_table)
+   return 0;
+   if (!iter->walker.tbl) {
iter->walker.tbl = rht_dereference_rcu(ht->tbl, ht);
iter->slot = 0;
iter->skip = 0;
return -EAGAIN;
}
 
+   if (iter->p && !rhlist) {
+   /*
+* We need to validate that 'p' is still in the table, and
+* if so, update 'skip'
+*/
+   struct rhash_head *p;
+   int skip = 0;
+   rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
+   skip++;
+   if (p == iter->p) {
+   iter->skip = skip;
+   goto found;
+   }
+   }
+   iter->p = NULL;
+   } else if (iter->p && rhlist) {
+   /* Need to validate that 'list' is still in the table, and
+* if so, update 'skip' and 'p'.
+*/
+   struct rhash_head *p;
+   struct rhlist_head *list;
+   int skip = 0;
+   rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
+   for (list = container_of(p, struct rhlist_head, rhead);
+list;
+list = rcu_dereference(list->next)) {
+   skip++;
+   if (list == iter->list) {
+   iter->p = p;
+   iter->skip = skip;
+   goto found;
+   }
+   }
+   }
+   iter->p = NULL;
+   }
+found:
return 0;
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_start_check);
@@ -913,8 +953,6 @@ void rhashtable_walk_stop(struct rhashtable_iter *iter)
iter->walker.tbl = NULL;
spin_unlock(&ht->lock);
 
-   iter->p = NULL;
-
 out:
rcu_read_unlock();
 }




[PATCH 3/6] rhashtable: reset iter when rhashtable_walk_start sees new table

2018-04-18 Thread NeilBrown
The documentation claims that when rhashtable_walk_start_check()
detects a resize event, it will rewind back to the beginning
of the table.  This is not true.  We need to set ->slot and
->skip to be zero for it to be true.

Acked-by: Herbert Xu 
Signed-off-by: NeilBrown 
---
 lib/rhashtable.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 19db8e563c40..28e1be9f681b 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -733,6 +733,8 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
 
if (!iter->walker.tbl && !iter->end_of_table) {
iter->walker.tbl = rht_dereference_rcu(ht->tbl, ht);
+   iter->slot = 0;
+   iter->skip = 0;
return -EAGAIN;
}
 




[PATCH 5/6] rhashtable: further improve stability of rhashtable_walk

2018-04-18 Thread NeilBrown
If the sequence:
   obj = rhashtable_walk_next(iter);
   rhashtable_walk_stop(iter);
   rhashtable_remove_fast(ht, &obj->head, params);
   rhashtable_walk_start(iter);

 races with another thread inserting or removing
 an object on the same hash chain, a subsequent
 rhashtable_walk_next() is not guaranteed to get the "next"
 object. It is possible that an object could be
 repeated, or missed.

 This can be made more reliable by keeping the objects in a hash chain
 sorted by memory address.  A subsequent rhashtable_walk_next()
 call can reliably find the correct position in the list, and thus
 find the 'next' object.

 It is not possible (or at least not easy) to achieve this with an
 rhltable, because keeping its hash chains in order is harder.  When the
 first object with a given key is removed, it is replaced in the chain
 with the next object with the same key, and the address of that
 object may not be correctly ordered.
 No current user of rhltable_walk_enter() calls
 rhashtable_walk_start() more than once, so no current code
 could benefit from a more reliable walk of rhltables.

 This patch only attempts to improve walks for rhashtables.
 - a new object is always inserted after the last object with a
   smaller address, or at the start
 - when rhashtable_walk_start() is called, it records that 'p' is not
   'safe', meaning that it cannot be dereferenced.  The revalidation
   that was previously done here is moved to rhashtable_walk_next()
 - when rhashtable_walk_next() is called while p is not NULL and not
   safe, it walks the chain looking for the first object with an
   address greater than p and returns that.  If there is none, it moves
   to the next hash chain.

Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |   12 ++--
 lib/rhashtable.c   |   66 +++-
 2 files changed, 50 insertions(+), 28 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index b01d88e196c2..5ce6201f246e 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -207,6 +207,7 @@ struct rhashtable_iter {
struct rhashtable_walker walker;
unsigned int slot;
unsigned int skip;
+   bool p_is_unsafe;
bool end_of_table;
 };
 
@@ -730,6 +731,7 @@ static inline void *__rhashtable_insert_fast(
.ht = ht,
.key = key,
};
+   struct rhash_head __rcu **inspos;
struct rhash_head __rcu **pprev;
struct bucket_table *tbl;
struct rhash_head *head;
@@ -757,6 +759,7 @@ static inline void *__rhashtable_insert_fast(
data = ERR_PTR(-ENOMEM);
if (!pprev)
goto out;
+   inspos = pprev;
 
rht_for_each_continue(head, *pprev, tbl, hash) {
struct rhlist_head *plist;
@@ -768,6 +771,8 @@ static inline void *__rhashtable_insert_fast(
 params.obj_cmpfn(&arg, rht_obj(ht, head)) :
 rhashtable_compare(&arg, rht_obj(ht, head)))) {
pprev = &head->next;
+   if (head < obj)
+   inspos = &head->next;
continue;
}
 
@@ -798,7 +803,7 @@ static inline void *__rhashtable_insert_fast(
if (unlikely(rht_grow_above_100(ht, tbl)))
goto slow_path;
 
-   head = rht_dereference_bucket(*pprev, tbl, hash);
+   head = rht_dereference_bucket(*inspos, tbl, hash);
 
RCU_INIT_POINTER(obj->next, head);
if (rhlist) {
@@ -808,7 +813,7 @@ static inline void *__rhashtable_insert_fast(
RCU_INIT_POINTER(list->next, NULL);
}
 
-   rcu_assign_pointer(*pprev, obj);
+   rcu_assign_pointer(*inspos, obj);
 
atomic_inc(&ht->nelems);
if (rht_grow_above_75(ht, tbl))
@@ -1263,7 +1268,8 @@ static inline int rhashtable_walk_init(struct rhashtable *ht,
  * Note that if you restart a walk after rhashtable_walk_stop you
  * may see the same object twice.  Also, you may miss objects if
  * there are removals in between rhashtable_walk_stop and the next
- * call to rhashtable_walk_start.
+ * call to rhashtable_walk_start.  Note that this is different to
+ * rhashtable_walk_enter() which never leads to missing objects.
  *
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 16cde54a553b..be7eb57d9398 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -566,6 +566,10 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
return ERR_PTR(-ENOMEM);
 
head = rht_dereference_bucket(*pprev, tbl, hash);
+   while (!rht_is_a_nulls(head) && head < obj) {
+   pprev = &head->next;
+   head = rht_dereference_bucket(*pprev, 

[PATCH 6/6] rhashtable: add rhashtable_walk_prev()

2018-04-18 Thread NeilBrown
rhashtable_walk_prev() returns the object returned by
the previous rhashtable_walk_next(), providing it is still in the
table (or was during this grace period).
This works even if rhashtable_walk_stop() and rhashtable_walk_start()
have been called since the last rhashtable_walk_next().

If there have been no calls to rhashtable_walk_next(), or if the
object is gone from the table, then NULL is returned.

This can usefully be used in a seq_file ->start() function.
If the pos is the same as was returned by the last ->next() call,
then rhashtable_walk_prev() can be used to re-establish the
current location in the table.  If it returns NULL, then
rhashtable_walk_next() should be used.

Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |1 +
 lib/rhashtable.c   |   30 ++
 2 files changed, 31 insertions(+)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 5ce6201f246e..b1ad2b6a3f3f 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -397,6 +397,7 @@ static inline void rhashtable_walk_start(struct rhashtable_iter *iter)
 
 void *rhashtable_walk_next(struct rhashtable_iter *iter);
 void *rhashtable_walk_peek(struct rhashtable_iter *iter);
+void *rhashtable_walk_prev(struct rhashtable_iter *iter);
 void rhashtable_walk_stop(struct rhashtable_iter *iter) __releases(RCU);
 
 void rhashtable_free_and_destroy(struct rhashtable *ht,
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index be7eb57d9398..d2f941146ea3 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -910,6 +910,36 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
 
+/**
+ * rhashtable_walk_prev - Return the previously returned object, if available
+ * @iter:  Hash table iterator
+ *
+ * If rhashtable_walk_next() has previously been called and the object
+ * it returned is still in the hash table, that object is returned again,
+ * otherwise %NULL is returned.
+ *
+ * If the recent rhashtable_walk_next() call was since the most recent
+ * rhashtable_walk_start() call then the returned object may not, strictly
+ * speaking, still be in the table.  It will be safe to dereference.
+ *
+ * Note that the iterator is not changed and in particular it does not
+ * step backwards.
+ */
+void *rhashtable_walk_prev(struct rhashtable_iter *iter)
+{
+   struct rhashtable *ht = iter->ht;
+   struct rhash_head *p = iter->p;
+
+   if (!p)
+   return NULL;
+   if (!iter->p_is_unsafe || ht->rhlist)
+   return p;
+   rht_for_each_rcu(p, iter->walker.tbl, iter->slot)
+   if (p == iter->p)
+   return p;
+   return NULL;
+}
+
 /**
  * rhashtable_walk_peek - Return the next object but don't advance the iterator
  * @iter:  Hash table iterator




[PATCH 6/6] rhashtable: add rhashtable_walk_prev()

2018-04-17 Thread NeilBrown
rhashtable_walk_prev() returns the object returned by
the previous rhashtable_walk_next(), providing it is still in the
table (or was during this grace period).
This works even if rhashtable_walk_stop() and rhashtable_walk_start()
have been called since the last rhashtable_walk_next().

If there have been no calls to rhashtable_walk_next(), or if the
object is gone from the table, then NULL is returned.

This can usefully be used in a seq_file ->start() function.
If the pos is the same as was returned by the last ->next() call,
then rhashtable_walk_prev() can be used to re-establish the
current location in the table.  If it returns NULL, then
rhashtable_walk_next() should be used.

Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |1 +
 lib/rhashtable.c   |   30 ++
 2 files changed, 31 insertions(+)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 5ce6201f246e..b1ad2b6a3f3f 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -397,6 +397,7 @@ static inline void rhashtable_walk_start(struct rhashtable_iter *iter)
 
 void *rhashtable_walk_next(struct rhashtable_iter *iter);
 void *rhashtable_walk_peek(struct rhashtable_iter *iter);
+void *rhashtable_walk_prev(struct rhashtable_iter *iter);
 void rhashtable_walk_stop(struct rhashtable_iter *iter) __releases(RCU);
 
 void rhashtable_free_and_destroy(struct rhashtable *ht,
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index be7eb57d9398..d2f941146ea3 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -910,6 +910,36 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
 
+/**
+ * rhashtable_walk_prev - Return the previously returned object, if available
+ * @iter:  Hash table iterator
+ *
+ * If rhashtable_walk_next() has previously been called and the object
+ * it returned is still in the hash table, that object is returned again,
+ * otherwise %NULL is returned.
+ *
+ * If the recent rhashtable_walk_next() call was since the most recent
+ * rhashtable_walk_start() call then the returned object may not, strictly
+ * speaking, still be in the table.  It will be safe to dereference.
+ *
+ * Note that the iterator is not changed and in particular it does not
+ * step backwards.
+ */
+void *rhashtable_walk_prev(struct rhashtable_iter *iter)
+{
+   struct rhashtable *ht = iter->ht;
+   struct rhash_head *p = iter->p;
+
+   if (!p)
+   return NULL;
+   if (!iter->p_is_unsafe || ht->rhlist)
+   return p;
+   rht_for_each_rcu(p, iter->walker.tbl, iter->slot)
+   if (p == iter->p)
+   return p;
+   return NULL;
+}
+
 /**
  * rhashtable_walk_peek - Return the next object but don't advance the iterator
  * @iter:  Hash table iterator




[PATCH 5/6] rhashtable: further improve stability of rhashtable_walk

2018-04-17 Thread NeilBrown
If the sequence:
   obj = rhashtable_walk_next(iter);
   rhashtable_walk_stop(iter);
   rhashtable_remove_fast(ht, &obj->head, params);
   rhashtable_walk_start(iter);

 races with another thread inserting or removing
 an object on the same hash chain, a subsequent
 rhashtable_walk_next() is not guaranteed to get the "next"
 object. It is possible that an object could be
 repeated, or missed.

 This can be made more reliable by keeping the objects in a hash chain
 sorted by memory address.  A subsequent rhashtable_walk_next()
 call can reliably find the correct position in the list, and thus
 find the 'next' object.

 It is not possible (or at least not easy) to achieve this with an
 rhltable, because keeping its hash chains in order is harder.  When the
 first object with a given key is removed, it is replaced in the chain
 with the next object with the same key, and the address of that
 object may not be correctly ordered.
 No current user of rhltable_walk_enter() calls
 rhashtable_walk_start() more than once, so no current code
 could benefit from a more reliable walk of rhltables.

 This patch only attempts to improve walks for rhashtables.
 - a new object is always inserted after the last object with a
   smaller address, or at the start
 - when rhashtable_walk_start() is called, it records that 'p' is not
   'safe', meaning that it cannot be dereferenced.  The revalidation
   that was previously done here is moved to rhashtable_walk_next()
 - when rhashtable_walk_next() is called while p is not NULL and not
   safe, it walks the chain looking for the first object with an
   address greater than p and returns that.  If there is none, it moves
   to the next hash chain.

Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |   12 ++--
 lib/rhashtable.c   |   66 +++-
 2 files changed, 50 insertions(+), 28 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index b01d88e196c2..5ce6201f246e 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -207,6 +207,7 @@ struct rhashtable_iter {
struct rhashtable_walker walker;
unsigned int slot;
unsigned int skip;
+   bool p_is_unsafe;
bool end_of_table;
 };
 
@@ -730,6 +731,7 @@ static inline void *__rhashtable_insert_fast(
.ht = ht,
.key = key,
};
+   struct rhash_head __rcu **inspos;
struct rhash_head __rcu **pprev;
struct bucket_table *tbl;
struct rhash_head *head;
@@ -757,6 +759,7 @@ static inline void *__rhashtable_insert_fast(
data = ERR_PTR(-ENOMEM);
if (!pprev)
goto out;
+   inspos = pprev;
 
rht_for_each_continue(head, *pprev, tbl, hash) {
struct rhlist_head *plist;
@@ -768,6 +771,8 @@ static inline void *__rhashtable_insert_fast(
 params.obj_cmpfn(&arg, rht_obj(ht, head)) :
 rhashtable_compare(&arg, rht_obj(ht, head)))) {
pprev = &head->next;
+   if (head < obj)
+   inspos = &head->next;
continue;
}
 
@@ -798,7 +803,7 @@ static inline void *__rhashtable_insert_fast(
if (unlikely(rht_grow_above_100(ht, tbl)))
goto slow_path;
 
-   head = rht_dereference_bucket(*pprev, tbl, hash);
+   head = rht_dereference_bucket(*inspos, tbl, hash);
 
RCU_INIT_POINTER(obj->next, head);
if (rhlist) {
@@ -808,7 +813,7 @@ static inline void *__rhashtable_insert_fast(
RCU_INIT_POINTER(list->next, NULL);
}
 
-   rcu_assign_pointer(*pprev, obj);
+   rcu_assign_pointer(*inspos, obj);
 
atomic_inc(&ht->nelems);
if (rht_grow_above_75(ht, tbl))
@@ -1263,7 +1268,8 @@ static inline int rhashtable_walk_init(struct rhashtable *ht,
  * Note that if you restart a walk after rhashtable_walk_stop you
  * may see the same object twice.  Also, you may miss objects if
  * there are removals in between rhashtable_walk_stop and the next
- * call to rhashtable_walk_start.
+ * call to rhashtable_walk_start.  Note that this is different to
+ * rhashtable_walk_enter() which never leads to missing objects.
  *
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 16cde54a553b..be7eb57d9398 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -566,6 +566,10 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
return ERR_PTR(-ENOMEM);
 
head = rht_dereference_bucket(*pprev, tbl, hash);
+   while (!rht_is_a_nulls(head) && head < obj) {
+   pprev = &head->next;
+   head = rht_dereference_bucket(*pprev, 

[PATCH 0/6] Assorted rhashtable improvements.

2018-04-17 Thread NeilBrown
Some of these have been posted before and a couple
received an Ack from Herbert, but haven't appeared in any git tree
yet.
Another (the first) has been sent but received no ack.

I've added the second patch, which removes more incorrect
documentation, and added the last two patches.

One further improves rhashtable_walk stability.
The last adds rhashtable_walk_prev(), as discussed with Herbert,
which should be useful for seq_files.
(Separately I've posted a patch to Al Viro to make seq_file even
easier to use with rhashtables, but this series does not depend
on that patch).

I don't see these patches as particularly urgent, though the third
fixes a bug that currently prevents me from allowing one rhashtable in
lustre to auto-shrink.

I previously suggested it might be good for some of these patches to
go upstream through 'staging' with the lustre patches.  I no longer
think that is necessary.  It is probably best for them to go upstream
through net or net-next.

Thanks,
NeilBrown


---

NeilBrown (6):
  rhashtable: remove outdated comments about grow_decision etc
  rhashtable: remove incorrect comment on r{hl,hash}table_walk_enter()
  rhashtable: reset iter when rhashtable_walk_start sees new table
  rhashtable: improve rhashtable_walk stability when stop/start used.
  rhashtable: further improve stability of rhashtable_walk
  rhashtable: add rhashtable_walk_prev()


 include/linux/rhashtable.h |   49 +-
 lib/rhashtable.c   |  121 +---
 2 files changed, 126 insertions(+), 44 deletions(-)

--
Signature



[PATCH 3/6] rhashtable: reset iter when rhashtable_walk_start sees new table

2018-04-17 Thread NeilBrown
The documentation claims that when rhashtable_walk_start_check()
detects a resize event, it will rewind back to the beginning
of the table.  This is not true.  We need to set ->slot and
->skip to be zero for it to be true.

Acked-by: Herbert Xu 
Signed-off-by: NeilBrown 
---
 lib/rhashtable.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 19db8e563c40..28e1be9f681b 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -733,6 +733,8 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
 
if (!iter->walker.tbl && !iter->end_of_table) {
iter->walker.tbl = rht_dereference_rcu(ht->tbl, ht);
+   iter->slot = 0;
+   iter->skip = 0;
return -EAGAIN;
}
 




[PATCH 4/6] rhashtable: improve rhashtable_walk stability when stop/start used.

2018-04-17 Thread NeilBrown
When a walk of an rhashtable is interrupted with rhashtable_walk_stop()
and then rhashtable_walk_start(), the location to restart from is based
on a 'skip' count in the current hash chain, and this can be incorrect
if insertions or deletions have happened.  This does not happen when
the walk is not stopped and started as iter->p is a placeholder which
is safe to use while holding the RCU read lock.

In rhashtable_walk_start() we can revalidate that 'p' is still in the
same hash chain.  If it isn't then the current method is still used.

With this patch, if a rhashtable walker ensures that the current
object remains in the table over a stop/start period (possibly by
elevating the reference count if that is sufficient), it can be sure
that a walk will not miss objects that were in the hashtable for the
whole time of the walk.

rhashtable_walk_start() may not find the object even though it is
still in the hashtable if a rehash has moved it to a new table.  In
this case it will (eventually) get -EAGAIN and will need to proceed
through the whole table again to be sure to see everything at least
once.

Acked-by: Herbert Xu 
Signed-off-by: NeilBrown 
---
 lib/rhashtable.c |   44 +---
 1 file changed, 41 insertions(+), 3 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 28e1be9f681b..16cde54a553b 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -723,6 +723,7 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
__acquires(RCU)
 {
struct rhashtable *ht = iter->ht;
+   bool rhlist = ht->rhlist;
 
rcu_read_lock();
 
@@ -731,13 +732,52 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
list_del(&iter->walker.list);
spin_unlock(&ht->lock);
 
-   if (!iter->walker.tbl && !iter->end_of_table) {
+   if (iter->end_of_table)
+   return 0;
+   if (!iter->walker.tbl) {
iter->walker.tbl = rht_dereference_rcu(ht->tbl, ht);
iter->slot = 0;
iter->skip = 0;
return -EAGAIN;
}
 
+   if (iter->p && !rhlist) {
+   /*
+* We need to validate that 'p' is still in the table, and
+* if so, update 'skip'
+*/
+   struct rhash_head *p;
+   int skip = 0;
+   rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
+   skip++;
+   if (p == iter->p) {
+   iter->skip = skip;
+   goto found;
+   }
+   }
+   iter->p = NULL;
+   } else if (iter->p && rhlist) {
+   /* Need to validate that 'list' is still in the table, and
+* if so, update 'skip' and 'p'.
+*/
+   struct rhash_head *p;
+   struct rhlist_head *list;
+   int skip = 0;
+   rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
+   for (list = container_of(p, struct rhlist_head, rhead);
+list;
+list = rcu_dereference(list->next)) {
+   skip++;
+   if (list == iter->list) {
+   iter->p = p;
+   iter->skip = skip;
+   goto found;
+   }
+   }
+   }
+   iter->p = NULL;
+   }
+found:
return 0;
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_start_check);
@@ -913,8 +953,6 @@ void rhashtable_walk_stop(struct rhashtable_iter *iter)
iter->walker.tbl = NULL;
spin_unlock(&ht->lock);
 
-   iter->p = NULL;
-
 out:
rcu_read_unlock();
 }




[PATCH 1/6] rhashtable: remove outdated comments about grow_decision etc

2018-04-17 Thread NeilBrown
grow_decision and shrink_decision no longer exist, so remove
the remaining references to them.

Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |   33 ++---
 1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 1f8ad121eb43..87d443a5b11d 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -836,9 +836,8 @@ static inline void *__rhashtable_insert_fast(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhashtable_insert_fast(
struct rhashtable *ht, struct rhash_head *obj,
@@ -866,9 +865,8 @@ static inline int rhashtable_insert_fast(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhltable_insert_key(
struct rhltable *hlt, const void *key, struct rhlist_head *list,
@@ -890,9 +888,8 @@ static inline int rhltable_insert_key(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhltable_insert(
struct rhltable *hlt, struct rhlist_head *list,
@@ -922,9 +919,8 @@ static inline int rhltable_insert(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhashtable_lookup_insert_fast(
struct rhashtable *ht, struct rhash_head *obj,
@@ -981,9 +977,8 @@ static inline void *rhashtable_lookup_get_insert_fast(
  *
  * Lookups may occur in parallel with hashtable mutations and resizing.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  *
  * Returns zero on success.
  */
@@ -1134,8 +1129,8 @@ static inline int __rhashtable_remove_fast(
  * walk the bucket chain upon removal. The removal operation is thus
  * considerable slow if the hash table is not correctly sized.
  *
- * Will automatically shrink the table via rhashtable_expand() if the
- * shrink_decision function specified at rhashtable_init() returns true.
+ * Will automatically shrink the table if permitted when residency drops
+ * below 30%.
  *
  * Returns zero on success, -ENOENT if the entry could not be found.
  */
@@ -1156,8 +1151,8 @@ static inline int rhashtable_remove_fast(
  * walk the bucket chain upon removal. The removal operation is thus
  * considerable slow if the hash table is not correctly sized.
  *
- * Will automatically shrink the table via rhashtable_expand() if the
- * shrink_decision function specified at rhashtable_init() returns true.
+ * Will automatically shrink the table if permitted when residency drops
+ * below 30%
  *
  * Returns zero on success, -ENOENT if the entry could not be found.
  */




[PATCH 2/6] rhashtable: remove incorrect comment on r{hl, hash}table_walk_enter()

2018-04-17 Thread NeilBrown
Neither rhashtable_walk_enter() nor rhltable_walk_enter() sleeps, so
remove the comments which suggest that they do.

Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |3 ---
 lib/rhashtable.c   |3 ---
 2 files changed, 6 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 87d443a5b11d..b01d88e196c2 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -1268,9 +1268,6 @@ static inline int rhashtable_walk_init(struct rhashtable *ht,
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
  *
- * This function may sleep so you must not call it from interrupt
- * context or with spin locks held.
- *
  * You must call rhashtable_walk_exit after this function returns.
  */
 static inline void rhltable_walk_enter(struct rhltable *hlt,
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 2b2b79974b61..19db8e563c40 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -668,9 +668,6 @@ EXPORT_SYMBOL_GPL(rhashtable_insert_slow);
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
  *
- * This function may sleep so you must not call it from interrupt
- * context or with spin locks held.
- *
  * You must call rhashtable_walk_exit after this function returns.
  */
 void rhashtable_walk_enter(struct rhashtable *ht, struct rhashtable_iter *iter)




Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-04-05 Thread NeilBrown
On Thu, Mar 29 2018, Herbert Xu wrote:

> On Thu, Mar 29, 2018 at 08:26:21AM +1100, NeilBrown wrote:
>>
>> I say "astronomically unlikely", you say "probability .. is extremely
>> low".  I think we are in agreement here.
>> 
>> The point remains that if an error *can* be returned then I have to
>> write code to handle it and test that code.  I'd rather not.
>
> You have be able to handle errors anyway because of memory allocation
> failures.  Ultimately if you keep inserting you will eventually
> fail with ENOMEM.  So I don't see the issue with an additional
> error value.

You don't need to handle memory allocation failures at the point where
you insert into the table - adding to a linked list requires no new
memory.

If you are running out of memory, you will fail to allocate new objects,
so you won't be able to add them to the rhashtable.  Obviously you have
to handle that failure, and that is likely to save rhashtable from
getting many mem-alloc failures.

But once you have allocated a new object, rhashtable should just accept
it.  It doesn't help anyone to have the insertion fail.
The insertion can currently fail even when allocating a new object would
succeed, as insertion can require a GFP_ATOMIC allocation to succeed,
while allocating a new object might only need a GFP_KERNEL allocation to
succeed. 
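
To spell out the pattern I care about, here is a simplified sketch
(struct my_obj, my_params and my_insert are made-up names, not taken
from any real user):

#include <linux/rhashtable.h>
#include <linux/slab.h>

struct my_obj {
	struct rhash_head node;
	u32 key;
};

static const struct rhashtable_params my_params = {
	.key_len	= sizeof(u32),
	.key_offset	= offsetof(struct my_obj, key),
	.head_offset	= offsetof(struct my_obj, node),
	.automatic_shrinking = true,
};

static int my_insert(struct rhashtable *ht, u32 key)
{
	struct my_obj *obj;
	int err;

	/* Allocation failure is handled here, where GFP_KERNEL is fine. */
	obj = kzalloc(sizeof(*obj), GFP_KERNEL);
	if (!obj)
		return -ENOMEM;
	obj->key = key;

	/* Linking 'node' into a hash chain needs no new memory, yet the
	 * insert can still fail (e.g. -ENOMEM or -EBUSY from the resize
	 * path), so this error handling cannot currently be avoided.
	 */
	err = rhashtable_insert_fast(ht, &obj->node, my_params);
	if (err)
		kfree(obj);
	return err;
}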

>
>> > Even if it does happen we won't fail because we will perform
>> > an immediate rehash.  We only fail if it happens right away
>> > after the rehash (that is, at least another 16 elements have
>> > been inserted and you're trying to insert a 17th element, all
>> > while the new hash table has not been completely populated),
>> > which means that somebody has figured out our hash secret and
>> > failing in that case makes sense.
>
> BTW, you didn't acknowledge this bit which I think is crucial to
> how likely such an error is.

The likelihood of the error isn't really the issue - it either can
happen or it cannot.  If it can, it requires code to handle it.

>
>> I never suggested retrying, but I would have to handle it somehow.  I'd
>> rather not.
>
> ...
>
>> While I have no doubt that there are hashtables where someone could try
>> to attack the hash, I am quite sure there are others where is such an
>> attack is meaningless - any code which could generate the required range of
>> keys, could do far worse things more easily.
>
> Our network hashtable has to be secure against adversaries.  I
> understand that this may not be important to your use-case.  However,
> given the fact that the failure would only occur if an adversary
> is present and actively attacking your hash table, I don't think
> it has that much of a negative effect on your use-case either.

I certainly accept that there are circumstances where it is a real
possibility that an adversary might succeed in attacking the hash
function, and changing the seed for each table is an excellent idea.
It isn't clear to me that failing insertions is needed - it is done
rather late and so doesn't throttle much, and could give the attacker
feedback that they had succeeded (?).

Not all hash tables are susceptible to attack.  It would be nice if
developers working in less exposed areas could use rhashtable without ever
having to check for errors.

Error codes are messages from the implementation to the caller.
They should be chosen to help the caller make useful choices, not to
allow the implementation to avoid awkward situations.

>
> Of course if you can reproduce the EBUSY error without your
> disable_count patch or even after you have fixed the issue I
> have pointed out in your disable_count patch you can still reproduce
> it then that would suggest a real bug and we would need to fix it,
> for everyone.

I suspect that I cannot.  However experience tells me that customers do
things that I cannot and would never expect - they can often trigger
errors that I cannot.  It is best if the error cannot be returned.

>
>> Yes, storing a sharded count in the spinlock table does seem like an
>> appropriate granularity.  However that leads me to ask: why do we have
>> the spinlock table?  Why not bit spinlocks in the hashchain head like
>> include/linux/list_bl uses?
>
> The spinlock table predates rhashtable.  Perhaps Thomas/Eric/Dave
> can elucidate this.

Maybe I'll submit an RFC patch to change it - if I can make it work.

>
>> I don't understand how it can ever be "too late", though I appreciate
>> that in some cases "sooner" is better than "later"
>> If we give up on the single atomic_t counter, then we must accept that
>> the number of eleme

Re: [Cluster-devel] [PATCH v2 0/2] gfs2: Stop using rhashtable_walk_peek

2018-04-05 Thread NeilBrown
On Wed, Apr 04 2018, Andreas Grünbacher wrote:

> Herbert Xu  wrote on Wed, 4 Apr 2018 at 17:51:
>
>> On Wed, Apr 04, 2018 at 11:46:28AM -0400, Bob Peterson wrote:
>> >
>> > The patches look good. The big question is whether to add them to this
>> > merge window while it's still open. Opinions?
>>
>> We're still hashing out the rhashtable interface so I don't think now is
>> the time to rush things.
>
>
> Fair enough. No matter how rhashtable_walk_peek changes, we‘ll still need
> these two patches to fix the glock dump though.

Those two patches look fine to me and don't depend on changes to
rhashtable, so it is up to you when they go upstream.

However, I think the code can be substantially simplified, particularly
once we make rhashtable a little cleverer.
So this is what I'll probably be doing for a similar situation in
lustre

Having examined seqfile closely, it is apparent that if ->start never
changes *ppos, and if ->next always increments it (except maybe on error)
then

1/ ->next is only ever given a 'pos' that was returned by the previous
   call to ->start or ->next.  So it should *always* return the next
   object, after the one previously returned by ->start or ->next.  It
   never needs to 'seek'. The 'traverse()' function in seq_file.c does
   any seeking needed.  ->next needs to increase 'pos' and when walking
   a dynamic list, it is easiest if it just increments it.

2/ ->start is only called with a pos of:
0 - in this case it should rewind to the start
the last pos passed to ->start or ->next
 In this case it should return the same thing that was
 returned last time.  If it no longer exists, then
 the following one should be returned.
one more than the last pos passed to ->start or ->next
 In this case it should return the object after the
 last one returned.
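
For example, a ->start/->next pair following this contract can be as
simple as the sketch below - my_lookup() is a made-up helper that maps a
position to an object, or NULL when past the end:

static void *my_seq_start(struct seq_file *seq, loff_t *pos)
{
	/* never changes *pos; traverse() has already done any seeking */
	return my_lookup(seq, *pos);
}

static void *my_seq_next(struct seq_file *seq, void *v, loff_t *pos)
{
	++*pos;				/* always increment */
	return my_lookup(seq, *pos);
}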

The proposed enhancement to rhashtable_walk* is to add a
rhashtable_walk_prev() which returns the previously returned object,
if it is still in the table, or NULL. It also enhances
rhashtable_walk_start() so that if the previously returned object is
still in the table, it is preserved as the current cursor.
This means that if you take some action to ensure that the
previously returned object remains in the table until the next ->start,
then you can reliably walk the table with no duplicates or omissions
(unless a concurrent rehash causes duplicates).
If you don't keep the object in the table and it gets removed, then
the 'skip' counter is used to find your place, and you might get
duplicates or omissions.
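
With that enhancement, a seq_file ->start for an rhashtable could look
roughly like this (sketch only - it assumes the proposed
rhashtable_walk_prev(), a hypothetical foo_iter kept in seq->private, and
that a reference is held on the last returned object; rewinding when
*ppos is zero is omitted):

static void *foo_seq_start(struct seq_file *seq, loff_t *pos)
{
	struct foo_iter *fi = seq->private;
	void *obj;

	rhashtable_walk_start(&fi->hti);
	if (*pos == fi->last_pos &&
	    (obj = rhashtable_walk_prev(&fi->hti)) != NULL)
		return obj;		/* same object as last time */
	fi->last_pos = *pos;
	return rhashtable_walk_next(&fi->hti);
}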

NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH v2 0/2] gfs2: Stop using rhashtable_walk_peek

2018-04-02 Thread NeilBrown
On Fri, Mar 30 2018, Herbert Xu wrote:

> On Thu, Mar 29, 2018 at 06:52:34PM +0200, Andreas Gruenbacher wrote:
>>
>> Should rhashtable_walk_peek be kept around even if there are no more
>> users? I have my doubts.
>
> Absolutely.  All netlink dumps using rhashtable_walk_next are buggy
> and need to switch over to rhashtable_walk_peek.  As otherwise
> the object that triggers the out-of-space condition will be skipped
> upon resumption.

Do we really need a rhashtable_walk_peek() interface?
I imagine that a seqfile ->start function can do:

  if (*ppos == 0 && last_pos != 0) {
rhashtable_walk_exit(&iter);
rhashtable_walk_enter(&table, &iter);
last_pos = 0;
  }
  rhashtable_walk_start(&iter);
  if (*ppos == last_pos && iter.p)
return iter.p;
  last_pos = *ppos;
  return rhashtable_walk_next(&iter);


and the ->next function just does

  last_pos = *ppos;
  *ppos += 1;
  do p = rhashtable_walk_next(&iter); while (IS_ERR(p));
  return p;

It might be OK to have a function call instead of expecting people to
use iter.p directly.

static inline void *rhashtable_walk_prev(struct rhashtable_iter *iter)
{
return iter->p;
}

Thoughts?

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH 0/2] rhashtable_walk fixes

2018-04-02 Thread NeilBrown
On Fri, Mar 30 2018, David Miller wrote:

> From: NeilBrown 
> Date: Thu, 29 Mar 2018 12:19:09 +1100
>
>> These two patches apply on top of my previous "rhashtable: reset iter
>> when rhashtable_walk_start sees new table" patch.
>> 
>> The first fixes a bug that I found in rhltable_insert().
>> 
>> The second is an alternate to my "rhashtable: allow a walk of the hash
>> table without missing object."
>> This version doesn't require an API change and should be reliable for
>> rhltables too (my first version didn't handle these correctly).
>
> Neil, please don't mix and match patches.
>
> Also when you need to change a patch in a series, please post the entire
> new series not just the patch that changes.
>
> Patch #1 in this series is unnecessary.  As Herbert explained this has
> been fixed already.
>
> So please repost freshly the patches that are relevant and you want me
> to consider for inclusion.  Also be explicit and clear about which of
> my two networking trees you are targetting these changes.

Hi Dave,
 I'm sorry if I've caused some confusion, but I didn't think that I was
 submitting patches to you and know nothing about your two trees.
 I was submitting patches to Thomas and Herbert, the registered
 maintainers of rhashtable.  I assumed they would review, respond, and
 take responsibility for getting them upstream, if that's what they
 decided, based on whatever arrangements they have in place.

 If it is appropriate I can resend all of my patches that receive an
 Ack as a coherent series, and send this to you nominating a particular
 tree, but I'm unlikely to do that unless asked and told which tree to
 nominate.

Thanks,
NeilBrown


signature.asc
Description: PGP signature


[PATCH 0/2] rhashtable_walk fixes

2018-03-28 Thread NeilBrown
These two patches apply on top of my previous "rhashtable: reset iter
when rhashtable_walk_start sees new table" patch.

The first fixes a bug that I found in rhltable_insert().

The second is an alternate to my "rhashtable: allow a walk of the hash
table without missing object."
This version doesn't require an API change and should be reliable for
rhltables too (my first version didn't handle these correctly).

Thanks,
NeilBrown


---

NeilBrown (2):
  rhashtable: fix insertion in rhltable when duplicate found.
  rhashtable: improve rhashtable_walk stability when stop/start used.


 include/linux/rhashtable.h |4 +++-
 lib/rhashtable.c   |   48 
 2 files changed, 47 insertions(+), 5 deletions(-)

--
Signature



[PATCH 1/2] rhashtable: fix insertion in rhltable when duplicate found.

2018-03-28 Thread NeilBrown
When rhltable_insert() finds an entry with the same key,
it splices the new entry at the start of a list of entries with the
same key.
It stores the address of the new object in *pprev, but in general this
is *not* the location where the match was found (though in a common
case where the hash chain has one element, it will be).
To fix this, pprev should be updated every time we find an object that
doesn't match.

This patch changes the behaviour for non-slow insertion in that now
insertion happens at the end of a chain rather than at the head.
I don't think this is an important change.

Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |4 +++-
 lib/rhashtable.c   |4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index c9df2527e0cd..668a21f04b09 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -766,8 +766,10 @@ static inline void *__rhashtable_insert_fast(
if (!key ||
(params.obj_cmpfn ?
 params.obj_cmpfn(&arg, rht_obj(ht, head)) :
-    rhashtable_compare(&arg, rht_obj(ht, head))))
+    rhashtable_compare(&arg, rht_obj(ht, head)))) {
+   pprev = &head->next;
continue;
+   }
 
data = rht_obj(ht, head);
 
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 394de09a162c..72e56f89e20c 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -506,8 +506,10 @@ static void *rhashtable_lookup_one(struct rhashtable *ht,
if (!key ||
(ht->p.obj_cmpfn ?
 ht->p.obj_cmpfn(&arg, rht_obj(ht, head)) :
-    rhashtable_compare(&arg, rht_obj(ht, head))))
+    rhashtable_compare(&arg, rht_obj(ht, head)))) {
+   pprev = &head->next;
continue;
+   }
 
if (!ht->rhlist)
return rht_obj(ht, head);




[PATCH 2/2] rhashtable: improve rhashtable_walk stability when stop/start used.

2018-03-28 Thread NeilBrown
When a walk of an rhashtable is interrupted with rhashtable_walk_stop()
and then rhashtable_walk_start(), the location to restart from is based
on a 'skip' count in the current hash chain, and this can be incorrect
if insertions or deletions have happened.  This does not happen when
the walk is not stopped and started as iter->p is a placeholder which
is safe to use while holding the RCU read lock.

In rhashtable_walk_start() we can revalidate that 'p' is still in the
same hash chain.  If it isn't then the current method is still used.

With this patch, if a rhashtable walker ensures that the current
object remains in the table over a stop/start period (possibly by
elevating the reference count if that is sufficient), it can be sure
that a walk will not miss objects that were in the hashtable for the
whole time of the walk.

rhashtable_walk_start() may not find the object even though it is
still in the hashtable if a rehash has moved it to a new table.  In
this case it will (eventually) get -EAGAIN and will need to proceed
through the whole table again to be sure to see everything at least
once.

Signed-off-by: NeilBrown 
---
 lib/rhashtable.c |   44 +---
 1 file changed, 41 insertions(+), 3 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 72e56f89e20c..46cf0cc6cefc 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -725,6 +725,7 @@ int rhashtable_walk_start_check(struct rhashtable_iter 
*iter)
__acquires(RCU)
 {
struct rhashtable *ht = iter->ht;
+   bool rhlist = ht->rhlist;
 
rcu_read_lock();
 
@@ -733,13 +734,52 @@ int rhashtable_walk_start_check(struct rhashtable_iter 
*iter)
list_del(&iter->walker.list);
spin_unlock(&ht->lock);
 
-   if (!iter->walker.tbl && !iter->end_of_table) {
+   if (iter->end_of_table)
+   return 0;
+   if (!iter->walker.tbl) {
iter->walker.tbl = rht_dereference_rcu(ht->tbl, ht);
iter->slot = 0;
iter->skip = 0;
return -EAGAIN;
}
 
+   if (iter->p && !rhlist) {
+   /*
+* We need to validate that 'p' is still in the table, and
+* if so, update 'skip'
+*/
+   struct rhash_head *p;
+   int skip = 0;
+   rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
+   skip++;
+   if (p == iter->p) {
+   iter->skip = skip;
+   goto found;
+   }
+   }
+   iter->p = NULL;
+   } else if (iter->p && rhlist) {
+   /* Need to validate that 'list' is still in the table, and
+* if so, update 'skip' and 'p'.
+*/
+   struct rhash_head *p;
+   struct rhlist_head *list;
+   int skip = 0;
+   rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
+   for (list = container_of(p, struct rhlist_head, rhead);
+list;
+list = rcu_dereference(list->next)) {
+   skip++;
+   if (list == iter->list) {
+   iter->p = p;
+   iter->skip = skip;
+   goto found;
+   }
+   }
+   }
+   iter->p = NULL;
+   }
+found:
return 0;
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_start_check);
@@ -915,8 +955,6 @@ void rhashtable_walk_stop(struct rhashtable_iter *iter)
iter->walker.tbl = NULL;
spin_unlock(&ht->lock);
 
-   iter->p = NULL;
-
 out:
rcu_read_unlock();
 }




Re: [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.

2018-03-28 Thread NeilBrown
On Thu, Mar 29 2018, NeilBrown wrote:

>
> How about storing the hash chains in order by object address?
> Then rhashtable_walk_start() can easily find its place regardless of
> whether the old object was still present or not, using <= on the
> address.
> "Insert" would need to record an insert location and insert there rather
> than at the head of the chain.
> I might try coding that.

Unfortunately rhltables make this approach unworkable.
However while trying to write the code I found a bug :-(
I'll post a patch for the bug, and a patch to transparently make the
current interface more reliable when the caller keeps the current
object in the table.
I think this is sufficient for all current use-cases.

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH] gfs2: Stop using rhashtable_walk_peek

2018-03-28 Thread NeilBrown
On Wed, Mar 28 2018, Andreas Gruenbacher wrote:

> Function rhashtable_walk_peek is problematic because there is no
> guarantee that the glock previously returned still exists; when that key
> is deleted, rhashtable_walk_peek can end up returning a different key,
> which would cause an inconsistent glock dump.  So instead of using
> rhashtable_walk_peek, keep track of the current glock in the seq file
> iterator functions.
>
> Signed-off-by: Andreas Gruenbacher 
> ---
>  fs/gfs2/glock.c | 18 +-
>  1 file changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index 82fb5583445c..f1fc353875d3 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -55,6 +55,7 @@ struct gfs2_glock_iter {
>   struct gfs2_sbd *sdp;   /* incore superblock   */
>   struct rhashtable_iter hti; /* rhashtable iterator */
>   struct gfs2_glock *gl;  /* current glock struct*/
> + bool gl_held;
>   loff_t last_pos;/* last position   */
>  };
>  
> @@ -1923,9 +1924,11 @@ void gfs2_glock_exit(void)
>  
>  static void gfs2_glock_iter_next(struct gfs2_glock_iter *gi, loff_t n)
>  {
> - if (n == 0)
> - gi->gl = rhashtable_walk_peek(&gi->hti);
> - else {
> + if (n != 0 || !gi->gl) {
> + if (gi->gl_held) {
> + gfs2_glock_queue_put(gi->gl);
> + gi->gl_held = false;
> + }
>   gi->gl = rhashtable_walk_next(&gi->hti);
>   n--;
>   }

Thanks for this patch!
The above looks a bit fragile to me.
gfs2_glock_iter_next() (and hence gfs2_glock_seq_start()) will sometimes
exit with gl_held true, and sometimes with it false.
gfs2_glock_seq_stop() assumes that it is false.
Normally gfs2_glock_seq_next() will be called between these
two and will clear gl_held, but I don't think there is a hard guarantee
of that.
Maybe we should always 'put' gi->gl in iter_next if gl_held??
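i.e. something like this (untested sketch):

static void gfs2_glock_iter_next(struct gfs2_glock_iter *gi, loff_t n)
{
	if (gi->gl_held) {
		gfs2_glock_queue_put(gi->gl);
		gi->gl_held = false;
	}
	if (n != 0 || !gi->gl) {
		gi->gl = rhashtable_walk_next(&gi->hti);
		n--;
	}
	/* remainder of the function unchanged */
}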

Thanks,
NeilBrown



> @@ -1988,7 +1991,10 @@ static void gfs2_glock_seq_stop(struct seq_file *seq, 
> void *iter_ptr)
>  {
>   struct gfs2_glock_iter *gi = seq->private;
>  
> - gi->gl = NULL;
> + if (gi->gl) {
> + lockref_get(&gi->gl->gl_lockref);
> + gi->gl_held = true;
> + }
>   rhashtable_walk_stop(&gi->hti);
>  }
>  
> @@ -2061,6 +2067,7 @@ static int __gfs2_glocks_open(struct inode *inode, 
> struct file *file,
>*/
>   gi->last_pos = -1;
>   gi->gl = NULL;
> + gi->gl_held = false;
>   rhashtable_walk_enter(&gl_hash_table, &gi->hti);
>   }
>   return ret;
> @@ -2076,7 +2083,8 @@ static int gfs2_glocks_release(struct inode *inode, 
> struct file *file)
>   struct seq_file *seq = file->private_data;
>   struct gfs2_glock_iter *gi = seq->private;
>  
> - gi->gl = NULL;
> + if (gi->gl_held)
> + gfs2_glock_put(gi->gl);
>   rhashtable_walk_exit(&gi->hti);
>   return seq_release_private(inode, file);
>  }
> -- 
> 2.14.3


signature.asc
Description: PGP signature


Re: [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.

2018-03-28 Thread NeilBrown
On Wed, Mar 28 2018, Herbert Xu wrote:

> On Wed, Mar 28, 2018 at 06:17:57PM +1100, NeilBrown wrote:
>>
>> Sounds like over-kill to me.
>> It might be reasonable to have a CONFIG_DEBUG_RHASHTABLE which enables
>> extra to code to catch misuse, but I don't see the justification for
>> always performing these checks.
>> The DEBUG code could just scan the chain (usually quite short) to see if
>> the given element is present.  Of course it might have already been
>> rehashed to the next table, so you would to allow for that possibility -
>> probably check tbl->rehash.
>
> No this is not meant to debug users incorrectly using the cursor.
> This is a replacement of your continue interface by automatically
> validating the cursor.
>
> In fact we can make it even more reliable.  We can insert the walker
> right into the bucket chain, that way the walking will always be
> consistent.
>
> The only problem is that we need be able to differentiate between
> a walker, a normal object, and the end of the list.  I think it
> should be doable.

Yes, I think that could work.  The code to step over a walker object
during an unlocked search wouldn't be straightforward and would need
careful analysis.

How about storing the hash chains in order by object address?
Then rhashtable_walk_start() can easily find its place regardless of
whether the old object was still present or not, using <= on the
address.
"Insert" would need to record an insert location and insert there rather
than at the head of the chain.
I might try coding that.
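
The restart logic would then reduce to something like this (sketch only,
ignoring rhlist and nested tables):

	struct rhash_head *p;

	/* chains kept sorted by address: resume at the first entry at or
	 * above the old cursor */
	rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
		if ((unsigned long)p >= (unsigned long)iter->p)
			break;	/* the cursor itself, or its successor */
	}
	iter->p = rht_is_a_nulls(p) ? NULL : p;	/* NULL => move to next bucket */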

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-03-28 Thread NeilBrown
On Wed, Mar 28 2018, Herbert Xu wrote:

> On Wed, Mar 28, 2018 at 06:04:40PM +1100, NeilBrown wrote:
>>
>> I disagree.  My patch 6 only makes it common instead of exceedingly
>> rare.  If any table in the list other than the first has a chain with 16
>> elements, then trying to insert an element with a hash which matches
>> that chain will fail with -EBUSY.  This is theoretically possible
>> already, though astronomically unlikely.  So that case will never be
>> tested for.
>
> No that's not true.  If the table is correctly sized then the
> probability of having a chain with 16 elements is extremely low.

I say "astronomically unlikely", you say "probability .. is extremely
low".  I think we are in agreement here.

The point remains that if an error *can* be returned then I have to
write code to handle it and test that code.  I'd rather not.

>
> Even if it does happen we won't fail because we will perform
> an immediate rehash.  We only fail if it happens right away
> after the rehash (that is, at least another 16 elements have
> been inserted and you're trying to insert a 17th element, all
> while the new hash table has not been completely populated),
> which means that somebody has figured out our hash secret and
> failing in that case makes sense.
>
>> It is hard to know if it is necessary.  And making the new table larger
>> will make the error less likely, but still won't make it impossible.  So
>> callers will have to handle it - just like they currently have to handle
>> -ENOMEM even though it is highly unlikely (and not strictly necessary).
>
> Callers should not handle an ENOMEM error by retrying.  Nor should
> they retry an EBUSY return value.

I never suggested retrying, but I would have to handle it somehow.  I'd
rather not.

>
>> Are these errors ever actually useful?  I thought I had convinced myself
>> before that they were (to throttle attacks on the hash function), but
>> they happen even less often than I thought.
>
> The EBUSY error indicates that the hash table has essentially
> degenerated into a linked list because somebody has worked out
> our hash secret.

While I have no doubt that there are hashtables where someone could try
to attack the hash, I am quite sure there are others where such an
attack is meaningless - any code which could generate the required range of
keys could do far worse things more easily.

>
>> Maybe. Reading a percpu counter isn't cheap.  Reading it whenever a hash
>> chain reaches 16 is reasonable, but I think we would want to read it a
>> lot more often than that.  So probably store the last-sampled time (with
>> no locking) and only sample the counter if last-sampled is more than
>>  jiffies - 10*HZ (???)
>
> We could also take the spinlock table approach and have a counter
> per bucket spinlock.  This should be sufficient as you'll contend
> on the bucket spinlock table anyway.

Yes, storing a sharded count in the spinlock table does seem like an
appropriate granularity.  However that leads me to ask: why do we have
the spinlock table?  Why not bit spinlocks in the hashchain head like
include/linux/list_bl uses?

>
> This also allows us to estimate the total table size and not have
> to always do a last-ditch growth when it's too late.

I don't understand how it can ever be "too late", though I appreciate
that in some cases "sooner" is better than "later".
If we give up on the single atomic_t counter, then we must accept that
the number of elements could exceed any given value.  The only promise
we can provide is that it won't exceed N% of the table size for more than
T seconds.

Thanks,
NeilBrown


>
> Cheers,
> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


signature.asc
Description: PGP signature


Re: [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.

2018-03-28 Thread NeilBrown
On Wed, Mar 28 2018, Herbert Xu wrote:

> On Wed, Mar 28, 2018 at 08:54:41AM +1100, NeilBrown wrote:
>>
>> Possibly.
>> I particularly want the interface to require that you pass the
>> previously returned object to _continue. That makes it easy to see that
>> the object is still being used.  If someone changes to code to delete
>> the object before the _continue, there should be a strong hint that it
>> won't work.
>> 
>> Maybe it would be better to make it a WARN_ON()
>> 
>>   if (!obj || WARN_ON(iter->p != obj))
>>  iter->p = NULL;
>
> This doesn't really protect against the case where obj is removed.
> All it proves is that the user saved a copy of obj which we already
> did anyway.

True.  We ultimately need to trust the user not to do something silly.
We can do little things to help catch obvious errors though.  That was
my intent with the WARN_ON.

>
> To detect an actual removal you'd need to traverse the list.

Yes.  You could reset ->skip to zero and search for the given object.
That feels a bit like excess hand-holding to me, but probably wouldn't
be too expensive.  It only happens when you need to drop out of RCU, and
in that case you are probably already doing something expensive.

>
> I have another idea: we could save insert the walkers into the
> hash table chain at the end, essentially as a hidden list.  We
> can mark it with a flag like rht_marker so that normal traversal
> doesn't see it.
>
> That way the removal code can simply traverse that list and inform
> them that the iterator is invalid.

Sounds like over-kill to me.
It might be reasonable to have a CONFIG_DEBUG_RHASHTABLE which enables
extra code to catch misuse, but I don't see the justification for
always performing these checks.
The DEBUG code could just scan the chain (usually quite short) to see if
the given element is present.  Of course it might have already been
rehashed to the next table, so you would need to allow for that possibility -
probably check tbl->rehash.

Thanks,
NeilBrown

>
> Cheers,
> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


signature.asc
Description: PGP signature


Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-03-28 Thread NeilBrown
On Wed, Mar 28 2018, Herbert Xu wrote:

> On Wed, Mar 28, 2018 at 08:34:19AM +1100, NeilBrown wrote:
>>
>> It is easy to get an -EBUSY insertion failure when .disable_count is
>> enabled, and I did get that.  Blindly propagating that up caused lustre
>> to get terribly confused - not too surprising really.
>
> Right, so this failure mode is specific to your patch 6.

I disagree.  My patch 6 only makes it common instead of exceedingly
rare.  If any table in the list other than the first has a chain with 16
elements, then trying to insert an element with a hash which matches
that chain will fail with -EBUSY.  This is theoretically possible
already, though astronomically unlikely.  So that case will never be
tested for.

>
> I think I see the problem.  As it currently stands, we never
> need growing when we hit the bucket length limit.  We only do
> rehashes.
>
> With your patch, you must change it so that we actually try to
> grow the table if necessary when we hit the bucket length limit.

It is hard to know if it is necessary.  And making the new table larger
will make the error less likely, but still won't make it impossible.  So
callers will have to handle it - just like they currently have to handle
-ENOMEM even though it is highly unlikely (and not strictly necessary).

Are these errors ever actually useful?  I thought I had convinced myself
before that they were (to throttle attacks on the hash function), but
they happen even less often than I thought.

>
> Otherwise it will result in the EBUSY that you're seeing.
>
> I also think that we shouldn't make this a toggle.  If we're going
> to do disable_count then it should be unconditionally done for
> everyone.
>
> However, I personally would prefer a percpu elem count instead of
> disabling it completely.  Would that be acceptable to your use-case?

Maybe. Reading a percpu counter isn't cheap.  Reading it whenever a hash
chain reaches 16 is reasonable, but I think we would want to read it a
lot more often than that.  So probably store the last-sampled time (with
no locking) and only sample the counter if last-sampled is more than
 jiffies - 10*HZ (???)
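i.e. something like the sketch below (->last_sample and ->pnelems are
invented fields, just to show the shape of it):

	/* sample at most every 10 seconds; unlocked, races are harmless */
	if (time_after(jiffies, ht->last_sample + 10 * HZ)) {
		ht->last_sample = jiffies;
		if (percpu_counter_sum(&ht->pnelems) > 3 * (tbl->size / 4))
			schedule_work(&ht->run_work);
	}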

In the case in lustre we also shard the LRU list so that adding to the
LRU causes minimal contention. Keeping a shared counter together with
the lru is trivial and summing them periodically is little burden.
Maybe it makes sense to include that functionality in rhashtables so
that it is there for everyone.

A percpu counter uses a lot more memory than atomic_t.  Given that some
callers set nelem_hint to 2 or 3, it seems likely that those callers
don't want to waste memory.  Should we force them to?

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek()

2018-03-27 Thread NeilBrown
On Wed, Mar 28 2018, Andreas Grünbacher wrote:

> Neil,
>
> 2018-03-27 1:33 GMT+02:00 NeilBrown :
>> The documentation for rhashtable_walk_peek() is wrong.  It claims to
>> return the *next* entry, whereas it in fact returns the *previous*
>> entry.
>> However if no entries have yet been returned - or if the iterator
>> was reset due to a resize event, then rhashtable_walk_peek()
>> *does* return the next entry, but also advances the iterator.
>>
>> I suspect that this interface should be discarded and the one user
>> should be changed to not require it.  Possibly this patch should be
>> seen as a first step in that conversation.
>>
>> This patch mostly corrects the documentation, but does make a
>> small code change so that the documentation can be correct without
>> listing too many special cases.  I don't think the one user will
>> be affected by the code change.
>
> how about I come up with a replacement so that we can remove
> rhashtable_walk_peek straight away without making it differently
> broken in the meantime?
>

Hi Andreas,
 I'd be very happy with that outcome - thanks for the offer!

NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.

2018-03-27 Thread NeilBrown
On Tue, Mar 27 2018, Herbert Xu wrote:

> On Tue, Mar 27, 2018 at 10:33:04AM +1100, NeilBrown wrote:
>>
>> -int rhashtable_walk_start_check(struct rhashtable_iter *iter)
>> +int rhashtable_walk_start_continue(struct rhashtable_iter *iter, struct 
>> rhash_head *obj)
>>  __acquires(RCU)
>>  {
>>  struct rhashtable *ht = iter->ht;
>>  
>>  rcu_read_lock();
>>  
>> +if (!obj || iter->p != obj)
>> +iter->p = NULL;
>
> Why bother with this check at all? Couldn't we make it so that
> if you call continue then you continue with the cursor otherwise
> you set it to NULL as we currently do.

Possibly.
I particularly want the interface to require that you pass the
previously returned object to _continue. That makes it easy to see that
the object is still being used.  If someone changes to code to delete
the object before the _continue, there should be a strong hint that it
won't work.

Maybe it would be better to make it a WARN_ON()

  if (!obj || WARN_ON(iter->p != obj))
 iter->p = NULL;

??

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.

2018-03-27 Thread NeilBrown
On Tue, Mar 27 2018, David Miller wrote:

> From: NeilBrown 
> Date: Tue, 27 Mar 2018 10:33:04 +1100
>
>> In many cases where the walker needs to drop out of RCU protection,
>> it will take a reference to the object and this can prevent it from
>> being removed from the hash table.  In those cases, the last-returned
>> object can still be used as a cursor.  rhashtable cannot detect
>> these cases itself.
>
> Merely having an elevated reference count does not explicitly prevent
> the object from being removed from the hash table.
>
> This invariant might hold for the particular user of the rhashtable
> instance, but it is not always the case.

Agreed.  Hence "In many cases ... this *can* be prevented" and "In those
cases".

The doc comment for rhashtable_walk_start_continue() makes it clear
that:

 *   The
 * previously returned object must still be in the hash table, and must be
 * provided as an argument.

Thanks,
NeilBrown


signature.asc
Description: PGP signature


[PATCH 1/6 v2] rhashtable: improve documentation for rhashtable_walk_peek()

2018-03-27 Thread NeilBrown

In this version I:
 - fixed brace-after-else coding style, thanks Sergei Shtylyov,
 - explained where the one user is, thanks David Miller
 - added CC for author of rhashtable_walk_peek (Tom Herbert) and
   of the one usage (Andreas Gruenbacher) - thanks Herbert Xu

NeilBrown

-8<-
Subject: [PATCH] rhashtable: improve documentation for rhashtable_walk_peek()

The documentation for rhashtable_walk_peek() is wrong.  It claims to
return the *next* entry, whereas it in fact returns the *previous*
entry.
However if no entries have yet been returned - or if the iterator
was reset due to a resize event, then rhashtable_walk_peek()
*does* return the next entry, but also advances the iterator.

I suspect that this interface should be discarded and the one user
should be changed to not require it.  The only current user is
gfs2_glock_iter_next in fs/gfs2/glock.c, which is part of the
implementation of a seq_file which reports all the content of the
hash table.
Possibly this patch should be seen as a first step in that
conversation.

This patch mostly corrects the documentation, but does make a
small code change so that the documentation can be correct without
listing too many special cases.  I don't think gfs2_glock_iter_next
will be affected by the code change.

Cc: Tom Herbert 
Cc: Andreas Gruenbacher 
Signed-off-by: NeilBrown 
---
 lib/rhashtable.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 3825c30aaa36..9367816820ea 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -853,13 +853,17 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
 
 /**
- * rhashtable_walk_peek - Return the next object but don't advance the iterator
+ * rhashtable_walk_peek - Return the previously returned object without 
advancing the iterator
  * @iter:  Hash table iterator
  *
- * Returns the next object or NULL when the end of the table is reached.
+ * Returns the last object returned, or NULL if no object has yet been 
returned.
+ * If the previously returned object has since been removed, then some other 
arbitrary
+ * object may be returned, or possibly NULL will be returned.  In that case, the
+ * iterator might be advanced.
  *
  * Returns -EAGAIN if resize event occurred.  Note that the iterator
- * will rewind back to the beginning and you may continue to use it.
+ * will rewind back to the beginning and rhashtable_walk_next() should be
+ * used to get the next object.
  */
 void *rhashtable_walk_peek(struct rhashtable_iter *iter)
 {
@@ -880,6 +884,12 @@ void *rhashtable_walk_peek(struct rhashtable_iter *iter)
 * the table hasn't changed.
 */
iter->skip--;
+   } else {
+   /* ->skip is only zero after rhashtable_walk_start()
+* or when the iterator is reset.  In this case there
+* is no previous object to return.
+*/
+   return NULL;
}
 
return __rhashtable_walk_find_next(iter);
-- 
2.14.0.rc0.dirty



signature.asc
Description: PGP signature


Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-03-27 Thread NeilBrown
On Tue, Mar 27 2018, Herbert Xu wrote:

> On Tue, Mar 27, 2018 at 10:33:04AM +1100, NeilBrown wrote:
>> The current rhashtable will fail an insertion if the hashtable
>> it "too full", one of:
>>  - table already has 2^31 elements (-E2BIG)
>>  - a max_size was specified and table already has that
>>many elements (rounded up to power of 2) (-E2BIG)
>>  - a single chain has more than 16 elements (-EBUSY)
>>  - table has more elements than the current table size,
>>and allocating a new table fails (-ENOMEM)
>>  - a new page needed to be allocated for a nested table,
>>and the memory allocation failed (-ENOMEM).
>> 
>> A traditional hash table does not have a concept of "too full", and
>> insertion only fails if the key already exists.  Many users of hash
>> tables have separate means of limiting the total number of entries,
>> and are not susceptible to an attack which could cause unusually large
>> hash chains.  For those users, the need to check for errors when
>> inserting objects to an rhashtable is an unnecessary burden and hence
>> a potential source of bugs (as these failures are likely to be rare).
>
> Did you actually encounter an insertion failure? The current code
> should never fail an insertion until you actually run ouf memory.
> That is unless you're using rhashtable when you should be using
> rhlist instead.

It is easy to get an -EBUSY insertion failure when .disable_count is
enabled, and I did get that.  Blindly propagating that up caused lustre
to get terribly confused - not too surprising really.

Even if I didn't see errors in practice, if the interface can return an
error, then I need to check for the error and really should test that
handling each error works correctly.  It is much easier to write
reliable code when errors cannot happen, so I'd rather have that option.

Thanks,
NeilBrown


signature.asc
Description: PGP signature


[PATCH 0/6] rhashtable: assorted fixes and enhancements

2018-03-26 Thread NeilBrown
Hi,
 I'm hoping to use rhashtable in lustre, to replace the resizeable
 hashtable implementation in libcfs.
 While working through the conversion I found some minor bugs in the
 rhashtable code and documentation, and some areas where enhancements
 could make rhashtable a better fit for lustre.

 Following 6 patches are the result.  Please review.

 It would help me if I could get an Ack for these patches, and could
 then submit them through the drivers/staging tree together with the
 lustre changes that make use to rhashtable.  The first 2 are mostly
 just fixes to comments and can go in through the netdev tree if you
 prefer - the last 4 are needed for lustre to work
 correctly/optimally.

Thanks,
NeilBrown


---

NeilBrown (6):
  rhashtable: improve documentation for rhashtable_walk_peek()
  rhashtable: remove outdated comments about grow_decision etc
  rhashtable: reset iter when rhashtable_walk_start sees new table
  rhashtable: allow a walk of the hash table without missing objects.
  rhashtable: support guaranteed successful insertion.
  rhashtable: allow element counting to be disabled.


 include/linux/rhashtable.h |   89 --
 lib/rhashtable.c   |  102 +++-
 2 files changed, 136 insertions(+), 55 deletions(-)

--
Signature



[PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek()

2018-03-26 Thread NeilBrown
The documentation for rhashtable_walk_peek() is wrong.  It claims to
return the *next* entry, whereas it in fact returns the *previous*
entry.
However if no entries have yet been returned - or if the iterator
was reset due to a resize event, then rhashtable_walk_peek()
*does* return the next entry, but also advances the iterator.

I suspect that this interface should be discarded and the one user
should be changed to not require it.  Possibly this patch should be
seen as a first step in that conversation.

This patch mostly corrects the documentation, but does make a
small code change so that the documentation can be correct without
listing too many special cases.  I don't think the one user will
be affected by the code change.

Signed-off-by: NeilBrown 
---
 lib/rhashtable.c |   17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 3825c30aaa36..24a57ca494cb 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -853,13 +853,17 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
 
 /**
- * rhashtable_walk_peek - Return the next object but don't advance the iterator
+ * rhashtable_walk_peek - Return the previously returned object without 
advancing the iterator
  * @iter:  Hash table iterator
  *
- * Returns the next object or NULL when the end of the table is reached.
+ * Returns the last object returned, or NULL if no object has yet been 
returned.
+ * If the previously returned object has since been removed, then some other 
arbitrary
+ * object may be returned, or possibly NULL will be returned.  In that case, the
+ * iterator might be advanced.
  *
  * Returns -EAGAIN if resize event occurred.  Note that the iterator
- * will rewind back to the beginning and you may continue to use it.
+ * will rewind back to the beginning and rhashtable_walk_next() should be
+ * used to get the next object.
  */
 void *rhashtable_walk_peek(struct rhashtable_iter *iter)
 {
@@ -880,7 +884,12 @@ void *rhashtable_walk_peek(struct rhashtable_iter *iter)
 * the table hasn't changed.
 */
iter->skip--;
-   }
+   } else
+   /* ->skip is only zero after rhashtable_walk_start()
+* or when the iterator is reset.  In this case there
+* is no previous object to return.
+*/
+   return NULL;
 
return __rhashtable_walk_find_next(iter);
 }




[PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.

2018-03-26 Thread NeilBrown
When a walk of the hashtable can be done entirely under RCU,
no objects will be missed - though seeing duplicates is possible.
This is because a cursor is kept in iter->p.
Without the cursor we depend on the ->skip counter.  If an object
before the current location in the hash chain is removed, the ->skip
counter will be too large and we could miss a later object.

In many cases where the walker needs to drop out of RCU protection,
it will take a reference to the object and this can prevent it from
being removed from the hash table.  In those cases, the last-returned
object can still be used as a cursor.  rhashtable cannot detect
these cases itself.

This patch adds a new rhashtable_walk_start_continue() interface which
is passed the last object returned.  This can be used if the caller
knows that the object is still in the hash table.  When it is used,
a walk of the hash table will return every object that was in the
hashtable for the duration of the walk, at least once.  This can be
used, for example, to selectively delete objects from the table.
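
A sketch of the intended usage pattern (the object type, params and
helpers are all invented, and it assumes that holding a reference is
enough to keep the object in the table, which is a property of the
caller, not of rhashtable):

	struct my_obj *obj;
	struct rhashtable_iter iter;

	rhashtable_walk_enter(&my_ht, &iter);
	rhashtable_walk_start_continue(&iter, NULL);
	while ((obj = rhashtable_walk_next(&iter)) != NULL) {
		if (IS_ERR(obj))
			continue;	/* -EAGAIN after a rehash; keep going */
		if (my_obj_is_stale(obj))
			rhashtable_remove_fast(&my_ht, &obj->node, my_params);
		else if (need_resched()) {
			my_obj_get(obj);	/* keeps obj in the table */
			rhashtable_walk_stop(&iter);
			cond_resched();
			rhashtable_walk_start_continue(&iter, &obj->node);
			my_obj_put(obj);
		}
	}
	rhashtable_walk_stop(&iter);
	rhashtable_walk_exit(&iter);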

Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |   28 ++--
 lib/rhashtable.c   |   42 ++
 2 files changed, 52 insertions(+), 18 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 3bd19d29f46b..4ffd96949d4f 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -387,11 +387,35 @@ void *rhashtable_insert_slow(struct rhashtable *ht, const 
void *key,
 void rhashtable_walk_enter(struct rhashtable *ht,
   struct rhashtable_iter *iter);
 void rhashtable_walk_exit(struct rhashtable_iter *iter);
-int rhashtable_walk_start_check(struct rhashtable_iter *iter) __acquires(RCU);
+int rhashtable_walk_start_continue(struct rhashtable_iter *iter,
+  struct rhash_head *obj) __acquires(RCU);
+
+/**
+ * rhashtable_walk_start_check - Start a hash table walk
+ * @iter:  Hash table iterator
+ *
+ * Start a hash table walk at the current iterator position.  Note that we take
+ * the RCU lock in all cases including when we return an error.  So you must
+ * always call rhashtable_walk_stop to clean up.
+ *
+ * Returns zero if successful.
+ *
+ * Returns -EAGAIN if resize event occured.  Note that the iterator
+ * will rewind back to the beginning and you may use it immediately
+ * by calling rhashtable_walk_next.
+ *
+ * rhashtable_walk_start is defined as an inline variant that returns
+ * void. This is preferred in cases where the caller would ignore
+ * resize events and always continue.
+ */
+static inline int rhashtable_walk_start_check(struct rhashtable_iter *iter)
+{
+   return rhashtable_walk_start_continue(iter, NULL);
+}
 
 static inline void rhashtable_walk_start(struct rhashtable_iter *iter)
 {
-   (void)rhashtable_walk_start_check(iter);
+   (void)rhashtable_walk_start_continue(iter, NULL);
 }
 
 void *rhashtable_walk_next(struct rhashtable_iter *iter);
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 08018198f045..fd6f320b9704 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -702,30 +702,41 @@ void rhashtable_walk_exit(struct rhashtable_iter *iter)
 EXPORT_SYMBOL_GPL(rhashtable_walk_exit);
 
 /**
- * rhashtable_walk_start_check - Start a hash table walk
- * @iter:  Hash table iterator
+ * rhashtable_walk_start_continue - Restart a hash table walk from last object
+ * @iter:  Hash table iterator
+ * @obj:   pointer to rhash_head in last object returned.
+ *
+ * Restart a hash table walk, ensuring not to miss any objects.  The
+ * previously returned object must still be in the hash table, and must be
+ * provided as an argument.
  *
- * Start a hash table walk at the current iterator position.  Note that we take
- * the RCU lock in all cases including when we return an error.  So you must
- * always call rhashtable_walk_stop to clean up.
+ * When rhashtable_walk_start() or rhashtable_walk_start_check() is used,
+ * a deletion since the previous walk_start can result in objects being missed
+ * as a hash chain might be shorter than expected.  This can be avoided by
+ * using the last returned object as a cursor.
  *
- * Returns zero if successful.
+ * If the @obj passed is NULL, or not the most recently returned object,
+ * rhashtable_walk_start_continue() will act like 
rhashtable_walk_start_check();
  *
- * Returns -EAGAIN if resize event occured.  Note that the iterator
- * will rewind back to the beginning and you may use it immediately
- * by calling rhashtable_walk_next.
+ * Returns -EAGAIN if a resize event was detected.  The iterator will
+ * rewind back to the beginning and can be used immediately.  Seeing duplicates
+ * is possible but missing objects isn't.
+ * Returns zero if no resize event was detected.  This does not guarantee
+ * that no duplicates will be seen.
  *
- * rhashtable_walk_start is defined as an 

[PATCH 3/6] rhashtable: reset iter when rhashtable_walk_start sees new table

2018-03-26 Thread NeilBrown
The documentation claims that when rhashtable_walk_start_check()
detects a resize event, it will rewind back to the beginning
of the table.  This is not true.  We need to set ->slot and
->skip to be zero for it to be true.

Signed-off-by: NeilBrown 
---
 lib/rhashtable.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 24a57ca494cb..08018198f045 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -733,6 +733,8 @@ int rhashtable_walk_start_check(struct rhashtable_iter 
*iter)
 
if (!iter->walker.tbl && !iter->end_of_table) {
iter->walker.tbl = rht_dereference_rcu(ht->tbl, ht);
+   iter->slot = 0;
+   iter->skip = 0;
return -EAGAIN;
}
 




[PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-03-26 Thread NeilBrown
The current rhashtable will fail an insertion if the hashtable
it "too full", one of:
 - table already has 2^31 elements (-E2BIG)
 - a max_size was specified and table already has that
   many elements (rounded up to power of 2) (-E2BIG)
 - a single chain has more than 16 elements (-EBUSY)
 - table has more elements than the current table size,
   and allocating a new table fails (-ENOMEM)
 - a new page needed to be allocated for a nested table,
   and the memory allocation failed (-ENOMEM).

A traditional hash table does not have a concept of "too full", and
insertion only fails if the key already exists.  Many users of hash
tables have separate means of limiting the total number of entries,
and are not susceptible to an attack which could cause unusually large
hash chains.  For those users, the need to check for errors when
inserting objects to an rhashtable is an unnecessary burden and hence
a potential source of bugs (as these failures are likely to be rare).

This patch adds a "never_fail_insert" configuration parameter which
ensures that insertion will only fail if the key already exists.

When this option is in effect:
 - nelems is capped at INT_MAX and will never decrease once it reaches
   that value
 - max_size is largely ignored
 - elements will be added to a table that is nominally "full", though
   a rehash will be scheduled
 - a new table will never be allocated directly by the insert
   function, that is always left for the worker.
   For this to trigger a rehash when long chains are detected (possibly
   still useful) an extra field in the table records if a long chain
   has been seen.  This shares a word with the 'nest' value.  As
   'nest' is never changed once the table is created, updating the
   new ->long_chain without locking cannot cause any corruption.
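
For such a user the caller side becomes trivial - a sketch with invented
names, just to show the intent:

static const struct rhashtable_params my_params = {
	.key_len	   = sizeof(u32),
	.key_offset	   = offsetof(struct my_obj, key),
	.head_offset	   = offsetof(struct my_obj, node),
	.never_fail_insert = true,
};

	/* the only possible failure is now a duplicate key */
	err = rhashtable_lookup_insert_fast(&my_ht, &obj->node, my_params);
	WARN_ON(err && err != -EEXIST);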

Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |   18 +++---
 lib/rhashtable.c   |   27 +++
 2 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 4ffd96949d4f..abdeb1f3f378 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -77,6 +77,7 @@ struct rhlist_head {
  * struct bucket_table - Table of hash buckets
  * @size: Number of hash buckets
  * @nest: Number of bits of first-level nested table.
+ * @long_chain: %true when a chain longer than RHT_ELASTICITY seen.
  * @rehash: Current bucket being rehashed
  * @hash_rnd: Random seed to fold into hash
  * @locks_mask: Mask to apply before accessing locks[]
@@ -89,7 +90,8 @@ struct rhlist_head {
  */
 struct bucket_table {
unsigned intsize;
-   unsigned intnest;
+   unsigned short  nest;
+   boollong_chain;
unsigned intrehash;
u32 hash_rnd;
unsigned intlocks_mask;
@@ -129,6 +131,9 @@ struct rhashtable;
  * @min_size: Minimum size while shrinking
  * @locks_mul: Number of bucket locks to allocate per cpu (default: 32)
  * @automatic_shrinking: Enable automatic shrinking of tables
+ * @never_fail_insert: Insert will always succeed, even if table will become
+ *   unbalanced.  Without this, -E2BIG, -EBUSY, and -ENOMEM are 
possible
+ *   errors from rhashtable_*insert*()
  * @nulls_base: Base value to generate nulls marker
  * @hashfn: Hash function (default: jhash2 if !(key_len % 4), or jhash)
  * @obj_hashfn: Function to hash object
@@ -142,6 +147,7 @@ struct rhashtable_params {
unsigned intmax_size;
u16 min_size;
boolautomatic_shrinking;
+   boolnever_fail_insert;
u8  locks_mul;
u32 nulls_base;
rht_hashfn_thashfn;
@@ -832,7 +838,10 @@ static inline void *__rhashtable_insert_fast(
 
rcu_assign_pointer(*pprev, obj);
 
-   atomic_inc(&ht->nelems);
+   if (params.never_fail_insert)
+   atomic_add_unless(&ht->nelems, 1, INT_MAX);
+   else
+   atomic_inc(&ht->nelems);
if (rht_grow_above_75(ht, tbl))
schedule_work(&ht->run_work);
 
@@ -1104,7 +1113,10 @@ static inline int __rhashtable_remove_fast_one(
spin_unlock_bh(lock);
 
if (err > 0) {
-   atomic_dec(&ht->nelems);
+   if (params.never_fail_insert)
+   atomic_add_unless(&ht->nelems, -1, INT_MAX);
+   else
+   atomic_dec(&ht->nelems);
if (unlikely(ht->p.automatic_shrinking &&
 rht_shrink_below_30(ht, tbl)))
schedule_work(&ht->run_work);
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index fd6f320b9704..427

[PATCH 2/6] rhashtable: remove outdated comments about grow_decision etc

2018-03-26 Thread NeilBrown
grow_decision and shink_decision no longer exist, so remove
the remaining references to them.

Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |   33 ++---
 1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index c9df2527e0cd..3bd19d29f46b 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -834,9 +834,8 @@ static inline void *__rhashtable_insert_fast(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhashtable_insert_fast(
struct rhashtable *ht, struct rhash_head *obj,
@@ -864,9 +863,8 @@ static inline int rhashtable_insert_fast(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhltable_insert_key(
struct rhltable *hlt, const void *key, struct rhlist_head *list,
@@ -888,9 +886,8 @@ static inline int rhltable_insert_key(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhltable_insert(
struct rhltable *hlt, struct rhlist_head *list,
@@ -920,9 +917,8 @@ static inline int rhltable_insert(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhashtable_lookup_insert_fast(
struct rhashtable *ht, struct rhash_head *obj,
@@ -979,9 +975,8 @@ static inline void *rhashtable_lookup_get_insert_fast(
  *
  * Lookups may occur in parallel with hashtable mutations and resizing.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  *
  * Returns zero on success.
  */
@@ -1132,8 +1127,8 @@ static inline int __rhashtable_remove_fast(
  * walk the bucket chain upon removal. The removal operation is thus
  * considerable slow if the hash table is not correctly sized.
  *
- * Will automatically shrink the table via rhashtable_expand() if the
- * shrink_decision function specified at rhashtable_init() returns true.
+ * Will automatically shrink the table if permitted when residency drops
+ * below 30%.
  *
  * Returns zero on success, -ENOENT if the entry could not be found.
  */
@@ -1154,8 +1149,8 @@ static inline int rhashtable_remove_fast(
  * walk the bucket chain upon removal. The removal operation is thus
  * considerable slow if the hash table is not correctly sized.
  *
- * Will automatically shrink the table via rhashtable_expand() if the
- * shrink_decision function specified at rhashtable_init() returns true.
+ * Will automatically shrink the table if permitted when residency drops
+ * below 30%
  *
  * Returns zero on success, -ENOENT if the entry could not be found.
  */




[PATCH 6/6] rhashtable: allow element counting to be disabled.

2018-03-26 Thread NeilBrown
If multiple CPUs are performing concurrent updates, they can
contend on accessing the element counter even when they
don't often contend on hash chains or spin locks.  This can
hurt performance.

The nelems counter is only used to trigger a resize at the
70% and 30% marks, so it does not need to be precise.

It is easy to calculate an approximate value when the table
is being rehashed, and this happens when a chain is found to
be 16 elements long.  So just moving the counting from
"every update" to "every rehash" removes lots of contention,
but has the down-side that it allows the max bucket size
to grow to 16 (so average is probably under 8).  The normal
average is close to 1.

As a rehash can sometimes not see all (or any) elements, such as when
multiple tables are in the table chain, it is only safe to increase
nelems to match the number rehashed, never to decrease it.

If a client wants minimal contention while still maintaining
a shorter chain length, it can run a periodic task which
counts the number of elements and updates ->nelems directly.
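e.g. something like this (sketch only - my_count_elements() stands for
whatever cheap count the owner already maintains, such as summed
per-shard LRU counters):

static void my_nelems_update(struct work_struct *work)
{
	struct my_table *t = container_of(work, struct my_table,
					  count_work.work);
	long n = my_count_elements(t);

	atomic_set(&t->ht.nelems, min_t(long, n, INT_MAX));
	schedule_delayed_work(&t->count_work, 10 * HZ);
}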

Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |   26 ++
 lib/rhashtable.c   |   22 +++---
 2 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index abdeb1f3f378..d0ce5635540f 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -134,6 +134,11 @@ struct rhashtable;
  * @never_fail_insert: Insert will always succeed, even if table will become
  *   unbalanced.  Without this, -E2BIG, -EBUSY, and -ENOMEM are 
possible
  *   errors from rhashtable_*insert*()
+ * @disable_count: Disable precise counting of number of entries.  It is only
+ *   updated approximately when the hash table is resized.
+ *   This reduces contention in parallel updates, but means we only
+ *   grow the table when a hash chain length reaches 16 or when owner
+ *   directly updates ->nelems.
  * @nulls_base: Base value to generate nulls marker
  * @hashfn: Hash function (default: jhash2 if !(key_len % 4), or jhash)
  * @obj_hashfn: Function to hash object
@@ -148,6 +153,7 @@ struct rhashtable_params {
u16 min_size;
boolautomatic_shrinking;
boolnever_fail_insert;
+   booldisable_count;
u8  locks_mul;
u32 nulls_base;
rht_hashfn_thashfn;
@@ -838,10 +844,12 @@ static inline void *__rhashtable_insert_fast(
 
rcu_assign_pointer(*pprev, obj);
 
-   if (params.never_fail_insert)
-   atomic_add_unless(&ht->nelems, 1, INT_MAX);
-   else
-   atomic_inc(&ht->nelems);
+   if (!params.disable_count) {
+   if (params.never_fail_insert)
+   atomic_add_unless(&ht->nelems, 1, INT_MAX);
+   else
+   atomic_inc(&ht->nelems);
+   }
if (rht_grow_above_75(ht, tbl))
schedule_work(&ht->run_work);
 
@@ -1113,10 +1121,12 @@ static inline int __rhashtable_remove_fast_one(
spin_unlock_bh(lock);
 
if (err > 0) {
-   if (params.never_fail_insert)
-   atomic_add_unless(&ht->nelems, -1, INT_MAX);
-   else
-   atomic_dec(&ht->nelems);
+   if (!params.disable_count) {
+   if (params.never_fail_insert)
+   atomic_add_unless(&ht->nelems, -1, INT_MAX);
+   else
+   atomic_dec(&ht->nelems);
+   }
if (unlikely(ht->p.automatic_shrinking &&
 rht_shrink_below_30(ht, tbl)))
schedule_work(&ht->run_work);
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 427836aace60..686193faf271 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -278,12 +278,13 @@ static int rhashtable_rehash_chain(struct rhashtable *ht,
struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
spinlock_t *old_bucket_lock;
int err;
+   int cnt = 0;
 
old_bucket_lock = rht_bucket_lock(old_tbl, old_hash);
 
spin_lock_bh(old_bucket_lock);
while (!(err = rhashtable_rehash_one(ht, old_hash)))
-   ;
+   cnt++;
 
if (err == -ENOENT) {
old_tbl->rehash++;
@@ -291,7 +292,7 @@ static int rhashtable_rehash_chain(struct rhashtable *ht,
}
spin_unlock_bh(old_bucket_lock);
 
-   return err;
+   return err ?: cnt;
 }
 
 static int rhashtable_rehash_attach(struct rhashtable *ht,
@@ -324,6 +325,7 @@ static int rhashtable_rehash_table(struct rhashtable *h

[PATCH] net/sunrpc/xprt_sock: fix regression in connection error reporting.

2017-07-18 Thread NeilBrown

Commit 3d4762639dd3 ("tcp: remove poll() flakes when receiving
RST") in v4.12 changed the order in which ->sk_state_change()
and ->sk_error_report() are called when a socket is shut
down - sk_state_change() is now called first.

This causes xs_tcp_state_change() -> xs_sock_mark_closed() ->
xprt_disconnect_done() to wake all pending tasks with -EAGAIN.
When the ->sk_error_report() callback arrives, it is too late to
pass the error on, and it is lost.

An easy way to demonstrate the problem is to try to start
rpc.nfsd while rpcbind isn't running.
nfsd will attempt a tcp connection to rpcbind.  An ECONNREFUSED
error is returned, but the sunrpc code loses the error and keeps
retrying.  If it saw the ECONNREFUSED, it would abort.

To fix this, handle the sk->sk_err in the TCP_CLOSE branch of
xs_tcp_state_change().

Fixes: 3d4762639dd3 ("tcp: remove poll() flakes when receiving RST")
Cc: sta...@vger.kernel.org (v4.12)
Signed-off-by: NeilBrown 
---
 net/sunrpc/xprtsock.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index d5b54c020dec..4f154d388748 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1624,6 +1624,8 @@ static void xs_tcp_state_change(struct sock *sk)
if (test_and_clear_bit(XPRT_SOCK_CONNECTING,
&transport->sock_state))
xprt_clear_connecting(xprt);
+   if (sk->sk_err)
+   xprt_wake_pending_tasks(xprt, -sk->sk_err);
xs_sock_mark_closed(xprt);
}
  out:
-- 
2.12.2



signature.asc
Description: PGP signature


[PATCH 003 of 3] knfsd: Remove CONFIG_IPV6 ifdefs from sunrpc server code.

2007-03-01 Thread NeilBrown

They don't really save that much, and aren't worth the hassle.

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./include/linux/sunrpc/svc.h |2 --
 ./net/sunrpc/svcsock.c   |   13 +++--
 2 files changed, 3 insertions(+), 12 deletions(-)

diff .prev/include/linux/sunrpc/svc.h ./include/linux/sunrpc/svc.h
--- .prev/include/linux/sunrpc/svc.h2007-03-02 14:20:13.0 +1100
+++ ./include/linux/sunrpc/svc.h2007-03-02 15:14:11.0 +1100
@@ -194,9 +194,7 @@ static inline void svc_putu32(struct kve
 
 union svc_addr_u {
 struct in_addr addr;
-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
 struct in6_addr	addr6;
-#endif
 };
 
 /*

diff .prev/net/sunrpc/svcsock.c ./net/sunrpc/svcsock.c
--- .prev/net/sunrpc/svcsock.c  2007-03-02 15:12:52.0 +1100
+++ ./net/sunrpc/svcsock.c  2007-03-02 15:14:11.0 +1100
@@ -131,13 +131,13 @@ static char *__svc_print_addr(struct soc
NIPQUAD(((struct sockaddr_in *) addr)->sin_addr),
htons(((struct sockaddr_in *) addr)->sin_port));
break;
-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+
case AF_INET6:
snprintf(buf, len, "%x:%x:%x:%x:%x:%x:%x:%x, port=%u",
NIP6(((struct sockaddr_in6 *) addr)->sin6_addr),
htons(((struct sockaddr_in6 *) addr)->sin6_port));
break;
-#endif
+
default:
snprintf(buf, len, "unknown address type: %d", addr->sa_family);
break;
@@ -449,9 +449,7 @@ svc_wake_up(struct svc_serv *serv)
 
 union svc_pktinfo_u {
struct in_pktinfo pkti;
-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
struct in6_pktinfo pkti6;
-#endif
 };
 
 static void svc_set_cmsg_data(struct svc_rqst *rqstp, struct cmsghdr *cmh)
@@ -467,7 +465,7 @@ static void svc_set_cmsg_data(struct svc
cmh->cmsg_len = CMSG_LEN(sizeof(*pki));
}
break;
-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+
case AF_INET6: {
struct in6_pktinfo *pki = CMSG_DATA(cmh);
 
@@ -479,7 +477,6 @@ static void svc_set_cmsg_data(struct svc
cmh->cmsg_len = CMSG_LEN(sizeof(*pki));
}
break;
-#endif
}
return;
 }
@@ -730,13 +727,11 @@ static inline void svc_udp_get_dest_addr
rqstp->rq_daddr.addr.s_addr = pki->ipi_spec_dst.s_addr;
break;
}
-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
case AF_INET6: {
struct in6_pktinfo *pki = CMSG_DATA(cmh);
ipv6_addr_copy(&rqstp->rq_daddr.addr6, &pki->ipi6_addr);
break;
}
-#endif
}
 }
 
@@ -976,11 +971,9 @@ static inline int svc_port_is_privileged
case AF_INET:
return ntohs(((struct sockaddr_in *)sin)->sin_port)
< PROT_SOCK;
-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
case AF_INET6:
return ntohs(((struct sockaddr_in6 *)sin)->sin6_port)
< PROT_SOCK;
-#endif
default:
return 0;
}


[PATCH 002 of 3] knfsd: Avoid checksum checks when collecting metadata for a UDP packet.

2007-03-01 Thread NeilBrown

When recv_msg is called with a size of 0 and MSG_PEEK (as
sunrpc/svcsock.c does), it is clear that we are only interested in
metadata (from/to addresses) and not the data, so don't do any
checksum checking at this point.  Leave that until the data is
requested.
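
For reference, the caller-side pattern this is aimed at (lightly edited from
the svc_udp_recvfrom() change in patch 001 of this series) looks like:

	struct msghdr msg = {
		.msg_name	= svc_addr(rqstp),	/* source address is returned here */
		.msg_control	= cmh,			/* pktinfo cmsg carries the destination */
		.msg_controllen	= sizeof(buffer),
		.msg_flags	= MSG_DONTWAIT,
	};

	/* size == 0 plus MSG_PEEK: only the metadata is wanted; the datagram
	 * itself (and hence its checksum check) is left for the following
	 * skb_recv_datagram() call. */
	err = kernel_recvmsg(svsk->sk_sock, &msg, NULL, 0, 0, MSG_PEEK);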

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./net/ipv4/udp.c |3 +++
 ./net/ipv6/udp.c |4 
 2 files changed, 7 insertions(+)

diff .prev/net/ipv4/udp.c ./net/ipv4/udp.c
--- .prev/net/ipv4/udp.c2007-03-02 14:20:13.0 +1100
+++ ./net/ipv4/udp.c2007-03-02 15:13:50.0 +1100
@@ -846,6 +846,9 @@ try_again:
goto csum_copy_err;
copy_only = 1;
}
+   if (len == 0 &&  (flags & MSG_PEEK))
+   /* avoid checksum concerns when just getting metadata */
+   copy_only = 1;
 
if (copy_only)
err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr),

diff .prev/net/ipv6/udp.c ./net/ipv6/udp.c
--- .prev/net/ipv6/udp.c2007-03-02 14:20:13.0 +1100
+++ ./net/ipv6/udp.c2007-03-02 15:13:50.0 +1100
@@ -151,6 +151,10 @@ try_again:
copy_only = 1;
}
 
+   if (len == 0 &&  (flags & MSG_PEEK))
+   /* avoid checksum concerns when just getting metadata */
+   copy_only = 1;
+
if (copy_only)
err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr),
  msg->msg_iov, copied   );


[PATCH 001 of 3] knfsd: Use recv_msg to get peer address for NFSD instead of code-copying

2007-03-01 Thread NeilBrown

The sunrpc server code needs to know the source and destination address
for UDP packets so it can reply properly. 
It currently copies code out of the network stack to pick the pieces out
of the skb.
This is ugly and causes compile problems with the IPv6 stuff.

So, rip that out and use recv_msg instead.  This is a much cleaner
interface, but has a slight cost in that the checksum is now checked
before the copy, so we don't benefit from doing both at the same time.
This can probably be fixed.
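
For anyone reading this without the surrounding code: the destination
address comes back as an IP_PKTINFO (or IPV6_PKTINFO) control message, which
a socket only generates if the corresponding option has been enabled on it;
that setup is not shown in this patch, so treat it as an assumption here.
Below is a userspace-style sketch of the same pattern, purely illustrative
(the helper name is made up, and this is not the kernel code):

	#define _GNU_SOURCE		/* expose Linux-specific bits in glibc headers */
	#include <netinet/in.h>
	#include <sys/socket.h>

	/* Peek at the local (destination) address of the next queued UDP
	 * datagram without consuming it.  Assumes IP_PKTINFO was enabled
	 * earlier with setsockopt(fd, IPPROTO_IP, IP_PKTINFO, ...). */
	static int peek_dest_addr(int fd, struct in_addr *dst)
	{
		char cbuf[CMSG_SPACE(sizeof(struct in_pktinfo))];
		struct sockaddr_in peer;
		struct msghdr msg = {
			.msg_name	= &peer,
			.msg_namelen	= sizeof(peer),
			.msg_control	= cbuf,
			.msg_controllen	= sizeof(cbuf),
		};
		struct cmsghdr *cmh;

		if (recvmsg(fd, &msg, MSG_PEEK) < 0)
			return -1;
		for (cmh = CMSG_FIRSTHDR(&msg); cmh; cmh = CMSG_NXTHDR(&msg, cmh)) {
			if (cmh->cmsg_level == IPPROTO_IP &&
			    cmh->cmsg_type == IP_PKTINFO) {
				struct in_pktinfo *pki;

				pki = (struct in_pktinfo *)CMSG_DATA(cmh);
				*dst = pki->ipi_spec_dst;  /* same field the patch reads */
				return 0;
			}
		}
		return -1;
	}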


Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./net/sunrpc/svcsock.c |   63 -
 1 file changed, 31 insertions(+), 32 deletions(-)

diff .prev/net/sunrpc/svcsock.c ./net/sunrpc/svcsock.c
--- .prev/net/sunrpc/svcsock.c  2007-03-02 14:20:14.0 +1100
+++ ./net/sunrpc/svcsock.c  2007-03-02 15:12:52.0 +1100
@@ -721,45 +721,23 @@ svc_write_space(struct sock *sk)
}
 }
 
-static void svc_udp_get_sender_address(struct svc_rqst *rqstp,
-   struct sk_buff *skb)
+static inline void svc_udp_get_dest_address(struct svc_rqst *rqstp,
+   struct cmsghdr *cmh)
 {
switch (rqstp->rq_sock->sk_sk->sk_family) {
case AF_INET: {
-   /* this seems to come from net/ipv4/udp.c:udp_recvmsg */
-   struct sockaddr_in *sin = svc_addr_in(rqstp);
-
-   sin->sin_family = AF_INET;
-   sin->sin_port = skb->h.uh->source;
-   sin->sin_addr.s_addr = skb->nh.iph->saddr;
-   rqstp->rq_addrlen = sizeof(struct sockaddr_in);
-   /* Remember which interface received this request */
-   rqstp->rq_daddr.addr.s_addr = skb->nh.iph->daddr;
-   }
+   struct in_pktinfo *pki = CMSG_DATA(cmh);
+   rqstp->rq_daddr.addr.s_addr = pki->ipi_spec_dst.s_addr;
break;
+   }
 #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
case AF_INET6: {
-   /* this is derived from net/ipv6/udp.c:udpv6_recvmesg */
-   struct sockaddr_in6 *sin6 = svc_addr_in6(rqstp);
-
-   sin6->sin6_family = AF_INET6;
-   sin6->sin6_port = skb->h.uh->source;
-   sin6->sin6_flowinfo = 0;
-   sin6->sin6_scope_id = 0;
-   if (ipv6_addr_type(&sin6->sin6_addr) &
-   IPV6_ADDR_LINKLOCAL)
-   sin6->sin6_scope_id = IP6CB(skb)->iif;
-   ipv6_addr_copy(&sin6->sin6_addr,
-   &skb->nh.ipv6h->saddr);
-   rqstp->rq_addrlen = sizeof(struct sockaddr_in);
-   /* Remember which interface received this request */
-   ipv6_addr_copy(&rqstp->rq_daddr.addr6,
-   &skb->nh.ipv6h->saddr);
-   }
+   struct in6_pktinfo *pki = CMSG_DATA(cmh);
+   ipv6_addr_copy(&rqstp->rq_daddr.addr6, &pki->ipi6_addr);
break;
+   }
 #endif
}
-   return;
 }
 
 /*
@@ -771,7 +749,15 @@ svc_udp_recvfrom(struct svc_rqst *rqstp)
struct svc_sock *svsk = rqstp->rq_sock;
struct svc_serv *serv = svsk->sk_server;
struct sk_buff  *skb;
+   char		buffer[CMSG_SPACE(sizeof(union svc_pktinfo_u))];
+   struct cmsghdr *cmh = (struct cmsghdr *)buffer;
int err, len;
+   struct msghdr msg = {
+   .msg_name = svc_addr(rqstp),
+   .msg_control = cmh,
+   .msg_controllen = sizeof(buffer),
+   .msg_flags = MSG_DONTWAIT,
+   };
 
if (test_and_clear_bit(SK_CHNGBUF, &svsk->sk_flags))
/* udp sockets need large rcvbuf as all pending
@@ -797,7 +783,9 @@ svc_udp_recvfrom(struct svc_rqst *rqstp)
}
 
clear_bit(SK_DATA, &svsk->sk_flags);
-   while ((skb = skb_recv_datagram(svsk->sk_sk, 0, 1, &err)) == NULL) {
+   while ((err = kernel_recvmsg(svsk->sk_sock, &msg, NULL,
+ 0, 0, MSG_PEEK)) < 0 ||
+  (skb = skb_recv_datagram(svsk->sk_sk, 0, 1, &err)) == NULL) {
if (err == -EAGAIN) {
svc_sock_received(svsk);
return err;
@@ -805,6 +793,7 @@ svc_udp_recvfrom(struct svc_rqst *rqstp)
/* possibly an icmp error */
dprintk("svc: recvfrom returned error %d\n", -err);
}
+   rqstp->rq_addrlen = sizeof(rqstp->rq_addr);
if (skb->tstamp.off_sec == 0) {
struct timeval tv;
 
@@ -827,7 +816,7 @@ svc_udp_recvfrom(struct svc_rqst *rqstp)
 
rqstp->rq_prot = IPPROTO_UDP;
 
-   svc_udp_get_se

[PATCH 000 of 3] knfsd: Resolve IPv6 related link error

2007-03-01 Thread NeilBrown
Current mainline has a link error if both
  CONFIG_IPV6=m
  CONFIG_SUNRPC=y

because the built-in net/sunrpc/svcsock.c references a function that is only
defined in the IPv6 module (the reference is guarded by
CONFIG_IPV6 || CONFIG_IPV6_MODULE, so it is still compiled in when IPv6 is
built as a module).

These three patches resolve the issue.

The problem arises because svcsock needs to get the source and
destination address for a UDP packet, but doesn't want to just use
sock_recvmsg the way userspace would, as it wants to be able to use the
data directly out of the skbuff rather than copying it (when practical).

Currently it copies code from udp.c (both ipv4/ and ipv6/) and this
causes the problem.

This patch changes it to use kernel_recvmsg with a length of 0 and
flags of MSG_PEEK to get the addresses but leave the data untouched.

A small problem here is that kernel_recvmsg always checks the
checksum, so for a large packet the checksum is verified at a different
time from when the data is copied out into a buffer, which is not ideal.

So the second patch of this series avoids the check when recv_msg is
called with size==0 and flags==MSG_PEEK.  This change should be acked
by someone on netdev before going upstream!!!  The rest of the series
is still appropriate without that patch; it is just a small
optimisation.

Finally the last patch removes all the
  #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
from sunrpc as it really isn't needed and just hides this sort of problem.

Patches 1 and 3 are suitable for 2.6.21.  Patch 2 needs confirmation.

Thanks,
NeilBrown

 [PATCH 001 of 3] knfsd: Use recv_msg to get peer address for NFSD instead of 
code-copying
 [PATCH 002 of 3] knfsd: Avoid checksum checks when collecting metadata for a 
UDP packet.
 [PATCH 003 of 3] knfsd: Remove CONFIG_IPV6 ifdefs from sunrpc server code.