Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-04-05 Thread Herbert Xu
On Fri, Apr 06, 2018 at 01:11:56PM +1000, NeilBrown wrote:
>
> You don't need to handle memory allocation failures at the point where
> you insert into the table - adding to a linked list requires no new
> memory.

You do actually.  The r in rhashtable stands for resizable.  We
cannot completely rely on a background hash table resize because
the insertions can be triggered from softirq context and be without
rate-limits of any kind (e.g., incoming TCP connections).

Therefore at some point you either have to wait for the resize or
fail the insertion.  Since we cannot sleep then failing is the only
option.

> The likelihood of the error isn't really the issue - it either can
> happen or it cannot.  If it can, it requires code to handle it.

Sure, but you just have to handle it the same way you would handle
a memory allocation failure, because that's what it essentially is.

I'm sorry if that means you have to restructure your code to do that,
but that's the price you pay for having a resizable hash table, because
at some point you just have to perform a resize.
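
A minimal sketch of what that looks like on the caller side, assuming a
hypothetical object type and params (only rhashtable_insert_fast() and
struct rhashtable_params are real API here; everything named "my_*" is
made up for illustration):

#include <linux/rhashtable.h>
#include <linux/slab.h>

struct my_obj {
        u32 key;
        struct rhash_head node;
};

static const struct rhashtable_params my_params = {
        .key_len     = sizeof(u32),
        .key_offset  = offsetof(struct my_obj, key),
        .head_offset = offsetof(struct my_obj, node),
};

static int my_add(struct rhashtable *ht, u32 key, gfp_t gfp)
{
        struct my_obj *obj = kmalloc(sizeof(*obj), gfp);
        int err;

        if (!obj)
                return -ENOMEM;
        obj->key = key;

        /* -EBUSY/-E2BIG/-ENOMEM are all handled like an allocation failure */
        err = rhashtable_insert_fast(ht, &obj->node, my_params);
        if (err) {
                kfree(obj);
                return err;
        }
        return 0;
}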

> Ahhh... I see what you are thinking now.  You aren't suggesting a
> sharded count that is summed periodically.  Rather you are suggesting that
> we divide the hash space into N regions each with their own independent
> counter, and resize the table if any one region reaches 70% occupancy -
> on the assumption that the other regions are likely to be close.  Is
> that right?

Yes.

> It would probably be dangerous to allow automatic shrinking (when just
> one drops below 30%) in that case, but it might be OK for growing.

At the greatest granularity it would be a counter per bucket, so
clearly we need to maintain some kind of bound on the granularity
so our sample space is not too small.

Automatic shrinking shouldn't be an issue because that's the slow
kind of resizing that we punt to kthread context and it can afford
to do a careful count.
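
A rough sketch of the idea, entirely hypothetical and not current
rhashtable code: split the hash space into a bounded number of regions,
give each region its own counter, and schedule a grow as soon as any
one region looks more than ~70% full (assuming a working hash spreads
elements evenly, so the other regions should be close):

#include <linux/atomic.h>
#include <linux/workqueue.h>

#define NR_REGIONS      64      /* illustrative bound on the granularity */

struct region_counters {
        atomic_t nelems[NR_REGIONS];
};

static void region_inc(struct region_counters *rc, u32 hash,
                       unsigned int table_size, struct work_struct *grow_work)
{
        unsigned int region = hash & (NR_REGIONS - 1);
        /* assumes table_size >= NR_REGIONS */
        unsigned int per_region = table_size / NR_REGIONS;

        /* grow once this region exceeds ~70% of its share of the buckets */
        if (atomic_inc_return(&rc->nelems[region]) > per_region * 7 / 10)
                schedule_work(grow_work);
}

Shrinking stays with the kthread, which can afford a careful full count
before deciding.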

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-04-05 Thread NeilBrown
On Thu, Mar 29 2018, Herbert Xu wrote:

> On Thu, Mar 29, 2018 at 08:26:21AM +1100, NeilBrown wrote:
>>
>> I say "astronomically unlikely", you say "probability .. is extremely
>> low".  I think we are in agreement here.
>> 
>> The point remains that if an error *can* be returned then I have to
>> write code to handle it and test that code.  I'd rather not.
>
> You have to be able to handle errors anyway because of memory allocation
> failures.  Ultimately if you keep inserting you will eventually
> fail with ENOMEM.  So I don't see the issue with an additional
> error value.

You don't need to handle memory allocation failures at the point where
you insert into the table - adding to a linked list requires no new
memory.

If you are running out of memory, you will fail to allocate new objects,
so you won't be able to add them to the rhashtable.  Obviously you have
to handle that failure, and that is likely to save rhashtable from
getting many mem-alloc failures.

But once you have allocated a new object, rhashtable should just accept
it.  It doesn't help anyone to have the insertion fail.
The insertion can currently fail even when allocating a new object would
succeed, as insertion can require a GFP_ATOMIC allocation to succeed,
while allocating a new object might only need a GFP_KERNEL allocation to
succeed. 

>
>> > Even if it does happen we won't fail because we will perform
>> > an immediate rehash.  We only fail if it happens right away
>> > after the rehash (that is, at least another 16 elements have
>> > been inserted and you're trying to insert a 17th element, all
>> > while the new hash table has not been completely populated),
>> > which means that somebody has figured out our hash secret and
>> > failing in that case makes sense.
>
> BTW, you didn't acknowledge this bit which I think is crucial to
> how likely such an error is.

The likelihood of the error isn't really the issue - it either can
happen or it cannot.  If it can, it requires code to handle it.

>
>> I never suggested retrying, but I would have to handle it somehow.  I'd
>> rather not.
>
> ...
>
>> While I have no doubt that there are hashtables where someone could try
>> to attack the hash, I am quite sure there are others where such an
>> attack is meaningless - any code which could generate the required range of
>> keys could do far worse things more easily.
>
> Our network hashtable has to be secure against adversaries.  I
> understand that this may not be important to your use-case.  However,
> given the fact that the failure would only occur if an adversary
> is present and actively attacking your hash table, I don't think
> it has that much of a negative effect on your use-case either.

I certainly accept that there are circumstances where it is a real
possibility that an adversary might succeed in attacking the hash
function, and changing the seed for each table is an excellent idea.
It isn't clear to me that failing insertions is needed - it is done
rather late and so doesn't throttle much, and could give the attacker
feedback that they had succeeded (?).

Not all hash tables are susceptible to attack.  It would be nice if
developers working in less exposed areas could use rhashtable without ever
having to check for errors.

Error codes are messages from the implementation to the caller.
They should be chosen to help the caller make useful choices, not to
allow the implementation to avoid awkward situations.

>
> Of course, if you can reproduce the EBUSY error without your
> disable_count patch, or if you can still reproduce it even after you
> have fixed the issue I have pointed out in your disable_count patch,
> then that would suggest a real bug and we would need to fix it,
> for everyone.

I suspect that I cannot.  However experience tells me that customers do
things that I cannot and would never expect - they can often trigger
errors that I cannot.  It is best if the error cannot be returned.

>
>> Yes, storing a sharded count in the spinlock table does seem like an
>> appropriate granularity.  However that leads me to ask: why do we have
>> the spinlock table?  Why not bit spinlocks in the hashchain head like
>> include/linux/list_bl uses?
>
> The spinlock table predates rhashtable.  Perhaps Thomas/Eric/Dave
> can elucidate this.

Maybe I'll submit an RFC patch to change it - if I can make it work.

>
>> I don't understand how it can ever be "too late", though I appreciate
>> that in some cases "sooner" is better than "later".
>> If we give up on the single atomic_t counter, then we must accept that
>> the number of elements could exceed any given value.  The only promise
>> we can provide is that it won't exceed N% of the table size for more than
>> T seconds.
>
> Sure.  However, assuming we use an estimate that is hash-based,
> that *should* be fairly accurate assuming that your hash function
> is working in the first place.  It's completely different vs.
> estimating based on a per-cpu count which could be wildly inaccurate.

Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-03-28 Thread Herbert Xu
On Thu, Mar 29, 2018 at 08:26:21AM +1100, NeilBrown wrote:
>
> I say "astronomically unlikely", you say "probability .. is extremely
> low".  I think we are in agreement here.
> 
> The point remains that if an error *can* be returned then I have to
> write code to handle it and test that code.  I'd rather not.

You have to be able to handle errors anyway because of memory allocation
failures.  Ultimately if you keep inserting you will eventually
fail with ENOMEM.  So I don't see the issue with an additional
error value.

> > Even if it does happen we won't fail because we will perform
> > an immediate rehash.  We only fail if it happens right away
> > after the rehash (that is, at least another 16 elements have
> > been inserted and you're trying to insert a 17th element, all
> > while the new hash table has not been completely populated),
> > which means that somebody has figured out our hash secret and
> > failing in that case makes sense.

BTW, you didn't acknowledge this bit which I think is crucial to
how likely such an error is.

> I never suggested retrying, but I would have to handle it somehow.  I'd
> rather not.

...

> While I have no doubt that there are hashtables where someone could try
> to attack the hash, I am quite sure there are others where such an
> attack is meaningless - any code which could generate the required range of
> keys could do far worse things more easily.

Our network hashtable has to be secure against adversaries.  I
understand that this may not be important to your use-case.  However,
given the fact that the failure would only occur if an adversary
is present and actively attacking your hash table, I don't think
it has that much of a negative effect on your use-case either.

Of course, if you can reproduce the EBUSY error without your
disable_count patch, or if you can still reproduce it even after you
have fixed the issue I have pointed out in your disable_count patch,
then that would suggest a real bug and we would need to fix it,
for everyone.

> Yes, storing a sharded count in the spinlock table does seem like an
> appropriate granularity.  However that leads me to ask: why do we have
> the spinlock table?  Why not bit spinlocks in the hashchain head like
> include/linux/list_bl uses?

The spinlock table predates rhashtable.  Perhaps Thomas/Eric/Dave
can elucidate this.

> I don't understand how it can ever be "too late", though I appreciate
> that in some cases "sooner" is better than "later".
> If we give up on the single atomic_t counter, then we must accept that
> the number of elements could exceed any given value.  The only promise
> we can provide is that it won't exceed N% of the table size for more than
> T seconds.

Sure.  However, assuming we use an estimate that is hash-based,
that *should* be fairly accurate assuming that your hash function
is working in the first place.  It's completely different vs.
estimating based on a per-cpu count which could be wildly inaccurate.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-03-28 Thread NeilBrown
On Wed, Mar 28 2018, Herbert Xu wrote:

> On Wed, Mar 28, 2018 at 06:04:40PM +1100, NeilBrown wrote:
>>
>> I disagree.  My patch 6 only makes it common instead of exceedingly
>> rare.  If any table in the list other than the first has a chain with 16
>> elements, then trying to insert an element with a hash which matches
>> that chain will fail with -EBUSY.  This is theoretically possible
>> already, though astronomically unlikely.  So that case will never be
>> tested for.
>
> No that's not true.  If the table is correctly sized then the
> probability of having a chain with 16 elements is extremely low.

I say "astronomically unlikely", you say "probability .. is extremely
low".  I think we are in agreement here.

The point remains that if an error *can* be returned then I have to
write code to handle it and test that code.  I'd rather not.

>
> Even if it does happen we won't fail because we will perform
> an immediate rehash.  We only fail if it happens right away
> after the rehash (that is, at least another 16 elements have
> been inserted and you're trying to insert a 17th element, all
> while the new hash table has not been completely populated),
> which means that somebody has figured out our hash secret and
> failing in that case makes sense.
>
>> It is hard to know if it is necessary.  And making the new table larger
>> will make the error less likely, but still won't make it impossible.  So
>> callers will have to handle it - just like they currently have to handle
>> -ENOMEM even though it is highly unlikely (and not strictly necessary).
>
> Callers should not handle an ENOMEM error by retrying.  Nor should
> they retry an EBUSY return value.

I never suggested retrying, but I would have to handle it somehow.  I'd
rather not.

>
>> Are these errors ever actually useful?  I thought I had convinced myself
>> before that they were (to throttle attacks on the hash function), but
>> they happen even less often than I thought.
>
> The EBUSY error indicates that the hash table has essentially
> degenerated into a linked list because somebody has worked out
> our hash secret.

While I have no doubt that there are hashtables where someone could try
to attack the hash, I am quite sure there are others where such an
attack is meaningless - any code which could generate the required range of
keys could do far worse things more easily.

>
>> Maybe. Reading a percpu counter isn't cheap.  Reading it whenever a hash
>> chain reaches 16 is reasonable, but I think we would want to read it a
>> lot more often than that.  So probably store the last-sampled time (with
>> no locking) and only sample the counter if last-sampled is more than
>>  jiffies - 10*HZ (???)
>
> We could also take the spinlock table approach and have a counter
> per bucket spinlock.  This should be sufficient as you'll contend
> on the bucket spinlock table anyway.

Yes, storing a sharded count in the spinlock table does seem like an
appropriate granularity.  However that leads me to ask: why do we have
the spinlock table?  Why not bit spinlocks in the hashchain head like
include/linux/list_bl uses?
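
For reference, this is roughly what the list_bl approach looks like:
hlist_bl_lock() takes a bit spinlock on bit 0 of the bucket head
pointer, so no separate lock table is needed.  This is a sketch of how
list_bl is used elsewhere in the kernel, not what rhashtable does
today, and the entry type is made up:

#include <linux/list_bl.h>

struct my_entry {
        u32 key;
        struct hlist_bl_node node;
};

static void bucket_add(struct hlist_bl_head *bucket, struct my_entry *e)
{
        hlist_bl_lock(bucket);          /* bit_spin_lock on bit 0 of ->first */
        hlist_bl_add_head(&e->node, bucket);
        hlist_bl_unlock(bucket);
}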

>
> This also allows us to estimate the total table size and not have
> to always do a last-ditch growth when it's too late.

I don't understand how it can ever be "too late", though I appreciate
that in some cases "sooner" is better than "later".
If we give up on the single atomic_t counter, then we must accept that
the number of elements could exceed any given value.  The only promise
we can provide is that it won't exceed N% of the table size for more than
T seconds.

Thanks,
NeilBrown


>
> Cheers,
> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt




Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-03-28 Thread Herbert Xu
On Wed, Mar 28, 2018 at 06:04:40PM +1100, NeilBrown wrote:
>
> I disagree.  My patch 6 only makes it common instead of exceedingly
> rare.  If any table in the list other than the first has a chain with 16
> elements, then trying to insert an element with a hash which matches
> that chain will fail with -EBUSY.  This is theoretically possible
> already, though astronomically unlikely.  So that case will never be
> tested for.

No that's not true.  If the table is correctly sized then the
probability of having a chain with 16 elements is extremely low.

Even if it does happen we won't fail because we will perform
an immediate rehash.  We only fail if it happens right away
after the rehash (that is, at least another 16 elements have
been inserted and you're trying to insert a 17th element, all
while the new hash table has not been completely populated),
which means that somebody has figured out our hash secret and
failing in that case makes sense.

> It is hard to know if it is necessary.  And making the new table larger
> will make the error less likely, but still won't make it impossible.  So
> callers will have to handle it - just like they currently have to handle
> -ENOMEM even though it is highly unlikely (and not strictly necessary).

Callers should not handle an ENOMEM error by retrying.  Nor should
they retry an EBUSY return value.

> Are these errors ever actually useful?  I thought I had convinced myself
> before that they were (to throttle attacks on the hash function), but
> they happen even less often than I thought.

The EBUSY error indicates that the hash table has essentially
degenerated into a linked list because somebody has worked out
our hash secret.

> Maybe. Reading a percpu counter isn't cheap.  Reading it whenever a hash
> chain reaches 16 is reasonable, but I think we would want to read it a
> lot more often than that.  So probably store the last-sampled time (with
> no locking) and only sample the counter if last-sampled is more than
>  jiffies - 10*HZ (???)

We could also take the spinlock table approach and have a counter
per bucket spinlock.  This should be sufficient as you'll contend
on the bucket spinlock table anyway.

This also allows us to estimate the total table size and not have
to always do a last-ditch growth when it's too late.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-03-28 Thread NeilBrown
On Wed, Mar 28 2018, Herbert Xu wrote:

> On Wed, Mar 28, 2018 at 08:34:19AM +1100, NeilBrown wrote:
>>
>> It is easy to get an -EBUSY insertion failure when .disable_count is
>> enabled, and I did get that.  Blindly propagating that up caused lustre
>> to get terribly confused - not too surprising really.
>
> Right, so this failure mode is specific to your patch 6.

I disagree.  My patch 6 only makes it common instead of exceedingly
rare.  If any table in the list other than the first has a chain with 16
elements, then trying to insert an element with a hash which matches
that chain will fail with -EBUSY.  This is theoretically possible
already, though astronomically unlikely.  So that case will never be
tested for.

>
> I think I see the problem.  As it currently stands, we never
> grow the table when we hit the bucket length limit.  We only do
> rehashes.
>
> With your patch, you must change it so that we actually try to
> grow the table if necessary when we hit the bucket length limit.

It is hard to know if it is necessary.  And making the new table larger
will make the error less likely, but still won't make it impossible.  So
callers will have to handle it - just like they currently have to handle
-ENOMEM even though it is highly unlikely (and not strictly necessary).

Are these errors ever actually useful?  I thought I had convinced myself
before that they were (to throttle attacks on the hash function), but
they happen even less often than I thought.

>
> Otherwise it will result in the EBUSY that you're seeing.
>
> I also think that we shouldn't make this a toggle.  If we're going
> to do disable_count then it should be unconditionally done for
> everyone.
>
> However, I personally would prefer a percpu elem count instead of
> disabling it completely.  Would that be acceptable to your use-case?

Maybe. Reading a percpu counter isn't cheap.  Reading it whenever a hash
chain reaches 16 is reasonable, but I think we would want to read it a
lot more often than that.  So probably store the last-sampled time (with
no locking) and only sample the counter if last-sampled is more than
 jiffies - 10*HZ (???)
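
Something along these lines, purely as a sketch with made-up names
(percpu_counter_sum() and time_after() are the real APIs; the counter
is assumed to have been set up with percpu_counter_init()):

#include <linux/percpu_counter.h>
#include <linux/jiffies.h>

struct sampled_count {
        struct percpu_counter nelems;
        unsigned long last_sample;      /* jiffies, read/written without locking */
        s64 last_value;
};

static s64 sampled_count_read(struct sampled_count *sc)
{
        /* only pay for a full sum roughly every 10 seconds */
        if (time_after(jiffies, sc->last_sample + 10 * HZ)) {
                sc->last_value = percpu_counter_sum(&sc->nelems);
                sc->last_sample = jiffies;
        }
        return sc->last_value;
}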

In the case in lustre we also shard the LRU list so that adding to the
LRU causes minimal contention. Keeping a sharded counter together with
the LRU is trivial and summing them periodically is little burden.
Maybe it makes sense to include that functionality in rhashtable so
that it is there for everyone.

A percpu counter uses a lot more memory than atomic_t.  Given that some
callers set nelem_hint to 2 or 3, it seems likely that those callers
don't want to waste memory.  Should we force them to?

Thanks,
NeilBrown




Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-03-28 Thread Herbert Xu
On Wed, Mar 28, 2018 at 08:34:19AM +1100, NeilBrown wrote:
>
> It is easy to get an -EBUSY insertion failure when .disable_count is
> enabled, and I did get that.  Blindly propagating that up caused lustre
> to get terribly confused - not too surprising really.

Right, so this failure mode is specific to your patch 6.

I think I see the problem.  As it currently stands, we never
grow the table when we hit the bucket length limit.  We only do
rehashes.

With your patch, you must change it so that we actually try to
grow the table if necessary when we hit the bucket length limit.

Otherwise it will result in the EBUSY that you're seeing.

I also think that we shouldn't make this a toggle.  If we're going
to do disable_count then it should be unconditionally done for
everyone.

However, I personally would prefer a percpu elem count instead of
disabling it completely.  Would that be acceptable to your use-case?

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-03-27 Thread NeilBrown
On Tue, Mar 27 2018, Herbert Xu wrote:

> On Tue, Mar 27, 2018 at 10:33:04AM +1100, NeilBrown wrote:
>> The current rhashtable will fail an insertion if the hashtable
>> is "too full", one of:
>>  - table already has 2^31 elements (-E2BIG)
>>  - a max_size was specified and table already has that
>>    many elements (rounded up to power of 2) (-E2BIG)
>>  - a single chain has more than 16 elements (-EBUSY)
>>  - table has more elements than the current table size,
>>    and allocating a new table fails (-ENOMEM)
>>  - a new page needed to be allocated for a nested table,
>>    and the memory allocation failed (-ENOMEM).
>> 
>> A traditional hash table does not have a concept of "too full", and
>> insertion only fails if the key already exists.  Many users of hash
>> tables have separate means of limiting the total number of entries,
>> and are not susceptible to an attack which could cause unusually large
>> hash chains.  For those users, the need to check for errors when
>> inserting objects to an rhashtable is an unnecessary burden and hence
>> a potential source of bugs (as these failures are likely to be rare).
>
> Did you actually encounter an insertion failure? The current code
> should never fail an insertion until you actually run out of memory.
> That is unless you're using rhashtable when you should be using
> rhlist instead.

It is easy to get an -EBUSY insertion failure when .disable_count is
enabled, and I did get that.  Blindly propagating that up caused lustre
to get terribly confused - not too surprising really.

Even if I didn't see errors in practice, if the interface can return an
error, then I need to check for the error and really should test that
my handling of each error works correctly.  It is much easier to write
reliable code when errors cannot happen, so I'd rather have that option.

Thanks,
NeilBrown




Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-03-27 Thread Herbert Xu
On Tue, Mar 27, 2018 at 10:33:04AM +1100, NeilBrown wrote:
> The current rhashtable will fail an insertion if the hashtable
> is "too full", one of:
>  - table already has 2^31 elements (-E2BIG)
>  - a max_size was specified and table already has that
>    many elements (rounded up to power of 2) (-E2BIG)
>  - a single chain has more than 16 elements (-EBUSY)
>  - table has more elements than the current table size,
>    and allocating a new table fails (-ENOMEM)
>  - a new page needed to be allocated for a nested table,
>    and the memory allocation failed (-ENOMEM).
> 
> A traditional hash table does not have a concept of "too full", and
> insertion only fails if the key already exists.  Many users of hash
> tables have separate means of limiting the total number of entries,
> and are not susceptible to an attack which could cause unusually large
> hash chains.  For those users, the need to check for errors when
> inserting objects to an rhashtable is an unnecessary burden and hence
> a potential source of bugs (as these failures are likely to be rare).

Did you actually encounter an insertion failure? The current code
should never fail an insertion until you actually run out of memory.
That is unless you're using rhashtable when you should be using
rhlist instead.
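
The limits in the quoted list above are mostly configured through
struct rhashtable_params at init time; a minimal hedged sketch, with a
hypothetical object type (the params fields and rhashtable_init() are
the real API):

#include <linux/rhashtable.h>

struct my_obj {
        u32 key;
        struct rhash_head node;
};

static const struct rhashtable_params my_params = {
        .key_len     = sizeof(u32),
        .key_offset  = offsetof(struct my_obj, key),
        .head_offset = offsetof(struct my_obj, node),
        .nelem_hint  = 64,              /* initial sizing hint */
        .max_size    = 1 << 16,         /* past this, inserts fail with -E2BIG */
        .automatic_shrinking = true,
};

static int my_table_init(struct rhashtable *ht)
{
        return rhashtable_init(ht, &my_params);
}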

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt