Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
On Fri, Apr 06, 2018 at 01:11:56PM +1000, NeilBrown wrote:
>
> You don't need to handle memory allocation failures at the point where
> you insert into the table - adding to a linked list requires no new
> memory.

You do actually.  The r in rhashtable stands for resizable.  We cannot
rely entirely on a background hash table resize because insertions can
be triggered from softirq context without rate limits of any kind
(e.g., incoming TCP connections).  Therefore at some point you either
have to wait for the resize or fail the insertion.  Since we cannot
sleep, failing is the only option.

> The likelihood of the error isn't really the issue - it either can
> happen or it cannot.  If it can, it requires code to handle it.

Sure, but you just have to handle it the same way you would handle a
memory allocation failure, because that's what it essentially is.  I'm
sorry if that means you have to restructure your code, but that's the
price of a resizable hash table: at some point you simply have to
perform a resize.

> Ahhh... I see what you are thinking now.  You aren't suggesting a
> sharded count that is summed periodically.  Rather you are suggesting
> that we divide the hash space into N regions each with their own
> independent counter, and resize the table if any one region reaches
> 70% occupancy - on the assumption that the other regions are likely
> to be close.  Is that right?

Yes.

> It would probably be dangerous to allow automatic shrinking (when just
> one drops below 30%) in that case, but it might be OK for growing.

At the finest granularity it would be a counter per bucket, so clearly
we need to maintain some bound on the granularity so that our sample
space is not too small.  Automatic shrinking shouldn't be an issue
because that's the slow kind of resizing that we punt to kthread
context, and it can afford to do a careful count.
Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
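The region-counter scheme discussed above (divide the hash space into N
regions, each with its own counter, and grow as soon as any one region
passes 70% of its share of the table) could be sketched like this.  This
is an illustrative userspace approximation, not the kernel code; all
names and constants are made up for the example:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_REGIONS  8           /* counter granularity */
#define TABLE_SIZE  1024        /* current number of buckets */

static atomic_int region_count[NR_REGIONS];

/* Map a hash to its region; any stable partition of the hash space
 * works (e.g. the top bits). */
static int hash_region(unsigned int hash)
{
    return hash % NR_REGIONS;
}

/* Account one insertion.  Returns true when the insert should kick
 * off a table grow: this region now holds more than 70% of its
 * share of the buckets, and the other regions are assumed close. */
static bool count_insert(unsigned int hash)
{
    int region = hash_region(hash);
    int n = atomic_fetch_add(&region_count[region], 1) + 1;

    return n > (TABLE_SIZE / NR_REGIONS) * 7 / 10;
}
```

With 1024 buckets and 8 regions each region's share is 128 buckets, so
the grow trigger fires on the 90th element landing in one region - long
before a single global counter would have to be consulted.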
Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
On Thu, Mar 29 2018, Herbert Xu wrote:

> On Thu, Mar 29, 2018 at 08:26:21AM +1100, NeilBrown wrote:
>>
>> I say "astronomically unlikely", you say "probability .. is extremely
>> low".  I think we are in agreement here.
>>
>> The point remains that if an error *can* be returned then I have to
>> write code to handle it and test that code.  I'd rather not.
>
> You have to be able to handle errors anyway because of memory
> allocation failures.  Ultimately if you keep inserting you will
> eventually fail with ENOMEM.  So I don't see the issue with an
> additional error value.

You don't need to handle memory allocation failures at the point where
you insert into the table - adding to a linked list requires no new
memory.

If you are running out of memory, you will fail to allocate new
objects, so you won't be able to add them to the rhashtable.  Obviously
you have to handle that failure, and that is likely to save rhashtable
from getting many mem-alloc failures.  But once you have allocated a
new object, rhashtable should just accept it.  It doesn't help anyone
to have the insertion fail.

The insertion can currently fail even when allocating a new object
would succeed, as insertion can require a GFP_ATOMIC allocation to
succeed, while allocating a new object might only need a GFP_KERNEL
allocation to succeed.

>
>> > Even if it does happen we won't fail because we will perform
>> > an immediate rehash.  We only fail if it happens right away
>> > after the rehash (that is, at least another 16 elements have
>> > been inserted and you're trying to insert a 17th element, all
>> > while the new hash table has not been completely populated),
>> > which means that somebody has figured out our hash secret and
>> > failing in that case makes sense.
>
> BTW, you didn't acknowledge this bit which I think is crucial to
> how likely such an error is.

The likelihood of the error isn't really the issue - it either can
happen or it cannot.  If it can, it requires code to handle it.
>
>> I never suggested retrying, but I would have to handle it somehow.
>> I'd rather not.
>
> ...
>
>> While I have no doubt that there are hashtables where someone could
>> try to attack the hash, I am quite sure there are others where such
>> an attack is meaningless - any code which could generate the required
>> range of keys could do far worse things more easily.
>
> Our network hashtable has to be secure against adversaries.  I
> understand that this may not be important to your use-case.  However,
> given the fact that the failure would only occur if an adversary
> is present and actively attacking your hash table, I don't think
> it has that much of a negative effect on your use-case either.

I certainly accept that there are circumstances where it is a real
possibility that an adversary might succeed in attacking the hash
function, and changing the seed for each table is an excellent idea.
It isn't clear to me that failing insertions is needed - it is done
rather late and so doesn't throttle much, and could give the attacker
feedback that they had succeeded (?).

Not all hash tables are susceptible to attack.  It would be nice if
developers working in less exposed areas could use rhashtable without
ever having to check for errors.

Error codes are messages from the implementation to the caller.  They
should be chosen to help the caller make useful choices, not to allow
the implementation to avoid awkward situations.

>
> Of course if you can reproduce the EBUSY error without your
> disable_count patch, or even after you have fixed the issue I have
> pointed out in your disable_count patch you can still reproduce it,
> then that would suggest a real bug and we would need to fix it,
> for everyone.

I suspect that I cannot.  However, experience tells me that customers
do things that I cannot and would never expect - they can often
trigger errors that I cannot.  It is best if the error cannot be
returned.
>
>> Yes, storing a sharded count in the spinlock table does seem like an
>> appropriate granularity.  However that leads me to ask: why do we
>> have the spinlock table?  Why not bit spinlocks in the hashchain
>> head like include/linux/list_bl uses?
>
> The spinlock table predates rhashtable.  Perhaps Thomas/Eric/Dave
> can elucidate this.

Maybe I'll submit an RFC patch to change it - if I can make it work.

>
>> I don't understand how it can ever be "too late", though I
>> appreciate that in some cases "sooner" is better than "later".
>> If we give up on the single atomic_t counter, then we must accept
>> that the number of elements could exceed any given value.  The only
>> promise we can provide is that it won't exceed N% of the table size
>> for more than T seconds.
>
> Sure.  However, assuming we use an estimate that is hash-based, that
> *should* be fairly accurate assuming that your hash function is
> working in the first place.  It's completely different vs. estimating
> based on a per-cpu count, which could be wildly inaccurate.
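Herbert's position earlier in the thread - that an insertion failure
should be handled the same way as an allocation failure - amounts to
caller code shaped roughly like the following sketch.  The `conn`
structure and function names are hypothetical stand-ins, and the table
insert is stubbed so the failure path can be exercised:

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct conn {
    int key;
};

/* Stub for a hash-table insert; fails when the (imaginary) table is
 * refusing insertions, as a resizable table occasionally must. */
static int table_insert(struct conn *c, int table_full)
{
    (void)c;
    return table_full ? -EBUSY : 0;
}

/* Allocate an object and insert it.  An insert failure is handled on
 * exactly the same path as an allocation failure: free and report. */
static int conn_add(int key, int table_full, struct conn **out)
{
    struct conn *c = malloc(sizeof(*c));
    int err;

    *out = NULL;
    if (!c)
        return -ENOMEM;
    c->key = key;

    err = table_insert(c, table_full);
    if (err) {
        free(c);        /* same cleanup as for an allocation failure */
        return err;
    }
    *out = c;
    return 0;
}
```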
Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
On Thu, Mar 29, 2018 at 08:26:21AM +1100, NeilBrown wrote:
>
> I say "astronomically unlikely", you say "probability .. is extremely
> low".  I think we are in agreement here.
>
> The point remains that if an error *can* be returned then I have to
> write code to handle it and test that code.  I'd rather not.

You have to be able to handle errors anyway because of memory
allocation failures.  Ultimately if you keep inserting you will
eventually fail with ENOMEM.  So I don't see the issue with an
additional error value.

> > Even if it does happen we won't fail because we will perform
> > an immediate rehash.  We only fail if it happens right away
> > after the rehash (that is, at least another 16 elements have
> > been inserted and you're trying to insert a 17th element, all
> > while the new hash table has not been completely populated),
> > which means that somebody has figured out our hash secret and
> > failing in that case makes sense.

BTW, you didn't acknowledge this bit, which I think is crucial to how
likely such an error is.

> I never suggested retrying, but I would have to handle it somehow.
> I'd rather not.

...

> While I have no doubt that there are hashtables where someone could
> try to attack the hash, I am quite sure there are others where such
> an attack is meaningless - any code which could generate the required
> range of keys could do far worse things more easily.

Our network hashtable has to be secure against adversaries.  I
understand that this may not be important to your use-case.  However,
given the fact that the failure would only occur if an adversary is
present and actively attacking your hash table, I don't think it has
that much of a negative effect on your use-case either.

Of course if you can reproduce the EBUSY error without your
disable_count patch, or even after you have fixed the issue I have
pointed out in your disable_count patch you can still reproduce it,
then that would suggest a real bug and we would need to fix it, for
everyone.
> Yes, storing a sharded count in the spinlock table does seem like an
> appropriate granularity.  However that leads me to ask: why do we
> have the spinlock table?  Why not bit spinlocks in the hashchain head
> like include/linux/list_bl uses?

The spinlock table predates rhashtable.  Perhaps Thomas/Eric/Dave can
elucidate this.

> I don't understand how it can ever be "too late", though I appreciate
> that in some cases "sooner" is better than "later".
> If we give up on the single atomic_t counter, then we must accept
> that the number of elements could exceed any given value.  The only
> promise we can provide is that it won't exceed N% of the table size
> for more than T seconds.

Sure.  However, assuming we use an estimate that is hash-based, that
*should* be fairly accurate assuming that your hash function is
working in the first place.  It's completely different vs. estimating
based on a per-cpu count, which could be wildly inaccurate.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
On Wed, Mar 28 2018, Herbert Xu wrote:

> On Wed, Mar 28, 2018 at 06:04:40PM +1100, NeilBrown wrote:
>>
>> I disagree.  My patch 6 only makes it common instead of exceedingly
>> rare.  If any table in the list other than the first has a chain
>> with 16 elements, then trying to insert an element with a hash which
>> matches that chain will fail with -EBUSY.  This is theoretically
>> possible already, though astronomically unlikely.  So that case will
>> never be tested for.
>
> No, that's not true.  If the table is correctly sized then the
> probability of having a chain with 16 elements is extremely low.

I say "astronomically unlikely", you say "probability .. is extremely
low".  I think we are in agreement here.

The point remains that if an error *can* be returned then I have to
write code to handle it and test that code.  I'd rather not.

>
> Even if it does happen we won't fail because we will perform
> an immediate rehash.  We only fail if it happens right away
> after the rehash (that is, at least another 16 elements have
> been inserted and you're trying to insert a 17th element, all
> while the new hash table has not been completely populated),
> which means that somebody has figured out our hash secret and
> failing in that case makes sense.
>
>> It is hard to know if it is necessary.  And making the new table
>> larger will make the error less likely, but still won't make it
>> impossible.  So callers will have to handle it - just like they
>> currently have to handle -ENOMEM even though it is highly unlikely
>> (and not strictly necessary).
>
> Callers should not handle an ENOMEM error by retrying.  Nor should
> they retry an EBUSY return value.

I never suggested retrying, but I would have to handle it somehow.
I'd rather not.

>
>> Are these errors ever actually useful?  I thought I had convinced
>> myself before that they were (to throttle attacks on the hash
>> function), but they happen even less often than I thought.
>
> The EBUSY error indicates that the hash table has essentially
> degenerated into a linked list because somebody has worked out
> our hash secret.

While I have no doubt that there are hashtables where someone could
try to attack the hash, I am quite sure there are others where such an
attack is meaningless - any code which could generate the required
range of keys could do far worse things more easily.

>
>> Maybe.  Reading a percpu counter isn't cheap.  Reading it whenever
>> a hash chain reaches 16 is reasonable, but I think we would want to
>> read it a lot more often than that.  So probably store the
>> last-sampled time (with no locking) and only sample the counter if
>> last-sampled is more than jiffies - 10*HZ (???)
>
> We could also take the spinlock table approach and have a counter
> per bucket spinlock.  This should be sufficient as you'll contend
> on the bucket spinlock table anyway.

Yes, storing a sharded count in the spinlock table does seem like an
appropriate granularity.  However that leads me to ask: why do we have
the spinlock table?  Why not bit spinlocks in the hashchain head like
include/linux/list_bl uses?

>
> This also allows us to estimate the total table size and not have
> to always do a last-ditch growth when it's too late.

I don't understand how it can ever be "too late", though I appreciate
that in some cases "sooner" is better than "later".
If we give up on the single atomic_t counter, then we must accept that
the number of elements could exceed any given value.  The only promise
we can provide is that it won't exceed N% of the table size for more
than T seconds.

Thanks,
NeilBrown

>
> Cheers,
> --
> Email: Herbert Xu
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
On Wed, Mar 28, 2018 at 06:04:40PM +1100, NeilBrown wrote:
>
> I disagree.  My patch 6 only makes it common instead of exceedingly
> rare.  If any table in the list other than the first has a chain with
> 16 elements, then trying to insert an element with a hash which
> matches that chain will fail with -EBUSY.  This is theoretically
> possible already, though astronomically unlikely.  So that case will
> never be tested for.

No, that's not true.  If the table is correctly sized then the
probability of having a chain with 16 elements is extremely low.

Even if it does happen we won't fail because we will perform an
immediate rehash.  We only fail if it happens right away after the
rehash (that is, at least another 16 elements have been inserted and
you're trying to insert a 17th element, all while the new hash table
has not been completely populated), which means that somebody has
figured out our hash secret and failing in that case makes sense.

> It is hard to know if it is necessary.  And making the new table
> larger will make the error less likely, but still won't make it
> impossible.  So callers will have to handle it - just like they
> currently have to handle -ENOMEM even though it is highly unlikely
> (and not strictly necessary).

Callers should not handle an ENOMEM error by retrying.  Nor should
they retry an EBUSY return value.

> Are these errors ever actually useful?  I thought I had convinced
> myself before that they were (to throttle attacks on the hash
> function), but they happen even less often than I thought.

The EBUSY error indicates that the hash table has essentially
degenerated into a linked list because somebody has worked out our
hash secret.

> Maybe.  Reading a percpu counter isn't cheap.  Reading it whenever a
> hash chain reaches 16 is reasonable, but I think we would want to
> read it a lot more often than that.  So probably store the
> last-sampled time (with no locking) and only sample the counter if
> last-sampled is more than jiffies - 10*HZ (???)
We could also take the spinlock table approach and have a counter per
bucket spinlock.  This should be sufficient as you'll contend on the
bucket spinlock table anyway.

This also allows us to estimate the total table size and not have to
always do a last-ditch growth when it's too late.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
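The counter-per-bucket-spinlock idea above can be illustrated with a
small userspace approximation: the element count is sharded across the
same lock stripes that already serialize bucket access, so updating it
adds no new contention point.  In the kernel the counter would sit next
to (and be protected by) each bucket spinlock; plain C11 atomics stand
in for that here, and all names and the stripe count are made up:

```c
#include <stdatomic.h>

#define LOCK_STRIPES 64         /* size of the (imaginary) lock table */

/* One count per lock stripe; in-kernel this would live in the bucket
 * spinlock table and be updated under the bucket lock. */
static atomic_long stripe_count[LOCK_STRIPES];

/* Account an insertion under the element's hash. */
static void insert_count(unsigned int hash)
{
    atomic_fetch_add(&stripe_count[hash % LOCK_STRIPES], 1);
}

/* Estimate the total element count by summing the shards: approximate
 * while inserts run concurrently, exact once the table is quiescent.
 * This is what lets the table grow early instead of at the last ditch. */
static long estimate_total(void)
{
    long sum = 0;

    for (int i = 0; i < LOCK_STRIPES; i++)
        sum += atomic_load(&stripe_count[i]);
    return sum;
}
```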
Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
On Wed, Mar 28 2018, Herbert Xu wrote:

> On Wed, Mar 28, 2018 at 08:34:19AM +1100, NeilBrown wrote:
>>
>> It is easy to get an -EBUSY insertion failure when .disable_count is
>> enabled, and I did get that.  Blindly propagating that up caused
>> lustre to get terribly confused - not too surprising really.
>
> Right, so this failure mode is specific to your patch 6.

I disagree.  My patch 6 only makes it common instead of exceedingly
rare.  If any table in the list other than the first has a chain with
16 elements, then trying to insert an element with a hash which
matches that chain will fail with -EBUSY.  This is theoretically
possible already, though astronomically unlikely.  So that case will
never be tested for.

>
> I think I see the problem.  As it currently stands, we never
> need growing when we hit the bucket length limit.  We only do
> rehashes.
>
> With your patch, you must change it so that we actually try to
> grow the table if necessary when we hit the bucket length limit.

It is hard to know if it is necessary.  And making the new table
larger will make the error less likely, but still won't make it
impossible.  So callers will have to handle it - just like they
currently have to handle -ENOMEM even though it is highly unlikely
(and not strictly necessary).

Are these errors ever actually useful?  I thought I had convinced
myself before that they were (to throttle attacks on the hash
function), but they happen even less often than I thought.

>
> Otherwise it will result in the EBUSY that you're seeing.
>
> I also think that we shouldn't make this a toggle.  If we're going
> to do disable_count then it should be unconditionally done for
> everyone.
>
> However, I personally would prefer a percpu elem count instead of
> disabling it completely.  Would that be acceptable to your use-case?

Maybe.  Reading a percpu counter isn't cheap.  Reading it whenever a
hash chain reaches 16 is reasonable, but I think we would want to read
it a lot more often than that.
So probably store the last-sampled time (with no locking) and only
sample the counter if last-sampled is more than jiffies - 10*HZ (???)

In the case of lustre we also shard the LRU list so that adding to the
LRU causes minimal contention.  Keeping a sharded counter together
with the LRU is trivial and summing the shards periodically is little
burden.  Maybe it makes sense to include that functionality in
rhashtable so that it is there for everyone.

A percpu counter uses a lot more memory than atomic_t.  Given that
some callers set nelem_hint to 2 or 3, it seems likely that those
callers don't want to waste memory.  Should we force them to?

Thanks,
NeilBrown
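The sampling throttle described above - store a last-sampled time with
no locking, and only re-read the expensive counter once the interval
has passed - looks roughly like this.  A userspace sketch with seconds
standing in for jiffies and 10*HZ; the names are hypothetical and the
expensive summation is stubbed:

```c
#include <time.h>

#define SAMPLE_INTERVAL 10      /* seconds; stands in for 10*HZ jiffies */

static long sum_calls;          /* how often the slow path actually ran */

/* Stand-in for the expensive operation, e.g. summing percpu shards. */
static long expensive_sum(void)
{
    sum_calls++;
    return 42;
}

static time_t last_sampled;     /* updated without locking, racy by design */
static long cached_count;

/* Return a recent estimate of the element count, resampling at most
 * once per interval.  The racy timestamp is acceptable: the worst case
 * is that a few concurrent callers all take the slow path once. */
static long estimate_count(time_t now)
{
    if (now - last_sampled > SAMPLE_INTERVAL) {
        last_sampled = now;
        cached_count = expensive_sum();
    }
    return cached_count;
}
```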
Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
On Wed, Mar 28, 2018 at 08:34:19AM +1100, NeilBrown wrote:
>
> It is easy to get an -EBUSY insertion failure when .disable_count is
> enabled, and I did get that.  Blindly propagating that up caused
> lustre to get terribly confused - not too surprising really.

Right, so this failure mode is specific to your patch 6.

I think I see the problem.  As it currently stands, we never need
growing when we hit the bucket length limit.  We only do rehashes.

With your patch, you must change it so that we actually try to grow
the table if necessary when we hit the bucket length limit.

Otherwise it will result in the EBUSY that you're seeing.

I also think that we shouldn't make this a toggle.  If we're going to
do disable_count then it should be unconditionally done for everyone.

However, I personally would prefer a percpu elem count instead of
disabling it completely.  Would that be acceptable to your use-case?

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
On Tue, Mar 27 2018, Herbert Xu wrote:

> On Tue, Mar 27, 2018 at 10:33:04AM +1100, NeilBrown wrote:
>> The current rhashtable will fail an insertion if the hashtable
>> is "too full", one of:
>> - table already has 2^31 elements (-E2BIG)
>> - a max_size was specified and table already has that
>>   many elements (rounded up to power of 2) (-E2BIG)
>> - a single chain has more than 16 elements (-EBUSY)
>> - table has more elements than the current table size,
>>   and allocating a new table fails (-ENOMEM)
>> - a new page needed to be allocated for a nested table,
>>   and the memory allocation failed (-ENOMEM).
>>
>> A traditional hash table does not have a concept of "too full", and
>> insertion only fails if the key already exists.  Many users of hash
>> tables have separate means of limiting the total number of entries,
>> and are not susceptible to an attack which could cause unusually
>> large hash chains.  For those users, the need to check for errors
>> when inserting objects into an rhashtable is an unnecessary burden
>> and hence a potential source of bugs (as these failures are likely
>> to be rare).
>
> Did you actually encounter an insertion failure?  The current code
> should never fail an insertion until you actually run out of memory.
> That is unless you're using rhashtable when you should be using
> rhlist instead.

It is easy to get an -EBUSY insertion failure when .disable_count is
enabled, and I did get that.  Blindly propagating that up caused
lustre to get terribly confused - not too surprising really.

Even if I didn't see errors in practice, if the interface can return
an error, then I need to check for the error and really should test
that handling each error works correctly.  It is much easier to write
reliable code when errors cannot happen, so I'd rather have that
option.

Thanks,
NeilBrown
Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
On Tue, Mar 27, 2018 at 10:33:04AM +1100, NeilBrown wrote:
> The current rhashtable will fail an insertion if the hashtable
> is "too full", one of:
> - table already has 2^31 elements (-E2BIG)
> - a max_size was specified and table already has that
>   many elements (rounded up to power of 2) (-E2BIG)
> - a single chain has more than 16 elements (-EBUSY)
> - table has more elements than the current table size,
>   and allocating a new table fails (-ENOMEM)
> - a new page needed to be allocated for a nested table,
>   and the memory allocation failed (-ENOMEM).
>
> A traditional hash table does not have a concept of "too full", and
> insertion only fails if the key already exists.  Many users of hash
> tables have separate means of limiting the total number of entries,
> and are not susceptible to an attack which could cause unusually
> large hash chains.  For those users, the need to check for errors
> when inserting objects into an rhashtable is an unnecessary burden
> and hence a potential source of bugs (as these failures are likely
> to be rare).

Did you actually encounter an insertion failure?  The current code
should never fail an insertion until you actually run out of memory.
That is unless you're using rhashtable when you should be using rhlist
instead.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
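Given the failure modes quoted above, and Herbert's advice elsewhere in
this thread that callers should retry neither ENOMEM nor EBUSY, a
caller-side dispatch collapses to something like this sketch.  Only the
error codes come from the mail; the function name and enum are
hypothetical:

```c
#include <errno.h>

enum insert_outcome { INSERTED, DROP_OBJECT };

/* Map the insert error codes listed above onto caller actions.  Per
 * the discussion, no non-zero code should be retried, so every error
 * ends up as "free the object and drop it". */
static enum insert_outcome classify_insert_err(int err)
{
    switch (err) {
    case 0:
        return INSERTED;
    case -E2BIG:        /* table at 2^31 elements or caller's max_size */
    case -EBUSY:        /* chain over 16 elements right after a rehash */
    case -ENOMEM:       /* table or nested-table page allocation failed */
    default:
        return DROP_OBJECT;
    }
}
```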