Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
Hannes Frederic Sowa wrote:
> We call extract_crng when we run out of batched entropy and reseed. How
> often we call down to extract_crng depends on how much entropy we
> extracted by calls to get_random_int/long, so the number of calls into
> those functions matters.
>
> In extract_crng we have a timer which reseeds the CPRNG every 300s and
> either uses completely new entropy from the CRNG or calls down into the
> CPRNG while also doing backtracking protection (which feeds chacha's
> block size / 2 back into chacha, if I read the code correctly, thus
> 1024 bits, which should be enough).

In the current code, _extract_crng checks to see if more than 300 s have
elapsed since the last time it was reseeded, and if so, reseeds with
fresh entropy.

In addition, on every read (or get_random_bytes), if the request leaves
enough random bits in the last ChaCha block, it feeds back 256 bits (the
ChaCha block size is 16*32 = 512 bits) for anti-backtracking.

If the last read happened to not fit under that limit (size % 512 > 256),
*and* there are no calls for RNG output for a long time, there is no
upper limit to how long the old ChaCha key can hang around.

> On Fri, 2016-12-23 at 20:17 -0500, George Spelvin wrote:
>> For example, two mix-backs of 64 bits gives you 65 bit security, not 128.
>> (Because each mixback can be guessed and verified separately.)

> Exactly, but the full reseed after running out of entropy is strong
> enough to not be defeated by your argumentation. Neither is the reseed
> from the CRNG.

Yes, I was just reacting to your original statement:

> couldn't we simply use 8 bytes of the 64 byte
> return block to feed it directly back into the state chacha?

It's not the idea that's bad, just the proposed quantity.

>> If you want that, I have a pile of patches to prandom I really
>> should push upstream. Shall I refresh them and send them to you?

> I would like to have a look at them in the new year, certainly!
> I can also take care of the core prandom patches, but don't know if I
> have time to submit the others to the different subsystems.
>
> Maybe, if David would be okay with that, we can submit all patches
> through his tree, as he is also the dedicated maintainer for prandom.

Amazing, thank you very much! They're just minor cleanups, nothing too
exciting. I'll put them in the queue to make sure they're up to date.
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
Hello,

On Fri, 2016-12-23 at 20:17 -0500, George Spelvin wrote:
> Hannes Frederic Sowa wrote:
> > On 24.12.2016 00:39, George Spelvin wrote:
> > > We just finished discussing why 8 bytes isn't enough. If you only
> > > feed back 8 bytes, an attacker who can do 2^64 computation can find it
> > > (by guessing and computing forward to verify the guess) and recover the
> > > previous state. You need to feed back at least as much output as your
> > > security target. For /dev/urandom's ChaCha20, that's 256 bits.
> > I followed the discussion but it appeared to me that this has the
> > additional constraint of time until the next reseeding event happens,
> > which is 300s (under the assumption that calls to get_random_int happen
> > regularly, which I expect right now). After that the existing reseeding
> > mechanism will ensure enough backtracking protection. The number of
> > bytes can easily be increased here, given that reseeding was shown to be
> > quite fast already and we produce enough output. But I am not sure if
> > this is a bit overengineered in the end?
>
> I'm not following your description of how the time-based and call-based
> mechanisms interact, but for any mix-back, you should either do enough
> or none at all. (Also called "catastrophic reseeding".)

We call extract_crng when we run out of batched entropy and reseed. How
often we call down to extract_crng depends on how much entropy we
extracted by calls to get_random_int/long, so the number of calls into
those functions matters.

In extract_crng we have a timer which reseeds the CPRNG every 300s and
either uses completely new entropy from the CRNG or calls down into the
CPRNG while also doing backtracking protection (which feeds chacha's
block size / 2 back into chacha, if I read the code correctly, thus
1024 bits, which should be enough).

> For example, two mix-backs of 64 bits gives you 65 bit security, not 128.
> (Because each mixback can be guessed and verified separately.)
Exactly, but the full reseed after running out of entropy is strong
enough to not be defeated by your argumentation. Neither is the reseed
from the CRNG.

> > Also agreed. Given your solution below to prandom_u32, I do think it
> > might also work without the seqlock now.
>
> It's not technically a seqlock; in particular the reader doesn't
> spin. But the write side, and general logic, is so similar it's
> a good mental model.
>
> Basically, assume a 64-byte buffer. The reader has gone through
> 32 bytes of it, and has 32 left, and when he reads another 8 bytes,
> has to distinguish three cases:
>
> 1) No update; we read the old bytes and there are now 32 - 8 = 24
>    bytes left.
>
> 2) Update completed while we weren't looking. There are now new
>    bytes in the buffer, and we took 8 leaving 64 - 8 = 56.
>
> 3) Update in progress at the time of the read. We don't know if we
>    are seeing old bytes or new bytes, so we have to assume the worst
>    and not proceed unless 32 >= 8, but assume at the end there are
>    64 - 8 = 56 new bytes left.
>
> > I wouldn't have added the irq disabling, but given that I really like
> > your proposal, I would take it in my todo branch and submit it when
> > the net-next window opens.
>
> If you want that, I have a pile of patches to prandom I really
> should push upstream. Shall I refresh them and send them to you?

I would like to have a look at them in the new year, certainly! I can
also take care of the core prandom patches, but don't know if I have
time to submit the others to the different subsystems.

Maybe, if David would be okay with that, we can submit all patches
through his tree, as he is also the dedicated maintainer for prandom.

> [... patch descriptions ...]

Thanks,
Hannes
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
Hannes Frederic Sowa wrote:
> On 24.12.2016 00:39, George Spelvin wrote:
>> We just finished discussing why 8 bytes isn't enough. If you only
>> feed back 8 bytes, an attacker who can do 2^64 computation can find it
>> (by guessing and computing forward to verify the guess) and recover the
>> previous state. You need to feed back at least as much output as your
>> security target. For /dev/urandom's ChaCha20, that's 256 bits.

> I followed the discussion but it appeared to me that this has the
> additional constraint of time until the next reseeding event happens,
> which is 300s (under the assumption that calls to get_random_int happen
> regularly, which I expect right now). After that the existing reseeding
> mechanism will ensure enough backtracking protection. The number of
> bytes can easily be increased here, given that reseeding was shown to be
> quite fast already and we produce enough output. But I am not sure if
> this is a bit overengineered in the end?

I'm not following your description of how the time-based and call-based
mechanisms interact, but for any mix-back, you should either do enough
or none at all. (Also called "catastrophic reseeding".)

For example, two mix-backs of 64 bits gives you 65 bit security, not 128.
(Because each mixback can be guessed and verified separately.)

> Also agreed. Given your solution below to prandom_u32, I do think it
> might also work without the seqlock now.

It's not technically a seqlock; in particular the reader doesn't
spin. But the write side, and general logic, is so similar it's
a good mental model.

Basically, assume a 64-byte buffer. The reader has gone through
32 bytes of it, and has 32 left, and when he reads another 8 bytes,
has to distinguish three cases:

1) No update; we read the old bytes and there are now 32 - 8 = 24
   bytes left.

2) Update completed while we weren't looking. There are now new
   bytes in the buffer, and we took 8 leaving 64 - 8 = 56.

3) Update in progress at the time of the read.
We don't know if we
are seeing old bytes or new bytes, so we have to assume the worst
and not proceed unless 32 >= 8, but assume at the end there are
64 - 8 = 56 new bytes left.

> I wouldn't have added the irq disabling, but given that I really like
> your proposal, I would take it in my todo branch and submit it when
> the net-next window opens.

If you want that, I have a pile of patches to prandom I really
should push upstream. Shall I refresh them and send them to you?

commit 4cf1b3d9f4fbccc29ffc2fe4ca4ff52ea77253f1
Author: George Spelvin
Date:   Mon Aug 31 00:05:00 2015 -0400

    net: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    The net/802 code was already efficient, but prandom_u32_max() is
    simpler.

    In net/batman-adv/bat_iv_ogm.c, batadv_iv_ogm_fwd_send_time() got
    changed from picking a random number of milliseconds and converting
    to jiffies to picking a random number of jiffies, since the number
    of milliseconds (and thus the conversion to jiffies) is a
    compile-time constant. The equivalent code in
    batadv_iv_ogm_emit_send_time was not changed, because the number
    of milliseconds is variable.

    In net/ipv6/ip6_flowlabel.c, ip6_flowlabel had htonl(prandom_u32()),
    which is silly. Just cast to __be32 without permuting the bits.

    net/sched/sch_netem.c got adjusted to only need one call to
    prandom_u32 instead of 2. (Assuming skb_headlen can't exceed
    512 MiB, which is hopefully safe for some time yet.)
    Signed-off-by: George Spelvin

commit 9c8fb80e1fd2be42c35cab1af27187d600fd85e3
Author: George Spelvin
Date:   Sat May 24 15:20:47 2014 -0400

    mm/swapfile.c: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    Signed-off-by: George Spelvin

commit 2743eb01e5c5958fd88ae78d19c5fea772d4b117
Author: George Spelvin
Date:   Sat May 24 15:19:53 2014 -0400

    lib: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    Signed-off-by: George Spelvin

commit 6a5e91bf395060a3351bfe5efc40ac20ffba2c1b
Author: George Spelvin
Date:   Sat May 24 15:18:50 2014 -0400

    fs/xfs: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    Also changed the last argument of xfs_error_test() from
    "unsigned long" to "unsigned", since the code never did support
    values > 2^32, and the largest value ever passed is 100.

    The code could be improved even further by passing in 2^32/rf
    rather than rf, but I'll leave that to some XFS developers.

    Signed-off-by: George Spelvin

commit 6f6d485d9179ca6ec4e30caa06ade0e0c6931810
Author: George Spelvin
Date:   Sat May 24 15:00:17 2014 -0400

    fs/ubifs: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    In fs/ubifs/debug.c, the chance() function got rewritten entirely
    to take advantage o
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
Hi,

On 24.12.2016 00:39, George Spelvin wrote:
> Hannes Frederic Sowa wrote:
>> In general this looks good, but bitbuffer needs to be protected from
>> concurrent access, thus needing at least some atomic instruction and
>> disabling of interrupts for the locking if done outside of
>> get_random_long. Thus I liked your previous approach more where you
>> simply embed this in the already necessary get_random_long and aliased
>> get_random_long as get_random_bits(BITS_PER_LONG) accordingly, IMHO.
>
> It's meant to be part of the same approach, and I didn't include locking
> because that's a requirement for *any* solution, and isn't specific
> to the part I was trying to illustrate.
>
> (As for re-using the name "get_random_long", that was just so
> I didn't have to explain it. Call it "get_random_long_internal"
> if you like.)
>
> Possible locking implementations include:
> 1) Use the same locking as applies to get_random_long_internal(), or
> 2) Make bitbuffer a per-CPU variable (note that we currently have 128
>    bits of per-CPU state in get_random_int_hash[]), and this is all a
>    fast-path to bypass heavier locking in get_random_long_internal().

I understood that this is definitely a solvable problem.

>>> But, I just realized I've been overlooking something glaringly obvious...
>>> there's no reason you can't compute multiple blocks in advance.
>>
>> In the current code, at the cost of per-CPU allocations and thus memory.
>
> Yes, but on 4-core machines it's still not much, and 4096-core
> behemoths have RAM to burn.
>
>> In the extract_crng case, couldn't we simply use 8 bytes of the 64 byte
>> return block to feed it directly back into the state chacha? So we pass
>> on 56 bytes into the pcpu buffer, and consume 8 bytes for the next
>> state. This would make the window max shorter than the
>> anti-backtracking protection right now, from 300s to 14
>> get_random_int calls. Not sure if this is worth it.
>
> We just finished discussing why 8 bytes isn't enough.
> If you only
> feed back 8 bytes, an attacker who can do 2^64 computation can find it
> (by guessing and computing forward to verify the guess) and recover the
> previous state. You need to feed back at least as much output as your
> security target. For /dev/urandom's ChaCha20, that's 256 bits.

I followed the discussion but it appeared to me that this has the
additional constraint of time until the next reseeding event happens,
which is 300s (under the assumption that calls to get_random_int happen
regularly, which I expect right now). After that the existing reseeding
mechanism will ensure enough backtracking protection. The number of
bytes can easily be increased here, given that reseeding was shown to be
quite fast already and we produce enough output. But I am not sure if
this is a bit overengineered in the end?

>>> For example, suppose we gave each CPU a small pool to minimize locking.
>>> When one runs out and calls the global refill, it could refill *all*
>>> of the CPU pools. (You don't even need locking; there may be a race to
>>> determine *which* random numbers the reader sees, but they're equally
>>> good either way.)

>> Yes, but still disabled interrupts, otherwise the same state could be
>> used twice on the same CPU. Uff, I think we have this problem in
>> prandom_u32.

> There are some locking gotchas, but it is doable lock-free.
>
> Basically, it's a seqlock. The writer increments it once (to an odd
> number) before starting to overwrite the buffer, and a second time (to
> an even number) after. "Before" and "after" mean smp_wmb().
>
> The reader can use this to figure out how much of the data in the buffer
> is safely fresh. The full sequence of checks is a bit intricate,
> but straightforward.
>
> I didn't discuss the locking because I'm confident it's solvable,
> not because I wasn't aware it has to be solved.

Also agreed. Given your solution below to prandom_u32, I do think it
might also work without the seqlock now.
> As for prandom_u32(), what's the problem? Are you worried that
> get_cpu_var disables preemption but not interrupts, and so an
> ISR might return the same value as process-level code?

Yes, exactly those were my thoughts.

> First of all, that's not a problem because prandom_u32() doesn't
> have security guarantees. Occasionally returning a duplicate number
> is okay.
>
> Second, if you do care, that could be trivially fixed by throwing
> a barrier() in the middle of the code. (Patch appended; S-o-B
> if anyone wants it.)

I wouldn't have added the irq disabling, but given that I really like
your proposal, I would take it in my todo branch and submit it when the
net-next window opens.

> diff --git a/lib/random32.c b/lib/random32.c
> index c750216d..6bee4a36 100644
> --- a/lib/random32.c
> +++ b/lib/random32.c
> [...]

Thanks,
Hannes
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
Hannes Frederic Sowa wrote:
> In general this looks good, but bitbuffer needs to be protected from
> concurrent access, thus needing at least some atomic instruction and
> disabling of interrupts for the locking if done outside of
> get_random_long. Thus I liked your previous approach more where you
> simply embed this in the already necessary get_random_long and aliased
> get_random_long as get_random_bits(BITS_PER_LONG) accordingly, IMHO.

It's meant to be part of the same approach, and I didn't include locking
because that's a requirement for *any* solution, and isn't specific
to the part I was trying to illustrate.

(As for re-using the name "get_random_long", that was just so
I didn't have to explain it. Call it "get_random_long_internal"
if you like.)

Possible locking implementations include:
1) Use the same locking as applies to get_random_long_internal(), or
2) Make bitbuffer a per-CPU variable (note that we currently have 128
   bits of per-CPU state in get_random_int_hash[]), and this is all a
   fast-path to bypass heavier locking in get_random_long_internal().

>> But, I just realized I've been overlooking something glaringly obvious...
>> there's no reason you can't compute multiple blocks in advance.
>
> In the current code, at the cost of per-CPU allocations and thus memory.

Yes, but on 4-core machines it's still not much, and 4096-core
behemoths have RAM to burn.

> In the extract_crng case, couldn't we simply use 8 bytes of the 64 byte
> return block to feed it directly back into the state chacha? So we pass
> on 56 bytes into the pcpu buffer, and consume 8 bytes for the next
> state. This would make the window max shorter than the anti-backtracking
> protection right now, from 300s to 14 get_random_int calls.
> Not sure if this is worth it.

We just finished discussing why 8 bytes isn't enough. If you only
feed back 8 bytes, an attacker who can do 2^64 computation can find it
(by guessing and computing forward to verify the guess) and recover the
previous state.
You need to feed back at least as much output as your
security target. For /dev/urandom's ChaCha20, that's 256 bits.

>> For example, suppose we gave each CPU a small pool to minimize locking.
>> When one runs out and calls the global refill, it could refill *all*
>> of the CPU pools. (You don't even need locking; there may be a race to
>> determine *which* random numbers the reader sees, but they're equally
>> good either way.)

> Yes, but still disabled interrupts, otherwise the same state could be
> used twice on the same CPU. Uff, I think we have this problem in
> prandom_u32.

There are some locking gotchas, but it is doable lock-free.

Basically, it's a seqlock. The writer increments it once (to an odd
number) before starting to overwrite the buffer, and a second time (to
an even number) after. "Before" and "after" mean smp_wmb().

The reader can use this to figure out how much of the data in the buffer
is safely fresh. The full sequence of checks is a bit intricate,
but straightforward.

I didn't discuss the locking because I'm confident it's solvable,
not because I wasn't aware it has to be solved.

As for prandom_u32(), what's the problem? Are you worried that
get_cpu_var disables preemption but not interrupts, and so an
ISR might return the same value as process-level code?

First of all, that's not a problem because prandom_u32() doesn't
have security guarantees. Occasionally returning a duplicate number
is okay.

Second, if you do care, that could be trivially fixed by throwing
a barrier() in the middle of the code. (Patch appended; S-o-B
if anyone wants it.)

diff --git a/lib/random32.c b/lib/random32.c
index c750216d..6bee4a36 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -55,16 +55,29 @@ static DEFINE_PER_CPU(struct rnd_state, net_rand_state);
  *
  * This is used for pseudo-randomness with no outside seeding.
  * For more random results, use prandom_u32().
+ *
+ * The barrier() is to allow prandom_u32() to be called from interrupt
+ * context without locking.
+ * An interrupt will run to completion,
+ * updating all four state variables.  The barrier() ensures that
+ * the interrupted code will compute a different result.  Either it
+ * will have written s1 and s2 (so the interrupt will start with
+ * the updated values), or it will use the values of s3 and s4
+ * updated by the interrupt.
+ *
+ * (The same logic applies recursively to nested interrupts, trap
+ * handlers, and NMIs.)
  */
 u32 prandom_u32_state(struct rnd_state *state)
 {
+	register u32 x;
 #define TAUSWORTHE(s, a, b, c, d) ((s & c) << d) ^ (((s << a) ^ s) >> b)
-	state->s1 = TAUSWORTHE(state->s1, 6, 13, 4294967294U, 18U);
-	state->s2 = TAUSWORTHE(state->s2, 2, 27, 4294967288U, 2U);
-	state->s3 = TAUSWORTHE(state->s3, 13, 21, 4294967280U, 7U);
-	state->s4 = TAUSWORTHE(state->s4, 3, 12, 4294967168U, 13U);
+	x = state->s1 = TAUSWORTHE(state->s1, 6, 13, ~
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
Hi,

On Fri, 2016-12-23 at 13:26 -0500, George Spelvin wrote:
> (Cc: list trimmed slightly as the topic is wandering a bit.)
>
> Hannes Frederic Sowa wrote:
> > On Thu, 2016-12-22 at 19:07 -0500, George Spelvin wrote:
> > > Adding might_lock() annotations will improve coverage a lot.
> >
> > Might be hard to find the correct lock we take later down the code
> > path, but if that is possible, certainly.
>
> The point of might_lock() is that you don't have to. You find the
> worst case (most global) lock that the code *might* take if all the
> buffer-empty conditions are true, and tell lockdep "assume this lock is
> taken all the time".

Yes, of course. Probably indicating input_pool's and primary_crng's
locks are good to indicate here, but on NUMA other locks might be taken.

> > > Hannes Frederic Sowa wrote:
> > > > Yes, that does look nice indeed. Accounting for bits instead of bytes
> > > > shouldn't be a huge problem either. Maybe it gets a bit more verbose in
> > > > case you can't satisfy a request with one batched entropy block and have
> > > > to consume randomness from two.
>
> For example, here's a simple bit-buffer implementation that wraps around
> a get_random_long. The bitbuffer is of the form "1xx...x", where the
> x bits are valid, and the position of the msbit indicates how many bits
> are valid.
>
> extern unsigned long get_random_long();
> static unsigned long bitbuffer = 1;	/* Holds 0..BITS_PER_LONG-1 bits */
>
> unsigned long get_random_bits(unsigned char bits)
> {
> 	/* We handle bits == BITS_PER_LONG, and not bits == 0 */
> 	unsigned long mask = -1ul >> (BITS_PER_LONG - bits);
> 	unsigned long val;
>
> 	if (bitbuffer > mask) {
> 		/* Request can be satisfied out of the bit buffer */
> 		val = bitbuffer;
> 		bitbuffer >>= bits;
> 	} else {
> 		/*
> 		 * Not enough bits, but enough room in bitbuffer for the
> 		 * leftovers. avail < bits, so avail + 64 <= bits + 63.
> 		 */
> 		val = get_random_long();
> 		bitbuffer = bitbuffer << (BITS_PER_LONG - bits)
> 			  | val >> 1 >> (bits-1);
> 	}
> 	return val & mask;
> }

In general this looks good, but bitbuffer needs to be protected from
concurrent access, thus needing at least some atomic instruction and
disabling of interrupts for the locking if done outside of
get_random_long. Thus I liked your previous approach more where you
simply embed this in the already necessary get_random_long and aliased
get_random_long as get_random_bits(BITS_PER_LONG) accordingly, IMHO.

> > When we hit the chacha20 without doing a reseed we only mutate the
> > state of chacha, but being an invertible function in its own, a
> > proposal would be to mix parts of the chacha20 output back into the
> > state, which, as a result, would cause slowdown because we couldn't
> > propagate the complete output of the cipher back to the caller (looking
> > at the function _extract_crng).
>
> Basically, yes. Half of the output goes to rekeying itself.

Okay, thanks and understood. :)

> But, I just realized I've been overlooking something glaringly obvious...
> there's no reason you can't compute multiple blocks in advance.

In the current code, at the cost of per-CPU allocations and thus memory.

> The standard assumption in antibacktracking is that you'll *notice* the
> state capture and stop trusting the random numbers afterward; you just
> want the output *before* to be secure. In other words, cops busting
> down the door can't find the session key used in the message you just sent.

Yep, analogous to forward secrecy, and future secrecy (self healing) is
provided by catastrophic reseeding from /dev/random.

In the extract_crng case, couldn't we simply use 8 bytes of the 64 byte
return block to feed it directly back into the state chacha? So we pass
on 56 bytes into the pcpu buffer, and consume 8 bytes for the next
state.
This would make the window max shorter than the anti-backtracking
protection right now, from 300s to 14 get_random_int calls. Not sure
if this is worth it.

> So you can compute and store random numbers ahead of need.
>
> This can amortize the antibacktracking as much as you'd like.
>
> For example, suppose we gave each CPU a small pool to minimize locking.
> When one runs out and calls the global refill, it could refill *all*
> of the CPU pools. (You don't even need locking; there may be a race to
> determine *which* random numbers the reader sees, but they're equally
> good either way.)

Yes, but still disabled interrupts, otherwise the same state could be
used twice on the same CPU. Uff, I think we have this problem in
prandom_u32.

> > Or are you referring that the anti-backtrack protection should happen
> > in every call from get_random_int?
>
> If you ask for anti-backtracking without qualification, that's the
> goal, since you don't know how much time will elapse until the next call.
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
(Cc: list trimmed slightly as the topic is wandering a bit.)

Hannes Frederic Sowa wrote:
> On Thu, 2016-12-22 at 19:07 -0500, George Spelvin wrote:
>> Adding might_lock() annotations will improve coverage a lot.
>
> Might be hard to find the correct lock we take later down the code
> path, but if that is possible, certainly.

The point of might_lock() is that you don't have to. You find the
worst case (most global) lock that the code *might* take if all the
buffer-empty conditions are true, and tell lockdep "assume this lock is
taken all the time".

>> Hannes Frederic Sowa wrote:
>>> Yes, that does look nice indeed. Accounting for bits instead of bytes
>>> shouldn't be a huge problem either. Maybe it gets a bit more verbose in
>>> case you can't satisfy a request with one batched entropy block and have
>>> to consume randomness from two.

For example, here's a simple bit-buffer implementation that wraps around
a get_random_long. The bitbuffer is of the form "1xx...x", where the
x bits are valid, and the position of the msbit indicates how many bits
are valid.

extern unsigned long get_random_long();
static unsigned long bitbuffer = 1;	/* Holds 0..BITS_PER_LONG-1 bits */

unsigned long get_random_bits(unsigned char bits)
{
	/* We handle bits == BITS_PER_LONG, and not bits == 0 */
	unsigned long mask = -1ul >> (BITS_PER_LONG - bits);
	unsigned long val;

	if (bitbuffer > mask) {
		/* Request can be satisfied out of the bit buffer */
		val = bitbuffer;
		bitbuffer >>= bits;
	} else {
		/*
		 * Not enough bits, but enough room in bitbuffer for the
		 * leftovers. avail < bits, so avail + 64 <= bits + 63.
		 */
		val = get_random_long();
		bitbuffer = bitbuffer << (BITS_PER_LONG - bits)
			  | val >> 1 >> (bits-1);
	}
	return val & mask;
}

> When we hit the chacha20 without doing a reseed we only mutate the
> state of chacha, but being an invertible function in its own, a
> proposal would be to mix parts of the chacha20 output back into the
> state, which, as a result, would cause slowdown because we couldn't
> propagate the complete output of the cipher back to the caller (looking
> at the function _extract_crng).

Basically, yes. Half of the output goes to rekeying itself.

But, I just realized I've been overlooking something glaringly obvious...
there's no reason you can't compute multiple blocks in advance.

The standard assumption in antibacktracking is that you'll *notice* the
state capture and stop trusting the random numbers afterward; you just
want the output *before* to be secure. In other words, cops busting
down the door can't find the session key used in the message you just sent.

So you can compute and store random numbers ahead of need.

This can amortize the antibacktracking as much as you'd like.

For example, suppose we gave each CPU a small pool to minimize locking.
When one runs out and calls the global refill, it could refill *all*
of the CPU pools. (You don't even need locking; there may be a race to
determine *which* random numbers the reader sees, but they're equally
good either way.)

> Or are you referring that the anti-backtrack protection should happen
> in every call from get_random_int?

If you ask for anti-backtracking without qualification, that's the
goal, since you don't know how much time will elapse until the next call.
It's like fsync(). There are lots of more limited forms of "keep my
data safe in case of a crash", but the most basic one is "if we lost
power the very instant the call returned, the data would be safe."
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
On Thu, 2016-12-22 at 19:07 -0500, George Spelvin wrote:
> Hannes Frederic Sowa wrote:
> > A lockdep test should still be done. ;)
>
> Adding might_lock() annotations will improve coverage a lot.

Might be hard to find the correct lock we take later down the code
path, but if that is possible, certainly.

> > Yes, that does look nice indeed. Accounting for bits instead of bytes
> > shouldn't be a huge problem either. Maybe it gets a bit more verbose in
> > case you can't satisfy a request with one batched entropy block and have
> > to consume randomness from two.
>
> The bit granularity is also for the callers' convenience, so they don't
> have to mask again. Whether get_random_bits rounds up to byte boundaries
> internally or not is something else.
>
> When the current batch runs low, I was actually thinking of throwing
> away the remaining bits and computing a new batch of 512. But it's
> whatever works best at implementation time.
>
> > > > It could only mix the output back in every two calls, in which case
> > > > you can backtrack up to one call but you need to do 2^128 work to
> > > > backtrack farther. But yes, this is getting excessively complicated.
> > > No, if you're willing to accept limited backtrack, this is a perfectly
> > > acceptable solution, and not too complicated. You could do it phase-less
> > > if you like; store the previous output, then after generating the new
> > > one, mix in both. Then overwrite the previous output. (But doing two
> > > rounds of a crypto primitive to avoid one conditional jump is stupid,
> > > so forget that.)
> > Can you quickly explain why we lose the backtracking capability?
>
> Sure. An RNG is (state[i], output[i]) = f(state[i-1]). The goal of
> backtracking is to compute output[i], or better yet state[i-1], given
> state[i].
>
> For example, consider an OFB or CTR mode generator.
> The state is a key
> and an IV, and you encrypt the IV with the key to produce output, then
> either replace the IV with the output, or increment it. Either way,
> since you still have the key, you can invert the transformation and
> recover the previous IV.
>
> The standard way around this is to use the Davies-Meyer construction:
>
>	IV[i] = IV[i-1] + E(IV[i-1], key)
>
> This is the standard way to make a non-invertible random function
> out of an invertible random permutation.
>
> From the sum, there's no easy way to find the ciphertext *or* the
> plaintext that was encrypted. Assuming the encryption is secure,
> the only way to reverse it is brute force: guess IV[i-1] and run the
> operation forward to see if the resultant IV[i] matches.
>
> There are a variety of ways to organize this computation, since the
> guess gives you both IV[i-1] and E(IV[i-1], key) = IV[i] - IV[i-1],
> including running E forward, backward, or starting from both ends to
> see if you meet in the middle.
>
> The way you add the encryption output to the IV is not very important.
> It can be addition, xor, or some more complex invertible transformation.
> In the case of SipHash, the "encryption" output is smaller than the
> input, so we have to get a bit more creative, but it's still basically
> the same thing.
>
> The problem is that the output which is combined with the IV is too small.
> With only 64 bits, trying all possible values is practical. (The world's
> Bitcoin miners are collectively computing SHA-256(SHA-256(input)) 1.7 * 2^64
> times per second.)
>
> By basically doing two iterations at once and mixing in 128 bits of
> output, the guessing attack is rendered impractical. The only downside
> is that you need to remember and store one result between when it's
> computed and last used. This is part of the state, so an attacker can
> find output[i-1], but not anything farther back.

Thanks a lot for the explanation!
> > ChaCha as a block cipher gives a "perfect" permutation from the output > > of either the CRNG or the CPRNG, which actually itself has backtracking > > protection. > > I'm not quite understanding. The /dev/random implementation uses some > of the ChaCha output as a new ChaCha key (that's another way to mix output > back into the state) to prevent backtracking. But this slows it down, and > again if you want to be efficient, you're generating large batches > of entropy and storing them in the RNG state. I was actually referring to the anti-backtrack protection in /dev/random and also /dev/urandom, from where we reseed every 300 seconds and if our batched entropy runs low with Ted's/Jason's current patch for get_random_int. As far as I can understand it, backtracking is not a problem in case of a reseed event inside extract_crng. When we hit the chacha20 without doing a reseed we only mutate the state of chacha, but since it is an invertible function in its own right, a proposal would be to mix parts of the chacha20 output back into the state, which, as a result, would cause slowdown because we couldn't propagate the complete output of the cipher back
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
Hannes Frederic Sowa wrote: > A lockdep test should still be done. ;) Adding might_lock() annotations will improve coverage a lot. > Yes, that does look nice indeed. Accounting for bits instead of bytes > shouldn't be a huge problem either. Maybe it gets a bit more verbose in > case you can't satisfy a request with one batched entropy block and have > to consume randomness from two. The bit granularity is also for the callers' convenience, so they don't have to mask again. Whether get_random_bits rounds up to byte boundaries internally or not is something else. When the current batch runs low, I was actually thinking of throwing away the remaining bits and computing a new batch of 512. But it's whatever works best at implementation time. >>> It could only mix the output back in every two calls, in which case >>> you can backtrack up to one call but you need to do 2^128 work to >>> backtrack farther. But yes, this is getting excessively complicated. > >> No, if you're willing to accept limited backtrack, this is a perfectly >> acceptable solution, and not too complicated. You could do it phase-less >> if you like; store the previous output, then after generating the new >> one, mix in both. Then overwrite the previous output. (But doing two >> rounds of a crypto primitive to avoid one conditional jump is stupid, >> so forget that.) > Can you quickly explain why we lose the backtracking capability? Sure. An RNG is (state[i], output[i]) = f(state[i-1]). The goal of backtracking is to compute output[i], or better yet state[i-1], given state[i]. For example, consider an OFB or CTR mode generator. The state is a key and an IV, and you encrypt the IV with the key to produce output, then either replace the IV with the output, or increment it. Either way, since you still have the key, you can invert the transformation and recover the previous IV.
The standard way around this is to use the Davies-Meyer construction: IV[i] = IV[i-1] + E(IV[i-1], key) This is the standard way to make a non-invertible random function out of an invertible random permutation. From the sum, there's no easy way to find the ciphertext *or* the plaintext that was encrypted. Assuming the encryption is secure, the only way to reverse it is brute force: guess IV[i-1] and run the operation forward to see if the resultant IV[i] matches. There are a variety of ways to organize this computation, since the guess gives you both IV[i-1] and E(IV[i-1], key) = IV[i] - IV[i-1], including running E forward, backward, or starting from both ends to see if you meet in the middle. The way you add the encryption output to the IV is not very important. It can be addition, xor, or some more complex invertible transformation. In the case of SipHash, the "encryption" output is smaller than the input, so we have to get a bit more creative, but it's still basically the same thing. The problem is that the output which is combined with the IV is too small. With only 64 bits, trying all possible values is practical. (The world's Bitcoin miners are collectively computing SHA-256(SHA-256(input)) 1.7 * 2^64 times per second.) By basically doing two iterations at once and mixing in 128 bits of output, the guessing attack is rendered impractical. The only downside is that you need to remember and store one result between when it's computed and last used. This is part of the state, so an attacker can find output[i-1], but not anything farther back. > ChaCha as a block cipher gives a "perfect" permutation from the output > of either the CRNG or the CPRNG, which actually itself has backtracking > protection. I'm not quite understanding. The /dev/random implementation uses some of the ChaCha output as a new ChaCha key (that's another way to mix output back into the state) to prevent backtracking.
But this slows it down, and again if you want to be efficient, you're generating large batches of entropy and storing them in the RNG state.
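The key-replacement mix-back described here might look like this in outline. The block function is stubbed with a toy mixer so the sketch is self-contained (it is NOT ChaCha20); the word layout follows ChaCha's state, where words 4-11 hold the 256-bit key and word 12 is the block counter:

```c
#include <stdint.h>
#include <string.h>

struct crng {
	uint32_t state[16];	/* words 4..11 hold the 256-bit key */
};

/* Stand-in for the ChaCha20 block function: 16 words in, 16 words out.
 * A toy mixer so the sketch runs; the real thing is 20 ChaCha rounds. */
static void chacha_block_stub(const uint32_t in[16], uint32_t out[16])
{
	int i;

	for (i = 0; i < 16; i++)
		out[i] = (in[i] ^ 0x61707865u) * 2654435761u + (uint32_t)i;
}

/* Anti-backtracking by key replacement: after producing a block,
 * overwrite the key words with 256 bits of the output, so the
 * pre-update state (and hence earlier output) can't be reconstructed
 * from a later state capture. */
static void extract_block(struct crng *c, uint32_t out[16])
{
	chacha_block_stub(c->state, out);
	memcpy(&c->state[4], &out[8], 8 * sizeof(uint32_t));	/* new key */
	c->state[12]++;						/* counter */
}
```

The slowdown George mentions is visible here: 256 of the 512 generated bits are consumed by the feedback instead of being returned to the caller.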
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
On 22.12.2016 22:11, George Spelvin wrote: >> I do tend to like Ted's version in which we use batched >> get_random_bytes() output. If it's fast enough, it's simpler and lets >> us get the full strength of a CSPRNG. > > With the ChaCha20 generator, that's fine, although note that this abandons > anti-backtracking entirely. > > It also takes locks, something the previous get_random_int code > path avoided. Do we need to audit the call sites to ensure that's safe? We have spin_lock_irq* locks on the way. Of course they can hurt when contended. The situation should be the same as with get_random_bytes, callable from every possible situation in the kernel without fear, so this should also work for get_random_int. A lockdep test should still be done. ;) > And there is the issue that the existing callers assume that there's a > fixed cost per word. A good half of get_random_long calls are followed by > "& ~PAGE_MASK" to extract the low 12 bits. Or "& ((1ul << mmap_rnd_bits) > - 1)" to extract the low 28. If we have a buffer we're going to have to > pay to refill, it would be nice to use less than 8 bytes to satisfy those. > > But that can be a followup patch. I'm thinking > > unsigned long get_random_bits(unsigned bits) > E.g. get_random_bits(PAGE_SHIFT), > get_random_bits(mmap_rnd_bits), > u32 imm_rnd = get_random_bits(32) > > unsigned get_random_mod(unsigned modulus) > E.g. get_random_mod(hole) & ~(alignment - 1); > get_random_mod(port_scan_backoff) > (Although probably drivers/s390/scsi/zfcp_fc.c should be changed > to prandom.) > > with, until the audit is completed: > #define get_random_int() get_random_bits(32) > #define get_random_long() get_random_bits(BITS_PER_LONG) Yes, that does look nice indeed. Accounting for bits instead of bytes shouldn't be a huge problem either. Maybe it gets a bit more verbose in case you can't satisfy a request with one batched entropy block and have to consume randomness from two.
>> It could only mix the output back in every two calls, in which case >> you can backtrack up to one call but you need to do 2^128 work to >> backtrack farther. But yes, this is getting excessively complicated. > > No, if you're willing to accept limited backtrack, this is a perfectly > acceptable solution, and not too complicated. You could do it phase-less > if you like; store the previous output, then after generating the new > one, mix in both. Then overwrite the previous output. (But doing two > rounds of a crypto primitive to avoid one conditional jump is stupid, > so forget that.) Can you quickly explain why we lose the backtracking capability? ChaCha as a block cipher gives a "perfect" permutation from the output of either the CRNG or the CPRNG, which actually itself has backtracking protection. Thanks for explaining, Hannes
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
> I do tend to like Ted's version in which we use batched > get_random_bytes() output. If it's fast enough, it's simpler and lets > us get the full strength of a CSPRNG. With the ChaCha20 generator, that's fine, although note that this abandons anti-backtracking entirely. It also takes locks, something the previous get_random_int code path avoided. Do we need to audit the call sites to ensure that's safe? And there is the issue that the existing callers assume that there's a fixed cost per word. A good half of get_random_long calls are followed by "& ~PAGE_MASK" to extract the low 12 bits. Or "& ((1ul << mmap_rnd_bits) - 1)" to extract the low 28. If we have a buffer we're going to have to pay to refill, it would be nice to use less than 8 bytes to satisfy those. But that can be a followup patch. I'm thinking unsigned long get_random_bits(unsigned bits) E.g. get_random_bits(PAGE_SHIFT), get_random_bits(mmap_rnd_bits), u32 imm_rnd = get_random_bits(32) unsigned get_random_mod(unsigned modulus) E.g. get_random_mod(hole) & ~(alignment - 1); get_random_mod(port_scan_backoff) (Although probably drivers/s390/scsi/zfcp_fc.c should be changed to prandom.) with, until the audit is completed: #define get_random_int() get_random_bits(32) #define get_random_long() get_random_bits(BITS_PER_LONG) > It could only mix the output back in every two calls, in which case > you can backtrack up to one call but you need to do 2^128 work to > backtrack farther. But yes, this is getting excessively complicated. No, if you're willing to accept limited backtrack, this is a perfectly acceptable solution, and not too complicated. You could do it phase-less if you like; store the previous output, then after generating the new one, mix in both. Then overwrite the previous output. (But doing two rounds of a crypto primitive to avoid one conditional jump is stupid, so forget that.) >> Hmm, interesting. Although, for ASLR, we could use get_random_bytes() >> directly and be done with it.
It won't be a bottleneck. Isn't that what you already suggested? I don't mind fewer primitives; I got a bit fixated on "Replace MD5 with SipHash". It's just the locking that I want to check isn't a problem.
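As a sketch of how the proposed get_random_bits() might consume a batched reservoir, discarding leftover bits on refill as suggested above. Every name here is illustrative, not kernel code, and next_batch_word() is a deterministic stand-in for pulling a fresh word from the batched CRNG output:

```c
#include <stdint.h>
#include <assert.h>

/* Deterministic stand-in for one word of batched CRNG output
 * (an LCG, so the demo is reproducible; NOT a secure generator). */
static uint64_t demo_state = 0x123456789abcdef0ULL;

static uint64_t next_batch_word(void)
{
	demo_state = demo_state * 6364136223846793005ULL
		   + 1442695040888963407ULL;
	return demo_state;
}

static uint64_t reservoir;
static unsigned reservoir_bits;

/* Hand out exactly 'bits' bits, so callers wanting 12 (PAGE_SHIFT) or
 * 28 (mmap_rnd_bits) bits don't burn a full 64-bit word.  When the
 * reservoir runs low, the remainder is thrown away and a fresh word
 * fetched, per the mail. */
static uint64_t get_random_bits(unsigned bits)
{
	uint64_t mask, r;

	assert(bits >= 1 && bits <= 64);
	if (reservoir_bits < bits) {
		reservoir = next_batch_word();
		reservoir_bits = 64;
	}
	mask = (bits == 64) ? ~0ULL : ((1ULL << bits) - 1);
	r = reservoir & mask;
	reservoir = (bits == 64) ? 0 : reservoir >> bits;	/* avoid UB shift */
	reservoir_bits -= bits;
	return r;
}
```

A caller would then write e.g. `get_random_bits(PAGE_SHIFT)` instead of `get_random_long() & ~PAGE_MASK`.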
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
On Thu, Dec 22, 2016 at 11:24 AM, George Spelvin wrote: >> Having slept on this, I like it less. The problem is that a >> backtracking attacker doesn't just learn H(random seed || entropy_0 || >> secret || ...) -- they learn the internal state of the hash function >> that generates that value. This probably breaks any attempt to apply >> security properties of the hash function. For example, the internal >> state could easily contain a whole bunch of prior outputs in it >> verbatim. > > The problem is, anti-backtracking is in severe tension with your desire > to use unmodified SipHash. > > First of all, I'd like to repeat that it isn't a design goal of the current > generator and isn't necessary. Agreed. > Now, the main point: it's not likely to be solvable. > > The problem with unmodified SipHash is that it has only 64 bits of > output. No mix-back mechanism can get around the fundamental problem > that that's too small to prevent a brute-force guessing attack. You need > wider mix-back. And getting more output from unmodified SipHash requires > more finalization rounds, which is expensive. It could only mix the output back in every two calls, in which case you can backtrack up to one call but you need to do 2^128 work to backtrack farther. But yes, this is getting excessively complicated. > Finally, your discomfort about an attacker learning the internal state... > if an attacker knows the key and the input, they can construct the > internal state. Yes, we could discard the internal state and construct > a fresh one on the next call to get_random_int, but what are you going > to key it with? What are you going to feed it? What keeps *that* > internal state any more secret from an attacker than one that's explicitly > stored? I do tend to like Ted's version in which we use batched get_random_bytes() output. If it's fast enough, it's simpler and lets us get the full strength of a CSPRNG.
(Aside: some day I want to move all that code from drivers/ to lib/ and teach it to be buildable in userspace, too, so it's easy to play with it, feed it test vectors, confirm that it's equivalent to a reference implementation, write up the actual design and try to get real cryptographers to analyze it, etc.) > For example, clearly stating the concern over starting new processes > with predictable layout, and the limits on the fresh entropy supply, > has made me realize that there *is* a possible source: each exec() > is passed 128 bits from get_random_bytes in the AT_RANDOM element of > its auxv. Since get_random_int() accesses "current" anyway, we could > store some seed material there rather than using "pid". While this is > not fresh for each call to get_random_int, it *is* fresh for each new > address space whose layout is being randomized. Hmm, interesting. Although, for ASLR, we could use get_random_bytes() directly and be done with it. It won't be a bottleneck.
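For reference, the AT_RANDOM bytes mentioned above are visible from userspace too: the kernel puts 16 bytes (128 bits) from get_random_bytes() into every exec()'d process's auxiliary vector, and glibc exposes the vector via getauxval(). A minimal reader (the helper name is invented; getauxval and AT_RANDOM are the real glibc API):

```c
#include <stdint.h>
#include <string.h>
#include <sys/auxv.h>

/* Copy the 16 AT_RANDOM bytes the kernel handed this process at exec().
 * Returns 0 on success, -1 if the aux vector entry is absent. */
static int read_at_random(uint8_t out[16])
{
	const uint8_t *p = (const uint8_t *)getauxval(AT_RANDOM);

	if (!p)
		return -1;
	memcpy(out, p, 16);
	return 0;
}
```

Note glibc itself consumes some of these bytes (for stack canaries and pointer mangling), which is part of why they'd be reused carefully, not treated as a free per-process secret.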
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
> Having slept on this, I like it less. The problem is that a > backtracking attacker doesn't just learn H(random seed || entropy_0 || > secret || ...) -- they learn the internal state of the hash function > that generates that value. This probably breaks any attempt to apply > security properties of the hash function. For example, the internal > state could easily contain a whole bunch of prior outputs in it > verbatim. The problem is, anti-backtracking is in severe tension with your desire to use unmodified SipHash. First of all, I'd like to repeat that it isn't a design goal of the current generator and isn't necessary. The current generator just returns hash[0] from the MD5 state, then leaves the state stored. The fact that it conceals earlier outputs is an accident of the Davies-Meyer structure of md5_transform. It isn't necessary, because every secret generated is stored unencrypted for as long as it's of value. A few values are used for retransmit backoffs and random MAC addresses. Those are revealed to the world as soon as they're used. Most values are used for ASLR. These addresses are of interest to an attacker trying to mount a buffer-overflow attack, but that only lasts as long as the process is running to receive buffers. After the process exits, knowledge of its layout is worthless. And this is stored as long as it's running in kernel-accessible data, so storing a *second* copy in less conveniently kernel-accessible data (the RNG state) doesn't make the situation any worse. In addition to the above, if you're assuming a state capture, then since we have (for necessary efficiency reasons) a negligible amount of fresh entropy, an attacker has the secret key and can predict *future* outputs very easily. Given that information, an attacker doesn't need to learn the layout of vulnerable server X. If they have a buffer overflow, they can crash the current instance and wait for a fresh image to be started (with a known address space) at which to launch their attack.
Kernel state capture is a very unlikely attack, mostly because it's a narrow target a hair's breadth away from the much more interesting outright kernel compromise (attacker gains write access as well as read) which renders all this fancy cryptanalysis moot. Now, the main point: it's not likely to be solvable. The problem with unmodified SipHash is that it has only 64 bits of output. No mix-back mechanism can get around the fundamental problem that that's too small to prevent a brute-force guessing attack. You need wider mix-back. And getting more output from unmodified SipHash requires more finalization rounds, which is expensive. (Aside: 64 bits does have the advantage that it can't be brute-forced on the attacked machine. It must be exfiltrated to the attacker, and the solution returned to the attack code. But doing this is getting easier all the time.) Adding anti-backtracking to SipHash is trivial: just add a Davies-Meyer structure around its internal state. Remember the internal state before hashing in the entropy and secret, generate the output, then add the previous and final states together for storage. This is a standard textbook construction, very cheap, and doesn't touch the compression function which is the target of analysis and attacks, but it requires poking around in SipHash's internal state. (And people who read the textbooks without understanding them will get upset because the textbooks all talk about using this construction with block ciphers, and SipHash's compression function is not advertised as a block cipher.) Alternative designs exist; you could extract additional output from earlier rounds of SipHash, using the duplex sponge construction you mentioned earlier. That output would be used for mixback purposes *only*, so wouldn't affect the security proof of the "primary" output. But this is also getting creative with SipHash's internals.
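The "remember the state, generate, then add previous and final states" recipe can be sketched around the real SipHash round function. sipround() below is the genuine SipHash round; the absorb pattern and output tap are simplified for illustration and are not a drop-in for the straw-man design:

```c
#include <stdint.h>
#include <string.h>

static inline uint64_t rotl64(uint64_t x, int b)
{
	return (x << b) | (x >> (64 - b));
}

/* The actual SipHash round (SIPROUND from the paper). */
static void sipround(uint64_t v[4])
{
	v[0] += v[1]; v[1] = rotl64(v[1], 13); v[1] ^= v[0]; v[0] = rotl64(v[0], 32);
	v[2] += v[3]; v[3] = rotl64(v[3], 16); v[3] ^= v[2];
	v[0] += v[3]; v[3] = rotl64(v[3], 21); v[3] ^= v[0];
	v[2] += v[1]; v[1] = rotl64(v[1], 17); v[1] ^= v[2]; v[2] = rotl64(v[2], 32);
}

/* Davies-Meyer wrapper: snapshot the state, absorb one input word
 * (SipHash compression style), emit an output word, then store
 * prev + final so the stored state can't be run backward. */
static uint64_t generate(uint64_t state[4], uint64_t input)
{
	uint64_t prev[4], out;
	int i;

	memcpy(prev, state, sizeof(prev));

	state[3] ^= input;		/* absorb, as in SipHash */
	sipround(state);
	sipround(state);
	state[0] ^= input;

	out = state[1] ^ state[3];	/* simplified output tap */

	for (i = 0; i < 4; i++)		/* the feedforward: prev + final */
		state[i] += prev[i];
	return out;
}
```

Without the final loop the update is an invertible permutation of the state; with it, reversing the stored state requires the guess-and-check search discussed earlier in the thread.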
Now, you could use a completely *different* cryptographic primitive to enforce one-way-ness, and apply SipHash as a strong output transform, but that doesn't feel like good design, and is probably more expensive. Finally, your discomfort about an attacker learning the internal state... if an attacker knows the key and the input, they can construct the internal state. Yes, we could discard the internal state and construct a fresh one on the next call to get_random_int, but what are you going to key it with? What are you going to feed it? What keeps *that* internal state any more secret from an attacker than one that's explicitly stored? Keeping the internal state around is a caching optimization, that's all. *If* you're assuming a state capture, the only thing secret from the attacker is any fresh entropy collected between the time of capture and the time of generation. Due to mandatory efficiency requirements, this is very small. I really think you're wishing for the impossible here. A final note: although I'm disagreeing with you,
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
On Wed, Dec 21, 2016 at 6:07 PM, Andy Lutomirski wrote: > On Wed, Dec 21, 2016 at 5:13 PM, George Spelvin > wrote: >> As a separate message, to disentangle the threads, I'd like to >> talk about get_random_long(). >> >> After some thinking, I still like the "state-preserving" construct >> that's equivalent to the current MD5 code. Yes, we could just do >> siphash(current_cpu || per_cpu_counter, global_key), but it's nice to >> preserve a bit more. >> >> It requires library support from the SipHash code to return the full >> SipHash state, but I hope that's a fair thing to ask for. > > I don't even think it needs that. This is just adding a > non-destructive final operation, right? > >> >> Here's my current straw man design for comment. It's very similar to >> the current MD5-based design, but feeds all the seed material in the >> "correct" way, as opposed to XORing directly into the MD5 state. >> >> * Each CPU has a (Half)SipHash state vector, >> "unsigned long get_random_int_hash[4]". Unlike the current >> MD5 code, we take care to initialize it to an asymmetric state. >> >> * There's a global 256-bit random_int_secret (which we could >> reseed periodically). >> >> To generate a random number: >> * If get_random_int_hash is all-zero, seed it with a fresh half-sized >> SipHash key and the appropriate XOR constants. >> * Generate three words of random_get_entropy(), jiffies, and current->pid. >> (This is arbitrary seed material, copied from the current code.) >> * Crank through that with (Half)SipHash-1-0. >> * Crank through the random_int_secret with (Half)SipHash-1-0. >> * Return v1 ^ v3. > > Just to clarify, if we replace SipHash with a black box, I think this > effectively means, where "entropy" is random_get_entropy() || jiffies > || current->pid: > > The first call returns H(random seed || entropy_0 || secret). The > second call returns H(random seed || entropy_0 || secret || entropy_1 > || secret). Etc. Having slept on this, I like it less.
The problem is that a backtracking attacker doesn't just learn H(random seed || entropy_0 || secret || ...) -- they learn the internal state of the hash function that generates that value. This probably breaks any attempt to apply security properties of the hash function. For example, the internal state could easily contain a whole bunch of prior outputs in it verbatim. --Andy
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
> True, but it's called get_random_int(), and it seems like making it > stronger, especially if the performance cost is low to zero, is a good > thing. If it's cheap enough, I don't mind. But it's documented as being marginal-quality, used where speed is more important. In particular, it's *not* used for key material; only for values that matter only as long as they are in use. While they're in use, they can't be concealed from an attacker with kernel access, and when they're done being used, they're worthless. >> If you want anti-backtracking, though, it's easy to add. What we >> hash is: >> >> entropy_0 || secret || output_0 || entropy_1 || secret || output_1 || ... >> >> You mix the output word right back in to the (unfinalized) state after >> generating it. This is still equivalent to unmodified black-box SipHash, >> you're just using a (conceptually independent) SipHash invocation to >> produce some of its input. > Ah, cute. This could probably be sped up by doing something like: > > entropy_0 || secret || output_0 ^ entropy_1 || secret || ... I'm disinclined to do that because that requires deferring the mixing until the *next* time you generate something. Storing the value you don't want revealed by a state capture defeats the purpose. I'd rather mix it in immediately, so you have anti-backtracking from the moment of creation. Also, I don't think it's particularly "cute" or clever; mixing output back in is the standard way all anti-backtracking is accomplished. It's how the Davies-Meyer hash construction makes a reversible function one-way. (There is a second way to do it by throwing away state, but that's expensive in seed entropy.) > It's a little weak because the output is only 64 bits, so you could > plausibly backtrack it on a GPU or FPGA cluster or on an ASIC if the > old entropy is guessable. I suspect there are sneaky ways around it > like using output_n-1 ^ output_n-2 or similar. I'll sleep on it. Ah, yes, I see.
Given the final state, you guess the output word, go back one round, then forward the finalization rounds. Is the output equal to the guessed output? You'll find the true value, plus Poisson(1 - 2^-64) additional. (Since you have 2^64-1 chances at something with probability 1 in 2^64.) And this can be iterated as often as you like to get earlier output words, as long as you can guess the entropy. *That's* the part that hurts; you'd like something that peters out. You could use the double-length-output SipHash variant (which requires a second set of finalization rounds) and mix more output back, but that's expensive. The challenge is coming up with more unpredictable data to mix in than one invocation of SipHash returns. And without storing previous output anywhere, because that is exactly wrong. A running sum or xor or whatever of the outputs doesn't help, because once you've guessed the last output, you can backtrack that for no additional effort. State capture is incredibly difficult, our application doesn't require resistance anyway... unless you can think of something cheap, I think we can just live with this. >> I'd *like* to persuade you that skipping the padding byte wouldn't >> invalidate any security proofs, because it's true and would simplify >> the code. But if you want 100% stock, I'm willing to cater to that. > I lean toward stock in the absence of a particularly good reason. At > the very least I'd want to read that paper carefully. Er... adding the length is standard Merkle-Damgaard strengthening. Why you do this is described in the original papers by Merkle and Damgaard. The lay summary is at https://en.wikipedia.org/wiki/Merkle-Damgard_construction The original sources are: http://www.merkle.com/papers/Thesis1979.pdf http://saluc.engr.uconn.edu/refs/algorithms/hashalg/damgard89adesign.pdf Merkle describes the construction; Damgaard proves it secure. 
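For concreteness, the padding rule under discussion is SipHash's: the last 8-byte input word packs the leftover message bytes (little-endian) together with the low 8 bits of the total message length in the most significant byte. A sketch of building that final word, following the reference implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* Build SipHash's final input word: the (len mod 8) leftover bytes in
 * the low positions, and (len mod 256) in the top byte.  This is the
 * "padding byte" the thread keeps referring to. */
static uint64_t last_block(const uint8_t *msg, size_t len)
{
	size_t left = len & 7;			/* bytes past the last full word */
	uint64_t b = (uint64_t)(len & 0xff) << 56;
	size_t i;

	for (i = 0; i < left; i++)
		b |= (uint64_t)msg[len - left + i] << (8 * i);
	return b;
}
```

George's point is that when the final-block usage is fixed per key, the length byte adds no distinguishing information, which is why he considers skipping it safe.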
Basically, appending the length is required to handle variable-length input if the input is not itself self-delimiting. The proof of security is theorem 3.1 in the latter. (The first, more detailed explanation involves the use of an extra bit, which the second then explains how to do without.) In particular, see the top of page 420, which notes that the security proof only requires encoding *how much padding is added* in the final block, not the overall length of the message, and the second remark on p. 421 which notes that no such suffix is required if it's not necessary to distinguish messages with different numbers of trailing null bytes. The rules are alluded to in the "Choice of padding rule" part of the "Rationale" section of the SipHash paper (p. 7), but the description is very brief because it assumes the reader has the background. That's why they say "We could have chosen a slightly simpler padding rule, such as appending a 80 byte followed by zeroes." The thing is, if the amount of the last block that is used is fixed (within the domain of a particular key), you don't nee
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
On Wed, Dec 21, 2016 at 9:01 PM, George Spelvin wrote: > Andy Lutomirski wrote: >> I don't even think it needs that. This is just adding a >> non-destructive final operation, right? > > It is, but the problem is that SipHash is intended for *small* inputs, > so the standard implementations aren't broken into init/update/final > functions. > > There's just one big function that keeps the state variables in > registers and never stores them anywhere. > > If we *had* init/update/final functions, then it would be trivial. > >> Just to clarify, if we replace SipHash with a black box, I think this >> effectively means, where "entropy" is random_get_entropy() || jiffies >> || current->pid: > >> The first call returns H(random seed || entropy_0 || secret). The >> second call returns H(random seed || entropy_0 || secret || entropy_1 >> || secret). Etc. > > Basically, yes. I was skipping the padding byte and keying the > finalization rounds on the grounds of "can't hurt and might help", > but we could do it in a more standard way. > >> If not, then I have a fairly strong preference to keep whatever >> construction we come up with consistent with something that could >> actually happen with invocations of unmodified SipHash -- then all the >> security analysis on SipHash goes through. > > Okay. I don't think it makes a difference, but it's not a *big* waste > of time. If we have finalization rounds, we can reduce the secret > to 128 bits. > > If we include the padding byte, we can do one of two things: > 1) Make the secret 184 bits, to fill up the final partial word as > much as possible, or > 2) Make the entropy 1 byte smaller and conceptually misalign the > secret. What we'd actually do is remove the last byte of > the secret and include it in the entropy words, but that's > just a rotation of the secret between storage and hashing. > > Also, I assume you'd like SipHash-2-4, since you want to rely > on a security analysis.
I haven't looked, but I assume that the analysis at least thought about reduced rounds, so maybe other variants are okay. >> The one thing I don't like is >> that I don't see how to prove that you can't run it backwards if you >> manage to acquire a memory dump. In fact, I think that there exist, at >> least in theory, hash functions that are secure in the random oracle >> model but that *can* be run backwards given the full state. From >> memory, SHA-3 has exactly that property, and it would be a bit sad for >> a CSPRNG to be reversible. > > Er... get_random_int() is specifically *not* designed to be resistant > to state capture, and I didn't try. Remember, what it's used for > is ASLR, what we're worried about is someone learning the layouts > of still-running processes, and if you get a memory dump, you have > the memory layout! True, but it's called get_random_int(), and it seems like making it stronger, especially if the performance cost is low to zero, is a good thing. > > If you want anti-backtracking, though, it's easy to add. What we > hash is: > > entropy_0 || secret || output_0 || entropy_1 || secret || output_1 || ... > > You mix the output word right back in to the (unfinalized) state after > generating it. This is still equivalent to unmodified black-box SipHash, > you're just using a (conceptually independent) SipHash invocation to > produce some of its input. Ah, cute. This could probably be sped up by doing something like: entropy_0 || secret || output_0 ^ entropy_1 || secret || ... It's a little weak because the output is only 64 bits, so you could plausibly backtrack it on a GPU or FPGA cluster or on an ASIC if the old entropy is guessable. I suspect there are sneaky ways around it like using output_n-1 ^ output_n-2 or similar. I'll sleep on it. > > The only remaining issues are: > 1) How many rounds, and > 2) May we use HalfSipHash? I haven't looked closely enough to have a real opinion here.
I don't know what the security margin is believed to be. > > I'd *like* to persuade you that skipping the padding byte wouldn't > invalidate any security proofs, because it's true and would simplify > the code. But if you want 100% stock, I'm willing to cater to that. I lean toward stock in the absence of a particularly good reason. At the very least I'd want to read that paper carefully. > > Ted, what do you think? -- Andy Lutomirski AMA Capital Management, LLC
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
Andy Lutomirski wrote: > I don't even think it needs that. This is just adding a > non-destructive final operation, right? It is, but the problem is that SipHash is intended for *small* inputs, so the standard implementations aren't broken into init/update/final functions. There's just one big function that keeps the state variables in registers and never stores them anywhere. If we *had* init/update/final functions, then it would be trivial. > Just to clarify, if we replace SipHash with a black box, I think this > effectively means, where "entropy" is random_get_entropy() || jiffies > || current->pid: > The first call returns H(random seed || entropy_0 || secret). The > second call returns H(random seed || entropy_0 || secret || entropy_1 > || secret). Etc. Basically, yes. I was skipping the padding byte and keying the finalization rounds on the grounds of "can't hurt and might help", but we could do it in a more standard way. > If not, then I have a fairly strong preference to keep whatever > construction we come up with consistent with something that could > actually happen with invocations of unmodified SipHash -- then all the > security analysis on SipHash goes through. Okay. I don't think it makes a difference, but it's not a *big* waste of time. If we have finalization rounds, we can reduce the secret to 128 bits. If we include the padding byte, we can do one of two things: 1) Make the secret 184 bits, to fill up the final partial word as much as possible, or 2) Make the entropy 1 byte smaller and conceptually misalign the secret. What we'd actually do is remove the last byte of the secret and include it in the entropy words, but that's just a rotation of the secret between storage and hashing. Also, I assume you'd like SipHash-2-4, since you want to rely on a security analysis. (Regarding the padding byte, getting it right might be annoying to do exactly.
All of the security analysis depends *only* on its low 3 bits indicating how much of the final block is used. As it says in the SipHash paper, they included 8 bits just because it was easy. But if you want it exact, it's just one more byte of state.) > The one thing I don't like is > that I don't see how to prove that you can't run it backwards if you > manage to acquire a memory dump. In fact, I think that there exist, at > least in theory, hash functions that are secure in the random oracle > model but that *can* be run backwards given the full state. From > memory, SHA-3 has exactly that property, and it would be a bit sad for > a CSPRNG to be reversible. Er... get_random_int() is specifically *not* designed to be resistant to state capture, and I didn't try. Remember, what it's used for is ASLR, what we're worried about is someone learning the layouts of still-running processes, and if you get a memory dump, you have the memory layout! If you want anti-backtracking, though, it's easy to add. What we hash is: entropy_0 || secret || output_0 || entropy_1 || secret || output_1 || ... You mix the output word right back into the (unfinalized) state after generating it. This is still equivalent to unmodified black-box SipHash, you're just using a (conceptually independent) SipHash invocation to produce some of its input. Each output is produced by copying the state, padding & finalizing after the secret. In fact, to make our lives easier, let's define the secret to end with a counter byte that happens to be equal to the padding byte. The input stream will be:

Previous output: 8 (or 4 for HalfSipHash) bytes
Entropy: 15 bytes (8 bytes timer, 4 bytes jiffies, 3 bytes pid)
Secret: 16 bytes
Counter: 1 byte
...repeat...

> We could also periodically mix in a big (128-bit?) chunk of fresh > urandom output to keep the bad guys guessing. Simpler and faster to just update the global master secret. The state is per-CPU, so mixing in has to be repeated per CPU.
With these changes, I'm satisfied that it's secure, cheap, has a sufficiently wide state size, *and* all standard SipHash analysis applies. The only remaining issues are: 1) How many rounds, and 2) May we use HalfSipHash? I'd *like* to persuade you that skipping the padding byte wouldn't invalidate any security proofs, because it's true and would simplify the code. But if you want 100% stock, I'm willing to cater to that. Ted, what do you think?
Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
Hi George, On Thu, Dec 22, 2016 at 4:55 AM, George Spelvin wrote: > Do we have to go through this? No, the benchmark was *not* bogus. > Then I replaced the kernel #includes with the necessary typedefs > and #defines to make it compile in user-space. > * I didn't iterate 100K times, I timed the functions *once*. > * I saved the times in a buffer and printed them all at the end > so printf() wouldn't pollute the caches. > * Before every even-numbered iteration, I flushed the I-cache > of everything from _init to _fini (i.e. all the non-library code). > This cold-cache case is what is going to happen in the kernel. Wow! Great. Thanks for the pointers on the right way to do this. Very helpful, and enlightening results indeed. Think you could send me the whole .c of what you finally came up with? I'd like to run this on some other architectures; I've got a few odd boxes laying around here. > The P4 results were: > SipHash actually wins slightly in the cold-cache case, because > it iterates more. In the hot-cache case, it loses > Core 2 duo: > Pretty much a tie, honestly. > Ivy Bridge: > Modern processors *hate* cold caches. But notice how md5 is *faster* > than SipHash on hot-cache IPv6. > Ivy Bridge, -m64: > Of course, when you compile -m64, SipHash is unbeatable. Okay, so I think these results are consistent with some of the assessments from before -- that SipHash is really just fine as a replacement for MD5. Not great on older 32-bit x86, but not too horrible, and the performance improvements on every other architecture and the security improvements everywhere are a net good. > Here's the modified benchmark() code. The entire package is > a bit voluminous for the mailing list, but anyone is welcome to it. Please do send! I'm sure I'll learn from reading it. Thanks again for doing the hard work of putting something proper together. Thanks, Jason
Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
> Plus the benchmark was bogus anyway, and when I built a more specific > harness -- actually comparing the TCP sequence number functions -- > SipHash was faster than MD5, even on register starved x86. So I think > we're fine and this chapter of the discussion can come to a close, in > order to move on to more interesting things. Do we have to go through this? No, the benchmark was *not* bogus. Here are my results from *your* benchmark. I can't reboot some of my test machines, so I took net/core/secure_seq.c, lib/siphash.c, lib/md5.c and include/linux/siphash.h straight out of your test tree. Then I replaced the kernel #includes with the necessary typedefs and #defines to make it compile in user-space. (Voluminous but straightforward.) E.g.

#define __aligned(x) __attribute__((__aligned__(x)))
#define cacheline_aligned __aligned(64)
#define CONFIG_INET 1
#define IS_ENABLED(x) 1
#define ktime_get_real_ns() 0
#define sysctl_tcp_timestamps 0
... etc.

Then I modified your benchmark code into the appended code. The differences are:
* I didn't iterate 100K times, I timed the functions *once*.
* I saved the times in a buffer and printed them all at the end so printf() wouldn't pollute the caches.
* Before every even-numbered iteration, I flushed the I-cache of everything from _init to _fini (i.e. all the non-library code). This cold-cache case is what is going to happen in the kernel.

In the results below, note that I did *not* re-flush between phases of the test. The effect of caching is clearly apparent in the tcpv4 results, where the tcpv6 code loaded the cache. You can also see that the SipHash code benefits more from caching when entered with a cold cache, as it iterates over the input words, while the MD5 code is one big unrolled blob. Order of computation is down the columns first, across second.
The P4 results were:

tcpv6 md5 cold:     4084 3488 3584 3584 3568
tcpv4 md5 cold:     1052  996  996 1060  996
tcpv6 siphash cold: 4080 3296 3312 3296 3312
tcpv4 siphash cold: 2968 2748 2972 2716 2716
tcpv6 md5 hot:       900  712  712  712  712
tcpv4 md5 hot:       632  672  672  672  672
tcpv6 siphash hot:  2484 2292 2340 2340 2340
tcpv4 siphash hot:  1660 1560 1564 2340 1564

SipHash actually wins slightly in the cold-cache case, because it iterates more. In the hot-cache case, it loses horribly.

Core 2 duo:

tcpv6 md5 cold:     3396 2868 2964 3012 2832
tcpv4 md5 cold:     1368 1044 1320 1332 1308
tcpv6 siphash cold: 2940 2952 2916 2448 2604
tcpv4 siphash cold: 3192 2988 3576 3504 3624
tcpv6 md5 hot:      1116 1032  996 1008 1008
tcpv4 md5 hot:       936  936  936  936  936
tcpv6 siphash hot:  1200 1236 1236 1188 1188
tcpv4 siphash hot:   936  804  804  804  804

Pretty much a tie, honestly.

Ivy Bridge:

tcpv6 md5 cold:     6086 6136 6962 6358 6060
tcpv4 md5 cold:      816  732 1046 1054 1012
tcpv6 siphash cold: 3756 1886 2152 2390 2566
tcpv4 siphash cold: 3264 2108 3026 3120 3526
tcpv6 md5 hot:      1062  808  824  824  832
tcpv4 md5 hot:       730  730  740  748  748
tcpv6 siphash hot:   960  952  936 1112  926
tcpv4 siphash hot:   638  544  562  552  560

Modern processors *hate* cold caches. But notice how md5 is *faster* than SipHash on hot-cache IPv6.

Ivy Bridge, -m64:

tcpv6 md5 cold:     4680 3672 3956 3616 3525
tcpv4 md5 cold:     1066 1416 1179 1179 1134
tcpv6 siphash cold:  940 1258 1995 1609 2255
tcpv4 siphash cold: 1440 1269 1292 1870 1621
tcpv6 md5 hot:      1372 1122 1088 1088
tcpv4 md5 hot:       997  997  997  997  998
tcpv6 siphash hot:   340  340  340  352  340
tcpv4 siphash hot:   227  238  238  238  238

Of course, when you compile -m64, SipHash is unbeatable. Here's the modified benchmark() code. The entire package is a bit voluminous for the mailing list, but anyone is welcome to it.
static void clflush(void)
{
	extern char const _init, _fini;
	char const *p = &_init;

	while (p < &_fini) {
		asm("clflush %0" : : "m" (*p));
		p += 64;
	}
}

typedef uint32_t cycles_t;

static cycles_t get_cycles(void)
{
	uint32_t eax, edx;

	asm volatile("rdtsc" : "=a" (eax), "=d" (edx));
	return eax;
}

static int benchmark(void)
{
	cycles_t start, finish;
	int i;
	u32 seq_number = 0;
	__be32 saddr6[4] = { 1, 4, 182, 393 },
	       daddr6[4] = { 9192, 18288, 222, 0xff10 };
	__be32 saddr4 = 2, daddr4 = 182112;
	__be16 sport = 22, dport = 41992;
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
> On Wed, Dec 21, 2016 at 5:13 PM, George Spelvin >> After some thinking, I still like the "state-preserving" construct >> that's equivalent to the current MD5 code. Yes, we could just do >> siphash(current_cpu || per_cpu_counter, global_key), but it's nice to >> preserve a bit more. >> >> It requires library support from the SipHash code to return the full >> SipHash state, but I hope that's a fair thing to ask for. This is not a good idea. If I understand correctly, the idea here is to just keep around SipHash's internal state variables, and chain them over to the next call, sort of like how md5_transform with the current code works on the same scratch space. There has been no security analysis in the literature on this use of the primitive, and I have no confidence that this is a secure use of the function. Unless somebody can point me toward a paper I missed or a comment from a real cryptographer about the specifics of SipHash, I think I'm right to admonish against this dangerous road. Let's talk about constructions. And let's only decide on a construction that we're actually equipped to analyze. Let's definitely not talk about making our own primitives, or retrofitting a nice primitive's internals into our own Frankenstein. Alternatively, if I'm wrong, please send me an eprint/arXiv link to a paper that discusses this use of SipHash.
George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
On Wed, Dec 21, 2016 at 5:13 PM, George Spelvin wrote: > As a separate message, to disentangle the threads, I'd like to > talk about get_random_long(). > > After some thinking, I still like the "state-preserving" construct > that's equivalent to the current MD5 code. Yes, we could just do > siphash(current_cpu || per_cpu_counter, global_key), but it's nice to > preserve a bit more. > > It requires library support from the SipHash code to return the full > SipHash state, but I hope that's a fair thing to ask for. I don't even think it needs that. This is just adding a non-destructive final operation, right? > > Here's my current straw man design for comment. It's very similar to > the current MD5-based design, but feeds all the seed material in the > "correct" way, as opposed to XORing directly into the MD5 state. > > * Each CPU has a (Half)SipHash state vector, > "unsigned long get_random_int_hash[4]". Unlike the current > MD5 code, we take care to initialize it to an asymmetric state. > > * There's a global 256-bit random_int_secret (which we could > reseed periodically). > > To generate a random number: > * If get_random_int_hash is all-zero, seed it with a fresh half-sized > SipHash key and the appropriate XOR constants. > * Generate three words of random_get_entropy(), jiffies, and current->pid. > (This is arbitrary seed material, copied from the current code.) > * Crank through that with (Half)SipHash-1-0. > * Crank through the random_int_secret with (Half)SipHash-1-0. > * Return v1 ^ v3. Just to clarify, if we replace SipHash with a black box, I think this effectively means, where "entropy" is random_get_entropy() || jiffies || current->pid: The first call returns H(random seed || entropy_0 || secret). The second call returns H(random seed || entropy_0 || secret || entropy_1 || secret). Etc.
If not, then I have a fairly strong preference to keep whatever construction we come up with consistent with something that could actually happen with invocations of unmodified SipHash -- then all the security analysis on SipHash goes through. Anyway, I have mixed thoughts about the construction. It manages to have a wide state at essentially no cost, which buys us quite a bit of work factor to break it. Even with full knowledge of the state, an output doesn't reveal the entropy except to the extent that it can be brute-forced (this is just whatever the appropriate extended version of first preimage resistance gives us). The one thing I don't like is that I don't see how to prove that you can't run it backwards if you manage to acquire a memory dump. In fact, I think that there exist, at least in theory, hash functions that are secure in the random oracle model but that *can* be run backwards given the full state. From memory, SHA-3 has exactly that property, and it would be a bit sad for a CSPRNG to be reversible. We could also periodically mix in a big (128-bit?) chunk of fresh urandom output to keep the bad guys guessing. (P.S. This kind of resembles the duplex sponge construction. If hardware SHA-3 ever shows up, a duplex sponge RNG might be nice indeed.)
Re: HalfSipHash Acceptable Usage
On Wed, Dec 21, 2016 at 9:25 AM, Linus Torvalds wrote: > On Wed, Dec 21, 2016 at 7:55 AM, George Spelvin > wrote: >> >> How much does kernel_fpu_begin()/kernel_fpu_end() cost? > > It's now better than it used to be, but it's absolutely disastrous > still. We're talking easily many hundreds of cycles. Under some loads, > thousands. > > And I warn you already: it will _benchmark_ a hell of a lot better > than it will work in reality. In benchmarks, you'll hit all the > optimizations ("oh, I've already saved away all the FP registers, no > need to do it again"). > > In contrast, in reality, especially with things like "do it once or > twice per incoming packet", you'll easily hit the absolute worst > cases, where not only does it take a few hundred cycles to save the FP > state, you'll then return to user space in between packets, which > triggers the slow-path return code and reloads the FP state, which is > another few hundred cycles plus. Hah, you're thinking that the x86 code works the way that Rik and I want it to work, and you just made my day. :) What actually happens is that the state is saved in kernel_fpu_begin() and restored in kernel_fpu_end(), and it'll take a few hundred cycles best case. If you do it a bunch of times in a loop, you *might* trigger a CPU optimization that notices that the state being saved is the same state that was just restored, but you're still going to pay the full restore code each round trip no matter what. The code is much clearer in 4.10 kernels now that I deleted the unused "lazy" branches. > > Similarly, in benchmarks you'll hit the "modern CPU's power on the AVX > unit and keep it powered up for a while afterwards", while in real > life you would quite easily hit the "oh, AVX is powered down because > we were idle, now it powers up at half speed which is another latency > hit _and_ the AVX unit won't run full out anyway". 
I *think* that was mostly fixed in Broadwell or thereabouts (in terms of latency -- throughput and power consumption still suffers).
Re: HalfSipHash Acceptable Usage
As a separate message, to disentangle the threads, I'd like to talk about get_random_long(). After some thinking, I still like the "state-preserving" construct that's equivalent to the current MD5 code. Yes, we could just do siphash(current_cpu || per_cpu_counter, global_key), but it's nice to preserve a bit more. It requires library support from the SipHash code to return the full SipHash state, but I hope that's a fair thing to ask for. Here's my current straw man design for comment. It's very similar to the current MD5-based design, but feeds all the seed material in the "correct" way, as opposed to XORing directly into the MD5 state. * Each CPU has a (Half)SipHash state vector, "unsigned long get_random_int_hash[4]". Unlike the current MD5 code, we take care to initialize it to an asymmetric state. * There's a global 256-bit random_int_secret (which we could reseed periodically). To generate a random number: * If get_random_int_hash is all-zero, seed it with a fresh half-sized SipHash key and the appropriate XOR constants. * Generate three words of random_get_entropy(), jiffies, and current->pid. (This is arbitrary seed material, copied from the current code.) * Crank through that with (Half)SipHash-1-0. * Crank through the random_int_secret with (Half)SipHash-1-0. * Return v1 ^ v3. Here are the reasons: * The first step is just paranoia, but SipHash's security promise depends on starting with an asymmetric state, we want unique per-CPU states, and it's a one-time cost. * When the input words are themselves secret, there's no security advantage, and almost no speed advantage, to doing two rounds for one input word versus two words with one round each. Thus, SipHash-1. * The above is not exactly true, due to the before+after XOR pattern that SipHash uses, but I think it's true anyway. * Likewise, there's no benefit to unkeyed finalization rounds over keyed ones. That's why I just enlarged the global secret.
* The per-call seed material is hashed first on general principles, because that's the novel part that might have fresh entropy. * To the extent the initial state is secret, the rounds processing the global secret are 4 finalization rounds for the initial state and the per-call entropy. * The final word(s) of the global secret might be vulnerable to analysis, due to incomplete mixing, but since the global secret is always hashed in the same order, and larger than the desired security level, the initial words should be secure. * By carrying forward the full internal state, we ensure that repeated calls return different results, and to the extent that the per-call seed material has entropy, it's preserved. * The final return is all that's needed, since the last steps in the SipRound are "v1 ^= v2" and "v3 ^= v0". It's no security loss, and a very minor speedup. * Also, this avoids directly "exposing" the final XOR with the last word of the global secret (which is made to v0). If I'm allowed to use full SipHash, some shortcuts can be taken, but I believe the above would be secure with HalfSipHash. If additional performance is required, I'd consider shrinking the global secret to 192 bits on 32-bit machines, but I want more than 128 bits of key material, and enough rounds to be equivalent to 4 finalization rounds.
Re: HalfSipHash Acceptable Usage
Theodore Ts'o wrote: > On Wed, Dec 21, 2016 at 01:37:51PM -0500, George Spelvin wrote: >> SipHash annihilates the competition on 64-bit superscalar hardware. >> SipHash dominates the field on 64-bit in-order hardware. >> SipHash wins easily on 32-bit hardware *with enough registers*. >> On register-starved 32-bit machines, it really struggles. > And "with enough registers" includes ARM and MIPS, right? Yes. As a matter of fact, 32-bit ARM does particularly well on 64-bit SipHash due to its shift+op instructions. There is a noticeable performance drop, but nothing catastrophic. The main thing I've been worried about is all the flow tracking and NAT done by small home routers, and that's addressed by using HalfSipHash for the hash tables. They don't *initiate* a lot of TCP sessions. > So the only > real problem is 32-bit x86, and you're right, at that point, only > people who might care are people who are using a space-radiation > hardened 386 --- and they're not likely to be doing high throughput > TCP connections. :-) The only requirement on performance is "don't make DaveM angry." :-) I was just trying to answer the question of why we *worried* about the performance, not specifically argue that we *should* use HalfSipHash.
Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
On Wed, Dec 21, 2016 at 11:27 PM, Theodore Ts'o wrote: > And "with enough registers" includes ARM and MIPS, right? So the only > real problem is 32-bit x86, and you're right, at that point, only > people who might care are people who are using a space-radiation > hardened 386 --- and they're not likely to be doing high throughput > TCP connections. :-) Plus the benchmark was bogus anyway, and when I built a more specific harness -- actually comparing the TCP sequence number functions -- SipHash was faster than MD5, even on register starved x86. So I think we're fine and this chapter of the discussion can come to a close, in order to move on to more interesting things.
Re: HalfSipHash Acceptable Usage
On Wed, Dec 21, 2016 at 01:37:51PM -0500, George Spelvin wrote: > SipHash annihilates the competition on 64-bit superscalar hardware. > SipHash dominates the field on 64-bit in-order hardware. > SipHash wins easily on 32-bit hardware *with enough registers*. > On register-starved 32-bit machines, it really struggles. And "with enough registers" includes ARM and MIPS, right? So the only real problem is 32-bit x86, and you're right, at that point, only people who might care are people who are using a space-radiation hardened 386 --- and they're not likely to be doing high throughput TCP connections. :-) - Ted
Re: HalfSipHash Acceptable Usage
On Wed, Dec 21, 2016 at 7:37 PM, George Spelvin wrote: > SipHash annihilates the competition on 64-bit superscalar hardware. > SipHash dominates the field on 64-bit in-order hardware. > SipHash wins easily on 32-bit hardware *with enough registers*. > On register-starved 32-bit machines, it really struggles. > > As I explained, in that last case, SipHash barely wins at all. > (On a P4, it actually *loses* to MD5, not that anyone cares. Running > on a P4 and caring about performance are mutually exclusive.) From the discussion off list, which examined your benchmark code, it looks like we're going to move ahead with SipHash.
Re: HalfSipHash Acceptable Usage
Eric Dumazet wrote:
> Now I am quite confused.
>
> George said :
>> Cycles per byte on 1024 bytes of data:
>>                 Pentium  Core 2  Ivy
>>                 4        Duo     Bridge
>> SipHash-2-4     38.9     8.3     5.8
>> HalfSipHash-2-4 12.7     4.5     3.2
>> MD5              8.3     5.7     4.7
>
> That really was for 1024 bytes blocks, so pretty much useless for our
> discussion ?

No, they're actually quite relevant, but you have to interpret them correctly. I thought I explained in the text following that table, but let me make it clearer: To find the time to compute the SipHash of N bytes, round (N+17) up to the next multiple of 8 bytes and multiply by the numbers above. To find the time to compute the HalfSipHash of N bytes, round (N+9) up to the next multiple of 4 bytes and multiply by the numbers above. To find the time to compute the MD5 of N bytes, round (N+9) up to the next multiple of 64 bytes and multiply by the numbers above. It's the different rounding rules that make all the difference. For small input blocks, SipHash can be slower per byte yet still faster because it hashes fewer bytes.

> Reading your numbers last week, I thought SipHash was faster, but George
> numbers are giving the opposite impression.

SipHash annihilates the competition on 64-bit superscalar hardware. SipHash dominates the field on 64-bit in-order hardware. SipHash wins easily on 32-bit hardware *with enough registers*. On register-starved 32-bit machines, it really struggles. As I explained, in that last case, SipHash barely wins at all. (On a P4, it actually *loses* to MD5, not that anyone cares. Running on a P4 and caring about performance are mutually exclusive.)
Re: HalfSipHash Acceptable Usage
Linus wrote: >> How much does kernel_fpu_begin()/kernel_fpu_end() cost? > > It's now better than it used to be, but it's absolutely disastrous > still. We're talking easily many hundreds of cycles. Under some loads, > thousands. I think I've been thoroughly dissuaded, but just to clarify one thing that resembles a misunderstanding: > In contrast, in reality, especially with things like "do it once or > twice per incoming packet", you'll easily hit the absolute worst > cases, where not only does it take a few hundred cycles to save the FP > state, you'll then return to user space in between packets, which > triggers the slow-path return code and reloads the FP state, which is > another few hundred cycles plus. Everything being discussed is per-TCP-connection overhead, *not* per packet. (Twice for outgoing connections, because one is to generate the ephemeral port number.) I know you know this, but I don't want anyone spectating to be confused about it.
Re: HalfSipHash Acceptable Usage
On Wed, Dec 21, 2016 at 7:55 AM, George Spelvin wrote: > > How much does kernel_fpu_begin()/kernel_fpu_end() cost? It's now better than it used to be, but it's absolutely disastrous still. We're talking easily many hundreds of cycles. Under some loads, thousands. And I warn you already: it will _benchmark_ a hell of a lot better than it will work in reality. In benchmarks, you'll hit all the optimizations ("oh, I've already saved away all the FP registers, no need to do it again"). In contrast, in reality, especially with things like "do it once or twice per incoming packet", you'll easily hit the absolute worst cases, where not only does it take a few hundred cycles to save the FP state, you'll then return to user space in between packets, which triggers the slow-path return code and reloads the FP state, which is another few hundred cycles plus. Similarly, in benchmarks you'll hit the "modern CPU's power on the AVX unit and keep it powered up for a while afterwards", while in real life you would quite easily hit the "oh, AVX is powered down because we were idle, now it powers up at half speed which is another latency hit _and_ the AVX unit won't run full out anyway". Don't do it. There are basically no real situations where the AVX state optimizations help for the kernel. We just don't have the loop counts to make up for the problems it causes. The one exception is likely if you're doing things like high-throughput disk IO encryption, and then you'd be much better off using SHA256 instead (which often has hw encryption on modern CPU's - both x86 and ARM). (I'm sure that you could see it on some high-throughput network benchmark too when the benchmark entirely saturates the CPU. And then in real life it would suck horribly for all the reasons above). Linus
Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
On Wed, 2016-12-21 at 11:39 -0500, Rik van Riel wrote: > Does anybody still have a P4? > > If they do, they're probably better off replacing > it with an Atom. The reduced power bills will pay > for replacing that P4 within a year or two. Well, maybe they have millions of units to replace. > > In short, I am not sure how important the P4 > performance numbers are, especially if we can > improve security for everybody else... Worth adding that the ISN or syncookie generation is less than 10% of the actual cost of handling a problematic (having to generate an ISN or syncookie) TCP packet anyway. So we are talking about minor potential impact for '2000-era' CPUs. Definitely I vote for using SipHash in TCP ASAP.
Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
On Wed, 2016-12-21 at 10:55 -0500, George Spelvin wrote: > Actually, DJB just made a very relevant suggestion. > > As I've mentioned, the 32-bit performance problems are an x86-specific > problem. ARM does very well, and other processors aren't bad at all. > > SipHash fits very nicely (and runs very fast) in the MMX registers. > > They're 64 bits, and there are 8 of them, so the integer registers can > be reserved for pointers and loop counters and all that. And there's > reference code available. > > How much does kernel_fpu_begin()/kernel_fpu_end() cost? Those can be very expensive. Almost certainly not worth it for small amounts of data. -- All Rights Reversed.
Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
On Wed, 2016-12-21 at 07:56 -0800, Eric Dumazet wrote: > On Wed, 2016-12-21 at 15:42 +0100, Jason A. Donenfeld wrote: > George said :
> > Cycles per byte on 1024 bytes of data:
> >                 Pentium  Core 2  Ivy
> >                 4        Duo     Bridge
> > SipHash-2-4     38.9     8.3     5.8
> > HalfSipHash-2-4 12.7     4.5     3.2
> > MD5              8.3     5.7     4.7
> That really was for 1024 bytes blocks, so pretty much useless for our discussion ? > Reading your numbers last week, I thought SipHash was faster, but George numbers are giving the opposite impression. > I do not have a P4 to make tests, so I only can trust you or George. Does anybody still have a P4? If they do, they're probably better off replacing it with an Atom. The reduced power bills will pay for replacing that P4 within a year or two. In short, I am not sure how important the P4 performance numbers are, especially if we can improve security for everybody else... -- All Rights Reversed.
Re: HalfSipHash Acceptable Usage
Hi Eric, On Wed, Dec 21, 2016 at 4:56 PM, Eric Dumazet wrote: > That really was for 1024 bytes blocks, so pretty much useless for our > discussion ? > > Reading your numbers last week, I thought SipHash was faster, but George > numbers are giving the opposite impression. > > I do not have a P4 to make tests, so I only can trust you or George. I'm not sure how George came up with those numbers, but the ones I sent are output from that benchmark function in the last email. I'd be interested in learning this too. As mentioned in the last email, it looks like potential 32-bit issues are really just specific to old Intel chips. Other 32-bit architectures do fine. So, for new kernels, even if somehow there is a tiny performance regression (though I couldn't see one) on old architectures, I really doubt it will affect anybody in practice. Jason
Re: HalfSipHash Acceptable Usage
Hi George, On Wed, Dec 21, 2016 at 4:55 PM, George Spelvin wrote: > Actually, DJB just made a very relevant suggestion. > > As I've mentioned, the 32-bit performance problems are an x86-specific > problem. ARM does very well, and other processors aren't bad at all. > > SipHash fits very nicely (and runs very fast) in the MMX registers. > > They're 64 bits, and there are 8 of them, so the integer registers can > be reserved for pointers and loop counters and all that. And there's > reference code available. > > How much does kernel_fpu_begin()/kernel_fpu_end() cost? In my experience, these functions are only worth calling when processing more significant amounts of data. I don't have any benchmarks, but when I _remove_ all of these calls in a kernel, accelerated crypto gets noticeably faster (until the system crashes). We can measure it, though. By the way, if somehow SipHash becomes acceptably fast on x86, would you consider HalfSipHash for hash tables to be no longer needed? Or do you suspect that HalfSipHash will always be faster even on, say, 32-bit ARM? Jason
Re: HalfSipHash Acceptable Usage
On Wed, 2016-12-21 at 15:42 +0100, Jason A. Donenfeld wrote: > Hi Eric, > > I computed performance numbers for both 32-bit and 64-bit using the > actual functions in which we're talking about replacing MD5 with SipHash. > The basic harness is here [1] if you're curious. SipHash was a pretty > clear winner for both cases. > > x86_64: > [1.714302] secure_tcpv6_sequence_number_md5# cycles: 102373398 > [1.747685] secure_tcp_sequence_number_md5# cycles: 92042258 > [1.773522] secure_tcpv6_sequence_number_siphash# cycles: 70786533 > [1.798701] secure_tcp_sequence_number_siphash# cycles: 68941043 > > x86: > [1.635749] secure_tcpv6_sequence_number_md5# cycles: 106016335 > [1.670259] secure_tcp_sequence_number_md5# cycles: 95670512 > [1.708387] secure_tcpv6_sequence_number_siphash# cycles: 105988635 > [1.740264] secure_tcp_sequence_number_siphash# cycles: 88225395 > > >>> 102373398 > 70786533 > True > >>> 92042258 > 68941043 > True > >>> 106016335 > 105988635 > True > >>> 95670512 > 88225395 > True > > While MD5 is probably faster for some kind of large-data > cycles-per-byte, due to its 64-byte internal state, SipHash -- the > "Sip" part standing for "Short Input PRF" -- is fast for shorter inputs. > In practice with the functions we're talking about replacing, there's > no need to hash 64 bytes. So, SipHash comes out faster and more > secure. > > I also haven't begun to look closely at the assembly my SipHash > implementation is generating, which means there's still room for even > more performance improvements. > > Jason > > > [1] > https://git.zx2c4.com/linux-dev/tree/net/core/secure_seq.c?h=siphash-bench#n194 Now I am quite confused. George said :

> Cycles per byte on 1024 bytes of data:
>                 Pentium  Core 2  Ivy
>                 4        Duo     Bridge
> SipHash-2-4     38.9     8.3     5.8
> HalfSipHash-2-4 12.7     4.5     3.2
> MD5              8.3     5.7     4.7

That really was for 1024-byte blocks, so pretty much useless for our discussion?
Reading your numbers last week, I thought SipHash was faster, but George's numbers give the opposite impression. I do not have a P4 to run tests, so I can only trust you or George. Thanks.
Re: HalfSipHash Acceptable Usage
Actually, DJB just made a very relevant suggestion. As I've mentioned, the 32-bit performance problems are an x86-specific problem. ARM does very well, and other processors aren't bad at all. SipHash fits very nicely (and runs very fast) in the MMX registers. They're 64 bits, and there are 8 of them, so the integer registers can be reserved for pointers and loop counters and all that. And there's reference code available. How much does kernel_fpu_begin()/kernel_fpu_end() cost? Although there are a lot of pre-MMX x86es in embedded control applications, I don't think anyone is worried about their networking performance. (Specifically, all of this affects only connection setup, not throughput on established connections.)
Re: HalfSipHash Acceptable Usage
Hi Eric, I computed performance numbers for both 32-bit and 64-bit using the actual functions in which we're talking about replacing MD5 with SipHash. The basic harness is here [1] if you're curious. SipHash was a pretty clear winner for both cases. x86_64: [1.714302] secure_tcpv6_sequence_number_md5# cycles: 102373398 [1.747685] secure_tcp_sequence_number_md5# cycles: 92042258 [1.773522] secure_tcpv6_sequence_number_siphash# cycles: 70786533 [1.798701] secure_tcp_sequence_number_siphash# cycles: 68941043 x86: [1.635749] secure_tcpv6_sequence_number_md5# cycles: 106016335 [1.670259] secure_tcp_sequence_number_md5# cycles: 95670512 [1.708387] secure_tcpv6_sequence_number_siphash# cycles: 105988635 [1.740264] secure_tcp_sequence_number_siphash# cycles: 88225395 >>> 102373398 > 70786533 True >>> 92042258 > 68941043 True >>> 106016335 > 105988635 True >>> 95670512 > 88225395 True While MD5 is probably faster for some kind of large-data cycles-per-byte, due to its 64-byte internal state, SipHash -- the "Sip" part standing for "Short Input PRF" -- is fast for shorter inputs. In practice with the functions we're talking about replacing, there's no need to hash 64 bytes. So, SipHash comes out faster and more secure. I also haven't begun to look closely at the assembly my SipHash implementation is generating, which means there's still room for even more performance improvements. Jason [1] https://git.zx2c4.com/linux-dev/tree/net/core/secure_seq.c?h=siphash-bench#n194
Re: HalfSipHash Acceptable Usage
Hi George, On Wed, Dec 21, 2016 at 7:34 AM, George Spelvin wrote: > In fact, I have an idea. Allow me to make the following concrete > suggestion for using HalfSipHash with 128 bits of key material: > > - 64 bits are used as the key. > - The other 64 bits are used as an IV which is prepended to > the message to be hashed. > > As a matter of practical implementation, we precompute the effect > of hashing the IV and store the 128-bit HalfSipHash state, which > is used just like a 128-bit key. > > Because of the way it is constructed, it is obviously no weaker than > standard HalfSipHash's 64-bit security claim. > > I don't know the security of this, and it's almost certainly weaker than > 128 bits, but I *hope* it's at least a few bits stronger than 64 bits. > 80 would be enough to dissuade any attacker without a six-figure budget > (that's per attack, not a one-time capital investment). 96 would be > ample for our purposes. > > What I do know is that it makes a brute-force attack without > significant cryptanalytic effort impossible. Depends who's doing the cryptanalytic effort I guess. Please don't roll your own crypto. It's a dangerous road. Putting homebrew crypto into the kernel would be an error. Let's stick with the constructions and security margins that the cryptographers give us. JP made that fairly clear, I thought. There are already people working on this problem who undergo peer review and a career devoted to solving these problems. One result for small systems that need 128-bit security is Chaskey, which you can go read about if you're curious. Jason
Re: HalfSipHash Acceptable Usage
Eric Dumazet wrote: > On Tue, 2016-12-20 at 22:28 -0500, George Spelvin wrote: >> Cycles per byte on 1024 bytes of data: >> Pentium Core 2 Ivy >> 4 Duo Bridge >> SipHash-2-4 38.9 8.3 5.8 >> HalfSipHash-2-4 12.7 4.5 3.2 >> MD5 8.3 5.7 4.7 > > So definitely not faster. > > 38 cycles per byte is a problem, considering IPV6 is ramping up. As I said earlier, SipHash performance on 32-bit x86 really sucks, because it wants an absolute minimum of 9 32-bit registers (8 for the state plus one temporary for the rotates), and x86 has 7. > What about SHA performance (syncookies) on P4 ? I recompiled with -mtune=pentium4 and re-ran. MD5 time went *up* by 0.3 cycles/byte, HalfSipHash went down by 1 cycle, and SipHash didn't change: Cycles per byte on 1024 bytes of data: Pentium Core 2 Ivy 4 Duo Bridge SipHash-2-4 38.9 8.3 5.8 HalfSipHash-2-4 11.5 4.5 3.2 MD5 8.6 5.7 4.7 SHA-1 19.0 8.0 6.8 (This is with a verbatim copy of the lib/sha1.c code; I might be able to optimize it with some asm hackery.) Anyway, you see why we were looking longingly at HalfSipHash. In fact, I have an idea. Allow me to make the following concrete suggestion for using HalfSipHash with 128 bits of key material: - 64 bits are used as the key. - The other 64 bits are used as an IV which is prepended to the message to be hashed. As a matter of practical implementation, we precompute the effect of hashing the IV and store the 128-bit HalfSipHash state, which is used just like a 128-bit key. Because of the way it is constructed, it is obviously no weaker than standard HalfSipHash's 64-bit security claim. I don't know the security of this, and it's almost certainly weaker than 128 bits, but I *hope* it's at least a few bits stronger than 64 bits. 80 would be enough to dissuade any attacker without a six-figure budget (that's per attack, not a one-time capital investment). 96 would be ample for our purposes. What I do know is that it makes a brute-force attack without significant cryptanalytic effort impossible. 
To match the spec exactly, we'd need to add the 8-byte IV length to the length byte which pads the final block, but from a security point of view, it does not matter. As long as we are consistent within any single key, any unique mapping between padding byte and message length (mod 256) is equally good. We may choose based on implementation convenience. (Also note my earlier comments about when it is okay to omit the padding length byte entirely: any time all the data to be hashed with a given key is fixed in format or self-delimiting (e.g. null-terminated). This applies to many of the networking uses.)
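The construction can be sketched in Python. This is a sketch only: the round function and constants follow the published HalfSipHash reference, but the function names (hsiphash, precompute, hsiphash_resumed) are illustrative, not kernel API, and the prefix_len parameter implements the "add the 8-byte IV length to the length byte" choice just described.

```python
# Sketch of the idea above: spend 128 key bits on HalfSipHash by using 64 bits
# as the key and 64 bits as an IV hashed up front, caching the 4x32-bit state.
MASK = 0xffffffff

def _rotl(x, b):
    return ((x << b) | (x >> (32 - b))) & MASK

def _round(v):
    # One HalfSipHash SipRound (rotation constants per the reference code).
    v0, v1, v2, v3 = v
    v0 = (v0 + v1) & MASK; v1 = _rotl(v1, 5) ^ v0; v0 = _rotl(v0, 16)
    v2 = (v2 + v3) & MASK; v3 = _rotl(v3, 8) ^ v2
    v0 = (v0 + v3) & MASK; v3 = _rotl(v3, 7) ^ v0
    v2 = (v2 + v1) & MASK; v1 = _rotl(v1, 13) ^ v2; v2 = _rotl(v2, 16)
    return [v0, v1, v2, v3]

def _absorb(v, data):
    # Absorb full 4-byte little-endian words; len(data) must be a multiple of 4.
    for i in range(0, len(data), 4):
        m = int.from_bytes(data[i:i + 4], 'little')
        v[3] ^= m
        v = _round(_round(v))          # 2 compression rounds
        v[0] ^= m
    return v

def init_state(key):                    # key: 8 bytes (the 64-bit key)
    k0 = int.from_bytes(key[:4], 'little')
    k1 = int.from_bytes(key[4:], 'little')
    return [k0, k1, 0x6c796765 ^ k0, 0x74656462 ^ k1]

def finish(v, msg, prefix_len=0):
    # prefix_len is counted into the padding length byte, so a resumed hash
    # matches the spec-exact choice described above.
    total = (prefix_len + len(msg)) & 0xff
    left = len(msg) % 4
    blocks, tail = msg[:len(msg) - left], msg[len(msg) - left:]
    v = _absorb(v, blocks)
    b = total << 24 | int.from_bytes(tail, 'little')
    v[3] ^= b
    v = _round(_round(v))
    v[0] ^= b
    v[2] ^= 0xff                        # finalization constant, 32-bit output
    for _ in range(4):                  # 4 finalization rounds
        v = _round(v)
    return v[1] ^ v[3]                  # 32-bit tag

def hsiphash(key, msg):
    return finish(init_state(key), msg)

def precompute(key, iv):                # iv: 8 bytes, absorbed up front
    return _absorb(init_state(key), iv)

def hsiphash_resumed(state, msg):
    return finish(list(state), msg, prefix_len=8)
```

Because the IV's effect is captured entirely in the four 32-bit state words, precompute() behaves exactly like a 128-bit key from the caller's point of view: hsiphash_resumed(precompute(key, iv), msg) equals hsiphash(key, iv + msg).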
Re: HalfSipHash Acceptable Usage
On Tue, 2016-12-20 at 22:28 -0500, George Spelvin wrote: > > I do not see why SipHash, if faster than MD5 and more secure, would be a > > problem. > > Because on 32-bit x86, it's slower. > > Cycles per byte on 1024 bytes of data: > Pentium Core 2 Ivy > 4 Duo Bridge > SipHash-2-4 38.9 8.3 5.8 > HalfSipHash-2-4 12.7 4.5 3.2 > MD5 8.3 5.7 4.7 So definitely not faster. 38 cycles per byte is a problem, considering IPv6 is ramping up. But TCP session establishment on P4 is probably not a big deal. Nobody would expect a P4 to handle gazillions of TCP flows (using a 32-bit kernel). What about SHA performance (syncookies) on P4? Synfloods are probably the only case we might take care of for 2000-era CPUs.
Re: HalfSipHash Acceptable Usage
> I do not see why SipHash, if faster than MD5 and more secure, would be a > problem. Because on 32-bit x86, it's slower. Cycles per byte on 1024 bytes of data: Pentium Core 2 Ivy 4 Duo Bridge SipHash-2-4 38.9 8.3 5.8 HalfSipHash-2-4 12.7 4.5 3.2 MD5 8.3 5.7 4.7 SipHash is more parallelizable and runs faster on superscalar processors, but MD5 is optimized for 2000-era processors, and is faster on them than HalfSipHash even. Now, in the applications we care about, we're hashing short blocks, and SipHash has the advantage that it can hash less than 64 bytes. But it also pays a penalty on short blocks for the finalization, equivalent to two words (16 bytes) of input. It turns out that on both Ivy Bridge and Core 2 Duo, the crossover happens between 23 (SipHash is faster) and 24 (MD5 is faster) bytes of input. This is assuming you're adding the 1 byte of length padding to SipHash's input, so 24 bytes pads to 4 64-bit words, which makes 2*4+4 = 12 rounds, vs. one block for MD5. (MD5 takes a similar jump between 55 and 56 bytes.) On a P4, SipHash is *never* faster; it takes 2.5x longer than MD5 on a 12-byte block (an IPv4 address/port pair). This is why there was discussion of using HalfSipHash on these machines. (On a P4, the HalfSipHash/MD5 crossover is somewhere between 24 and 31 bytes; I haven't benchmarked every possible size.)
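The crossover arithmetic above can be reproduced mechanically. A sketch, counting only SipRounds and MD5 compression blocks under the padding rules just described (it says nothing about per-round cost, which is what the cycle counts measure):

```python
def siphash24_rounds(n):
    """SipRounds for SipHash-2-4 over n input bytes: the input plus one
    length byte is padded to 8-byte words, each word costs 2 compression
    rounds, and finalization costs 4 more."""
    words = (n + 1 + 7) // 8
    return 2 * words + 4

# The crossover described above: 23 bytes pads to 3 words, 24 bytes to 4.
assert siphash24_rounds(23) == 2 * 3 + 4   # 10 rounds
assert siphash24_rounds(24) == 2 * 4 + 4   # 12 rounds

def md5_blocks(n):
    """64-byte compression blocks for MD5 over n input bytes: padding adds
    at least 9 bytes (an 0x80 byte plus an 8-byte length field)."""
    return (n + 9 + 63) // 64

# The similar MD5 jump between 55 and 56 bytes:
assert md5_blocks(55) == 1
assert md5_blocks(56) == 2
```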
Re: HalfSipHash Acceptable Usage
On Tue, 2016-12-20 at 16:36 -0500, Theodore Ts'o wrote: > On Mon, Dec 19, 2016 at 06:32:44PM +0100, Jason A. Donenfeld wrote: > > 1) Anything that requires actual long-term security will use > > SipHash2-4, with the 64-bit output and the 128-bit key. This includes > > things like TCP sequence numbers. This seems pretty uncontroversial to > > me. Seem okay to you? > > Um, why do TCP sequence numbers need long-term security? So long as > you rekey every 5 minutes or so, TCP sequence numbers don't need any > more security than that, since even if you break the key used to > generate initial sequence numbers even a minute or two later, any > pending TCP connections will have timed out long before. > > See the security analysis done in RFC 6528[1], where among other > things, it points out why MD5 is acceptable with periodic rekeying, > although there is the concern that this could break certain heuristics > used when establishing new connections during the TIME-WAIT state. > > [1] https://tools.ietf.org/html/rfc6528 We do not use rekeying for TCP ISN, not anymore after commit 6e5714eaf77d79ae1 (where we switched from MD4 to MD5). It might hurt some common cases and I do not believe it is mandated by a current (i.e. not obsolete) RFC. Our clock has a 64 ns resolution and 274-second period (commit 9b42c336d0641) (compared to the 4 usec one in RFC 6528). I do not see why SipHash, if faster than MD5 and more secure, would be a problem. Same for syncookies. BTW, we probably should add a ratelimit on SYNACK retransmits, because it seems that attackers have understood that Linux kernels resist synfloods, and they (the bad guys) use reflection attacks.
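Those clock figures are self-consistent, assuming a 32-bit ISN clock counter (an assumption here; the cited commit has the authoritative details): 2^32 ticks of 64 ns is just under 275 seconds.

```python
# Wrap period of a 32-bit counter ticking every 64 ns (the ISN clock above),
# versus RFC 6528's 4 us reference tick (assuming the same counter width).
TICKS = 2 ** 32
linux_period_s = TICKS * 64e-9   # ~274.9 s, matching the "274 second period"
rfc6528_period_s = TICKS * 4e-6  # ~4.8 hours at 4 us per tick

assert 274 < linux_period_s < 275
```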
Re: HalfSipHash Acceptable Usage
Theodore Ts'o wrote: > On Mon, Dec 19, 2016 at 06:32:44PM +0100, Jason A. Donenfeld wrote: >> 1) Anything that requires actual long-term security will use >> SipHash2-4, with the 64-bit output and the 128-bit key. This includes >> things like TCP sequence numbers. This seems pretty uncontroversial to >> me. Seem okay to you? > Um, why do TCP sequence numbers need long-term security? So long as > you rekey every 5 minutes or so, TCP sequence numbers don't need any > more security than that, since even if you break the key used to > generate initial sequence numbers even a minute or two later, any > pending TCP connections will have timed out long before. > > See the security analysis done in RFC 6528[1], where among other > things, it points out why MD5 is acceptable with periodic rekeying, > although there is the concern that this could break certain heuristics > used when establishing new connections during the TIME-WAIT state. Because we don't rekey TCP sequence numbers, ever. See commit 6e5714eaf77d79ae1c8b47e3e040ff5411b717ec. To rekey them requires dividing the sequence number base into a "random" part and some "generation" msbits. While we can do better than the previous 8+24 split (I'd suggest 4+28 or 3+29), only 2 generation bits is tricky, and 1 generation bit isn't enough. So while it helps in the long term, it reduces the security offered by the random part in the short term. (If I know 4 bits of your ISN, I only need to send 256 MB to hit your TCP window.) At the time, I objected, and suggested doing two hashes, with a fixed 32-bit base plus a split rekeyed portion, but that was vetoed on the grounds of performance. On further consideration, the fixed base doesn't help much. (Details below for anyone that cares.) Suppose we let the TCP initial sequence number be: (Hash(<4-tuple>, fixed_key) & 0xffffffff) + (i << 28) + (Hash(<4-tuple>, key[i]) & 0x0fffffff) + (current_time_in_nanoseconds / 64) It's not hugely difficult to mount an effective attack against a 64-bit fixed_key.
As an attacker, I can ask the target to send me these numbers for dstPort values I control and other values I know. I can (with high probability) detect the large jumps when the generation changes, so I can make a significant number of queries with the same generation. After 23-ish queries, I have enough information to identify a 64-bit fixed_key. I don't know the current generation counter "i", but I know it's the same for all my queries, so for any two queries, the maximum difference between the 28-bit hash values is 29 bits. (We can also add a small margin to allow for timing uncertainty, but that's even less.) So if I guess a fixed key, hash my known plaintexts with that guess, subtract the ciphertexts from the observed sequence numbers, and the difference between the remaining (unknown) 28-bit hash values plus timestamps exceeds what's possible, my guess is wrong. I can then repeat with additional known plaintexts, reducing the space of admissible keys by about 3 bits each time. Assuming I can rent GPU horsepower from a bitcoin miner to do this in a reasonable period of time, after 22 known plaintext differences, I have uniquely identified the key. Of course, in practice I'd do a first pass with maybe 6 plaintexts on the GPU, and then deal with the candidates found in a second pass. But either way, it's about 2.3 SipHash evaluations per key tested. As I noted earlier, a bitcoin blockchain block, worth 25 bitcoins, currently costs 2^71 evaluations of SHA-2 (2^70 evaluations of double SHA-2), and that's accomplished every 10 minutes, so this is definitely practical.
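The two cost figures in this sketch (22 queries, about 2.3 evaluations per key) follow from the stated assumption that each known-plaintext difference independently rules out all but ~1/8 of candidate keys. One way to reproduce them (my arithmetic, not George's code):

```python
import math

KEY_BITS = 64
BITS_PER_QUERY = 3   # each difference constraint keeps ~2^-3 of the keys

# Known-plaintext differences needed to pin down the key uniquely:
queries = math.ceil(KEY_BITS / BITS_PER_QUERY)
assert queries == 22

# Expected work per candidate key: each test hashes two known plaintexts
# under the guessed key, and only ~1/8 of candidates survive to need the
# next test, so the costs form a geometric series: 2 * (1 / (1 - 1/8)).
evals_per_key = 2 / (1 - 2 ** -BITS_PER_QUERY)
assert 2.28 < evals_per_key < 2.29   # the "about 2.3" figure above
```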
Re: HalfSipHash Acceptable Usage
On Mon, Dec 19, 2016 at 06:32:44PM +0100, Jason A. Donenfeld wrote: > 1) Anything that requires actual long-term security will use > SipHash2-4, with the 64-bit output and the 128-bit key. This includes > things like TCP sequence numbers. This seems pretty uncontroversial to > me. Seem okay to you? Um, why do TCP sequence numbers need long-term security? So long as you rekey every 5 minutes or so, TCP sequence numbers don't need any more security than that, since even if you break the key used to generate initial sequence numbers even a minute or two later, any pending TCP connections will have timed out long before. See the security analysis done in RFC 6528[1], where among other things, it points out why MD5 is acceptable with periodic rekeying, although there is the concern that this could break certain heuristics used when establishing new connections during the TIME-WAIT state. [1] https://tools.ietf.org/html/rfc6528 - Ted
Re: HalfSipHash Acceptable Usage
Hi JP, On Mon, Dec 19, 2016 at 9:49 PM, Jean-Philippe Aumasson wrote: > > On Mon, Dec 19, 2016 at 6:32 PM Jason A. Donenfeld wrote: >> >> Hi JP, >> >> With the threads getting confusing, I've been urged to try and keep >> the topics and threads more closely constrained. Here's where we're >> at, and here's the current pressing security concern. It'd be helpful >> to have a definitive statement on what you think is best, so we can >> just build on top of that, instead of getting lost in the chorus of >> opinions. >> >> 1) Anything that requires actual long-term security will use >> SipHash2-4, with the 64-bit output and the 128-bit key. This includes >> things like TCP sequence numbers. This seems pretty uncontroversial to >> me. Seem okay to you? > > > > Right, since 2012 when we published SipHash many cryptanalysts attempted to > break SipHash-2-4 with a 128-bit key, for various notions of "break", and > nothing worth worrying about was ever found. I'm totally confident that > SipHash-2-4 will live up to its security promises. > > Don't use something weaker for things like TCP sequence numbers or RNGs. Use > SipHash2-4 for those. That is the correct choice. > >> >> >> 2) People seem to want something competitive, performance-wise, with >> jhash if it's going to replace jhash. The kernel community >> instinctively pushes back on anything that could harm performance, >> especially in networking and in critical data structures, so there >> have been some calls for something faster than SipHash.
So, questions >> regarding this: >> > > No free lunch I guess: either go with a cryptographically secure, > time-proved keyed hash such as SipHash, or go with some simpler hash deemed > secure cos its designer can't break it :) #DontRollYourOwnCrypto > >> 2a) George thinks that HalfSipHash on 32-bit systems will have roughly >> comparable speed as SipHash on 64-bit systems, so the idea would be to >> use HalfSipHash on 32-bit systems' hash tables and SipHash on 64-bit >> systems' hash tables. The big obvious question is: does HalfSipHash >> have a sufficient security margin for hashtable usage and hashtable >> attacks? I'm not wondering about the security margin for other usages, >> but just of the hashtable usage. In your opinion, does HalfSipHash cut >> it? > > > HalfSipHash takes its core function from Chaskey and uses the same > construction as SipHash, so it *should* be secure. Nonetheless it hasn't > received the same amount of attention as 64-bit SipHash did. So I'm less > confident about its security than about SipHash's, but it obviously inspires > a lot more confidence than non-crypto hashes. > > Too, HalfSipHash only has a 64-bit key, not a 128-bit key like SipHash, so > only use this as a mitigation for hash-flooding attacks, where the output of > the hash function is never directly shown to the caller. Do not use > HalfSipHash for TCP sequence numbers or RNGs. > > >> >> >> 2b) While I certainly wouldn't consider making the use case in >> question (1) employ a weaker function, for this question (2), there >> has been some discussion about using HalfSipHash1-3 (or SipHash1-3 on >> 64-bit) instead of 2-4. So, the same question is therefore posed: >> would using HalfSipHash1-3 give a sufficient security margin for >> hashtable usage and hashtable attacks? > > > My educated guess is that yes, it will, but that it may not withstand > cryptanalysis as a pseudorandom function (PRF).
For example I wouldn't be > surprised if there were a "distinguishing attack" that detects non-random > patterns in HalfSipHash-1-3's output. But most of the non-crypto hashes I've > seen have obvious distinguishing attacks. So the upshot is that HSH will get > you better security than AnyWeakHash even with 1 & 3 rounds. > > So, if you're willing to compromise on security, but still want something > not completely unreasonable, you might be able to get away with using > HalfSipHash1-3 as a replacement for jhash—in circumstances where the output > of the hash function is kept secret—in order to mitigate hash-flooding > attacks. > Thanks for the detailed response. I will continue exactly how you've specified. 1. SipHash2-4 for TCP sequence numbers, syncookies, and RNG. IOW, the things that MD5 is used for now. 2. HalfSipHash1-3 for hash tables where the output is not revealed, for jhash replacements. On 64-bit this will alias to SipHash1-3. 3. I will write Documentation/siphash.txt detailing this. 4. I'll continue to discourage other kernel developers from rolling their own crypto or departing from the tried&true in substantial ways. Thanks again, Jason
HalfSipHash Acceptable Usage
Hi JP, With the threads getting confusing, I've been urged to try and keep the topics and threads more closely constrained. Here's where we're at, and here's the current pressing security concern. It'd be helpful to have a definitive statement on what you think is best, so we can just build on top of that, instead of getting lost in the chorus of opinions. 1) Anything that requires actual long-term security will use SipHash2-4, with the 64-bit output and the 128-bit key. This includes things like TCP sequence numbers. This seems pretty uncontroversial to me. Seem okay to you? 2) People seem to want something competitive, performance-wise, with jhash if it's going to replace jhash. The kernel community instinctively pushes back on anything that could harm performance, especially in networking and in critical data structures, so there have been some calls for something faster than SipHash. So, questions regarding this: 2a) George thinks that HalfSipHash on 32-bit systems will have roughly comparable speed as SipHash on 64-bit systems, so the idea would be to use HalfSipHash on 32-bit systems' hash tables and SipHash on 64-bit systems' hash tables. The big obvious question is: does HalfSipHash have a sufficient security margin for hashtable usage and hashtable attacks? I'm not wondering about the security margin for other usages, but just of the hashtable usage. In your opinion, does HalfSipHash cut it? 2b) While I certainly wouldn't consider making the use case in question (1) employ a weaker function, for this question (2), there has been some discussion about using HalfSipHash1-3 (or SipHash1-3 on 64-bit) instead of 2-4. So, the same question is therefore posed: would using HalfSipHash1-3 give a sufficient security margin for hashtable usage and hashtable attacks? My plan is essentially to implement things according to your security recommendation. The thread started with me pushing a heavy duty security solution -- SipHash2-4 -- for _everything_. 
I've received understandable push back on that notion for certain use cases. So now I'm trying to discover what the most acceptable compromise is. Your answers on (2a) and (2b) will direct that compromise. Thanks again, Jason