Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5
Hi Ted,

On Thu, Dec 22, 2016 at 6:41 AM, Theodore Ts'o wrote:
> The bottom line is that I think we're really "pixel peeping" at this
> point --- which is what obsessed digital photographers will do when
> debating the quality of a Canon vs Nikon DSLR by blowing up a photo by
> a thousand times, and then trying to claim that this is visible to the
> human eye. Or people who obsess over the frequency response curves
> of TH-X00 headphones with Mahogany vs Purpleheart wood, when it's
> likely that in a blind head-to-head comparison, most people wouldn't
> be able to tell the difference.

This is hilarious, thanks for the laugh. I believe you're right about this...

> I think the main argument for using the batched getrandom approach is
> that it is, I would argue, simpler than introducing siphash into the
> picture. On 64-bit platforms it is faster and more consistent, so
> it's basically that versus the complexity of having to add siphash to
> the things that people have to analyze when considering random number
> security on Linux. But it's a close call either way, I think.

I find this compelling. We'll have one csprng for both get_random_int/long and for /dev/urandom, and we'll be able to update that silly warning in the comment of get_random_int/long to read "gives output of either rdrand quality or of /dev/urandom quality", which makes it more useful for other things. It introduces less error-prone code, and it lets the RNG analysis be spent on just one RNG, not two.

So, with your blessing, I'm going to move ahead with implementing a pretty version of this for v8.

Regards,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
On Wed, Dec 21, 2016 at 9:01 PM, George Spelvin wrote:
> Andy Lutomirski wrote:
>> I don't even think it needs that. This is just adding a
>> non-destructive final operation, right?
>
> It is, but the problem is that SipHash is intended for *small* inputs,
> so the standard implementations aren't broken into init/update/final
> functions.
>
> There's just one big function that keeps the state variables in
> registers and never stores them anywhere.
>
> If we *had* init/update/final functions, then it would be trivial.
>
>> Just to clarify, if we replace SipHash with a black box, I think this
>> effectively means, where "entropy" is random_get_entropy() || jiffies
>> || current->pid:
>>
>> The first call returns H(random seed || entropy_0 || secret). The
>> second call returns H(random seed || entropy_0 || secret || entropy_1
>> || secret). Etc.
>
> Basically, yes. I was skipping the padding byte and keying the
> finalization rounds on the grounds of "can't hurt and might help",
> but we could do it a more standard way.
>
>> If not, then I have a fairly strong preference to keep whatever
>> construction we come up with consistent with something that could
>> actually happen with invocations of unmodified SipHash -- then all the
>> security analysis on SipHash goes through.
>
> Okay. I don't think it makes a difference, but it's not a *big* waste
> of time. If we have finalization rounds, we can reduce the secret
> to 128 bits.
>
> If we include the padding byte, we can do one of two things:
> 1) Make the secret 184 bits, to fill up the final partial word as
>    much as possible, or
> 2) Make the entropy 1 byte smaller and conceptually misalign the
>    secret. What we'd actually do is remove the last byte of
>    the secret and include it in the entropy words, but that's
>    just a rotation of the secret between storage and hashing.
>
> Also, I assume you'd like SipHash-2-4, since you want to rely
> on a security analysis.
I haven't looked, but I assume that the analysis at least thought about reduced rounds, so maybe other variants are okay.

>> The one thing I don't like is
>> that I don't see how to prove that you can't run it backwards if you
>> manage to acquire a memory dump. In fact, I think that there exist, at
>> least in theory, hash functions that are secure in the random oracle
>> model but that *can* be run backwards given the full state. From
>> memory, SHA-3 has exactly that property, and it would be a bit sad for
>> a CSPRNG to be reversible.
>
> Er... get_random_int() is specifically *not* designed to be resistant
> to state capture, and I didn't try. Remember, what it's used for
> is ASLR; what we're worried about is someone learning the layouts
> of still-running processes, and if you get a memory dump, you have
> the memory layout!

True, but it's called get_random_int(), and it seems like making it stronger, especially if the performance cost is low to zero, is a good thing.

> If you want anti-backtracking, though, it's easy to add. What we
> hash is:
>
> entropy_0 || secret || output_0 || entropy_1 || secret || output_1 || ...
>
> You mix the output word right back in to the (unfinalized) state after
> generating it. This is still equivalent to unmodified black-box SipHash;
> you're just using a (conceptually independent) SipHash invocation to
> produce some of its input.

Ah, cute. This could probably be sped up by doing something like:

entropy_0 || secret || output_0 ^ entropy_1 || secret || ...

It's a little weak because the output is only 64 bits, so you could plausibly backtrack it on a GPU or FPGA cluster or on an ASIC if the old entropy is guessable. I suspect there are sneaky ways around it like using output_n-1 ^ output_n-2 or similar. I'll sleep on it.

> The only remaining issues are:
> 1) How many rounds, and
> 2) May we use HalfSipHash?

I haven't looked closely enough to have a real opinion here.
I don't know what the security margin is believed to be.

> I'd *like* to persuade you that skipping the padding byte wouldn't
> invalidate any security proofs, because it's true and would simplify
> the code. But if you want 100% stock, I'm willing to cater to that.

I lean toward stock in the absence of a particularly good reason. At the very least I'd want to read that paper carefully.

> Ted, what do you think?

--
Andy Lutomirski
AMA Capital Management, LLC
Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5
On Thu, Dec 22, 2016 at 03:49:39AM +0100, Jason A. Donenfeld wrote:
> Funny -- while you guys were sending this back & forth, I was writing
> my reply to Andy which essentially arrives at the same conclusion.
> Given that we're all arriving at the same thing, and that Ted shot in
> this direction long before we all did, I'm leaning toward abandoning
> SipHash for the de-MD5-ification of get_random_int/long, and working
> on polishing Ted's idea into something shiny for this patchset.

Here are my numbers comparing siphash (using the first three patches of the v7 siphash patches) with my batched chacha20 implementation. The results are taken by running get_random_* 1 times, and then dividing the numbers by 1 to get the average number of cycles for the call. I compiled 32-bit and 64-bit kernels, and ran the results using kvm:

         siphash                           batched chacha20
         get_random_int  get_random_long   get_random_int  get_random_long
32-bit   270             278               114             146
64-bit   75              75                106             186

> I did have two objections to this. The first was that my SipHash
> construction is faster.

Well, it's faster on everything except 32-bit x86. :-P

> The second, and the more
> important one, was that batching entropy up like this means that 32
> calls will be really fast, and then the 33rd will be slow, since it
> has to do a whole ChaCha round, because get_random_bytes must be
> called to refill the batch.

... and this will take 2121 cycles on 64-bit x86, and 2315 cycles on a 32-bit x86. Which on a 2.3 GHz processor, is just under a microsecond. As far as being inconsistent on process startup, I very much doubt a microsecond is really going to be visible to the user. :-)

The bottom line is that I think we're really "pixel peeping" at this point --- which is what obsessed digital photographers will do when debating the quality of a Canon vs Nikon DSLR by blowing up a photo by a thousand times, and then trying to claim that this is visible to the human eye.
Or people who obsess over the frequency response curves of TH-X00 headphones with Mahogany vs Purpleheart wood, when it's likely that in a blind head-to-head comparison, most people wouldn't be able to tell the difference.

I think the main argument for using the batched getrandom approach is that it is, I would argue, simpler than introducing siphash into the picture. On 64-bit platforms it is faster and more consistent, so it's basically that versus the complexity of having to add siphash to the things that people have to analyze when considering random number security on Linux. But it's a close call either way, I think.

 - Ted

P.S. My benchmarking code:

diff --git a/drivers/char/random.c b/drivers/char/random.c
index a51f0ff43f00..41860864b775 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1682,6 +1682,55 @@ static int rand_initialize(void)
 }
 early_initcall(rand_initialize);
 
+static unsigned int get_random_int_new(void);
+static unsigned long get_random_long_new(void);
+
+#define NUM_CYCLES 1
+#define AVG(finish, start) \
+	((unsigned int)(finish - start + NUM_CYCLES/2) / NUM_CYCLES)
+
+static int rand_benchmark(void)
+{
+	cycles_t start, finish;
+	int i, out;
+
+	pr_crit("random benchmark!!\n");
+	start = get_cycles();
+	for (i = 0; i < NUM_CYCLES; i++) {
+		get_random_int();
+	}
+	finish = get_cycles();
+	pr_err("get_random_int # cycles: %u\n", AVG(finish, start));
+
+	start = get_cycles();
+	for (i = 0; i < NUM_CYCLES; i++) {
+		get_random_int_new();
+	}
+	finish = get_cycles();
+	pr_err("get_random_int_new (batched chacha20) # cycles: %u\n", AVG(finish, start));
+
+	start = get_cycles();
+	for (i = 0; i < NUM_CYCLES; i++) {
+		get_random_long();
+	}
+	finish = get_cycles();
+	pr_err("get_random_long # cycles: %u\n", AVG(finish, start));
+
+	start = get_cycles();
+	for (i = 0; i < NUM_CYCLES; i++) {
+		get_random_long_new();
+	}
+	finish = get_cycles();
+	pr_err("get_random_long_new (batched chacha20) # cycles: %u\n", AVG(finish, start));
+
+	start = get_cycles();
+	for (i = 0; i < NUM_CYCLES; i++) {
+		get_random_bytes(&out, sizeof(out));
+	}
+	finish = get_cycles();
+	pr_err("get_random_bytes # cycles: %u\n", AVG(finish, start));
+	return 0;
+}
+device_initcall(rand_benchmark);
+
 #ifdef CONFIG_BLOCK
 void rand_initialize_disk(struct gendisk *disk)
 {
@@ -2064,8 +2113,10 @@ unsigned int get_random_int(void)
 	unsigned int ret;
 	u64 *chaining;
 
+#if 0	// force slow path
 	if (arch_get_random_int(&ret))
 		return ret;
+#endif
 
 	chaining = &get_cpu_var(get_random_int_chaining);
 	ret = *chaining =
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
Andy Lutomirski wrote:
> I don't even think it needs that. This is just adding a
> non-destructive final operation, right?

It is, but the problem is that SipHash is intended for *small* inputs, so the standard implementations aren't broken into init/update/final functions. There's just one big function that keeps the state variables in registers and never stores them anywhere. If we *had* init/update/final functions, then it would be trivial.

> Just to clarify, if we replace SipHash with a black box, I think this
> effectively means, where "entropy" is random_get_entropy() || jiffies
> || current->pid:
>
> The first call returns H(random seed || entropy_0 || secret). The
> second call returns H(random seed || entropy_0 || secret || entropy_1
> || secret). Etc.

Basically, yes. I was skipping the padding byte and keying the finalization rounds on the grounds of "can't hurt and might help", but we could do it a more standard way.

> If not, then I have a fairly strong preference to keep whatever
> construction we come up with consistent with something that could
> actually happen with invocations of unmodified SipHash -- then all the
> security analysis on SipHash goes through.

Okay. I don't think it makes a difference, but it's not a *big* waste of time. If we have finalization rounds, we can reduce the secret to 128 bits.

If we include the padding byte, we can do one of two things:
1) Make the secret 184 bits, to fill up the final partial word as
   much as possible, or
2) Make the entropy 1 byte smaller and conceptually misalign the
   secret. What we'd actually do is remove the last byte of
   the secret and include it in the entropy words, but that's
   just a rotation of the secret between storage and hashing.

Also, I assume you'd like SipHash-2-4, since you want to rely on a security analysis.

(Regarding the padding byte, getting it right might be annoying to do exactly.
All of the security analysis depends *only* on its low 3 bits indicating how much of the final block is used. As it says in the SipHash paper, they included 8 bits just because it was easy. But if you want it exact, it's just one more byte of state.)

> The one thing I don't like is
> that I don't see how to prove that you can't run it backwards if you
> manage to acquire a memory dump. In fact, I think that there exist, at
> least in theory, hash functions that are secure in the random oracle
> model but that *can* be run backwards given the full state. From
> memory, SHA-3 has exactly that property, and it would be a bit sad for
> a CSPRNG to be reversible.

Er... get_random_int() is specifically *not* designed to be resistant to state capture, and I didn't try. Remember, what it's used for is ASLR; what we're worried about is someone learning the layouts of still-running processes, and if you get a memory dump, you have the memory layout!

If you want anti-backtracking, though, it's easy to add. What we hash is:

entropy_0 || secret || output_0 || entropy_1 || secret || output_1 || ...

You mix the output word right back in to the (unfinalized) state after generating it. This is still equivalent to unmodified black-box SipHash; you're just using a (conceptually independent) SipHash invocation to produce some of its input. Each output is produced by copying the state, padding & finalizing after the secret.

In fact, to make our lives easier, let's define the secret to end with a counter byte that happens to be equal to the padding byte. The input stream will be:

Previous output:  8 (or 4 for HalfSipHash) bytes
Entropy:         15 bytes (8 bytes timer, 4 bytes jiffies, 3 bytes pid)
Secret:          16 bytes
Counter:          1 byte
...repeat...

> We could also periodically mix in a big (128-bit?) chunk of fresh
> urandom output to keep the bad guys guessing.

Simpler and faster to just update the global master secret. The state is per-CPU, so mixing in has to be repeated per CPU.
With these changes, I'm satisfied that it's secure, cheap, has a sufficiently wide state size, *and* all standard SipHash analysis applies. The only remaining issues are:
1) How many rounds, and
2) May we use HalfSipHash?

I'd *like* to persuade you that skipping the padding byte wouldn't invalidate any security proofs, because it's true and would simplify the code. But if you want 100% stock, I'm willing to cater to that.

Ted, what do you think?
Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
Hi George,

On Thu, Dec 22, 2016 at 4:55 AM, George Spelvin wrote:
> Do we have to go through this? No, the benchmark was *not* bogus.

> Then I replaced the kernel #includes with the necessary typedefs
> and #defines to make it compile in user-space.
> * I didn't iterate 100K times, I timed the functions *once*.
> * I saved the times in a buffer and printed them all at the end
>   so printf() wouldn't pollute the caches.
> * Before every even-numbered iteration, I flushed the I-cache
>   of everything from _init to _fini (i.e. all the non-library code).
>   This cold-cache case is what is going to happen in the kernel.

Wow! Great. Thanks for the pointers on the right way to do this. Very helpful, and enlightening results indeed. Think you could send me the whole .c of what you finally came up with? I'd like to run this on some other architectures; I've got a few odd boxes lying around here.

> The P4 results were:
> SipHash actually wins slightly in the cold-cache case, because
> it iterates more. In the hot-cache case, it loses

> Core 2 duo:
> Pretty much a tie, honestly.

> Ivy Bridge:
> Modern processors *hate* cold caches. But notice how md5 is *faster*
> than SipHash on hot-cache IPv6.

> Ivy Bridge, -m64:
> Of course, when you compile -m64, SipHash is unbeatable.

Okay, so I think these results are consistent with some of the assessments from before -- that SipHash is really just fine as a replacement for MD5. Not great on older 32-bit x86, but not too horrible, and the performance improvements on every other architecture and the security improvements everywhere are a net good.

> Here's the modified benchmark() code. The entire package is
> a bit voluminous for the mailing list, but anyone is welcome to it.

Please do send! I'm sure I'll learn from reading it. Thanks again for doing the hard work of putting something proper together.
Thanks,
Jason
Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
> Plus the benchmark was bogus anyway, and when I built a more specific
> harness -- actually comparing the TCP sequence number functions --
> SipHash was faster than MD5, even on register starved x86. So I think
> we're fine and this chapter of the discussion can come to a close, in
> order to move on to more interesting things.

Do we have to go through this? No, the benchmark was *not* bogus. Here are my results from *your* benchmark.

I can't reboot some of my test machines, so I took net/core/secure_seq.c, lib/siphash.c, lib/md5.c and include/linux/siphash.h straight out of your test tree.

Then I replaced the kernel #includes with the necessary typedefs and #defines to make it compile in user-space. (Voluminous but straightforward.) E.g.

#define __aligned(x) __attribute__((__aligned__(x)))
#define cacheline_aligned __aligned(64)
#define CONFIG_INET 1
#define IS_ENABLED(x) 1
#define ktime_get_real_ns() 0
#define sysctl_tcp_timestamps 0
... etc.

Then I modified your benchmark code into the appended code. The differences are:
* I didn't iterate 100K times, I timed the functions *once*.
* I saved the times in a buffer and printed them all at the end
  so printf() wouldn't pollute the caches.
* Before every even-numbered iteration, I flushed the I-cache
  of everything from _init to _fini (i.e. all the non-library code).
  This cold-cache case is what is going to happen in the kernel.

In the results below, note that I did *not* re-flush between phases of the test. The effects of caching are clearly apparent in the tcpv4 results, where the tcpv6 code loaded the cache. You can also see that the SipHash code benefits more from caching when entered with a cold cache, as it iterates over the input words, while the MD5 code is one big unrolled blob. Order of computation is down the columns first, across second.
The P4 results were:

tcpv6 md5 cold:     4084 3488 3584 3584 3568
tcpv4 md5 cold:     1052  996  996 1060  996
tcpv6 siphash cold: 4080 3296 3312 3296 3312
tcpv4 siphash cold: 2968 2748 2972 2716 2716
tcpv6 md5 hot:       900  712  712  712  712
tcpv4 md5 hot:       632  672  672  672  672
tcpv6 siphash hot:  2484 2292 2340 2340 2340
tcpv4 siphash hot:  1660 1560 1564 2340 1564

SipHash actually wins slightly in the cold-cache case, because it iterates more. In the hot-cache case, it loses horribly.

Core 2 duo:

tcpv6 md5 cold:     3396 2868 2964 3012 2832
tcpv4 md5 cold:     1368 1044 1320 1332 1308
tcpv6 siphash cold: 2940 2952 2916 2448 2604
tcpv4 siphash cold: 3192 2988 3576 3504 3624
tcpv6 md5 hot:      1116 1032  996 1008 1008
tcpv4 md5 hot:       936  936  936  936  936
tcpv6 siphash hot:  1200 1236 1236 1188 1188
tcpv4 siphash hot:   936  804  804  804  804

Pretty much a tie, honestly.

Ivy Bridge:

tcpv6 md5 cold:     6086 6136 6962 6358 6060
tcpv4 md5 cold:      816  732 1046 1054 1012
tcpv6 siphash cold: 3756 1886 2152 2390 2566
tcpv4 siphash cold: 3264 2108 3026 3120 3526
tcpv6 md5 hot:      1062  808  824  824  832
tcpv4 md5 hot:       730  730  740  748  748
tcpv6 siphash hot:   960  952  936 1112  926
tcpv4 siphash hot:   638  544  562  552  560

Modern processors *hate* cold caches. But notice how md5 is *faster* than SipHash on hot-cache IPv6.

Ivy Bridge, -m64:

tcpv6 md5 cold:     4680 3672 3956 3616 3525
tcpv4 md5 cold:     1066 1416 1179 1179 1134
tcpv6 siphash cold:  940 1258 1995 1609 2255
tcpv4 siphash cold: 1440 1269 1292 1870 1621
tcpv6 md5 hot:      1372 1122 1088 1088
tcpv4 md5 hot:       997  997  997  997  998
tcpv6 siphash hot:   340  340  340  352  340
tcpv4 siphash hot:   227  238  238  238  238

Of course, when you compile -m64, SipHash is unbeatable.

Here's the modified benchmark() code. The entire package is a bit voluminous for the mailing list, but anyone is welcome to it.
static void clflush(void)
{
	extern char const _init, _fini;
	char const *p = &_init;

	while (p < &_fini) {
		asm("clflush %0" : : "m" (*p));
		p += 64;
	}
}

typedef uint32_t cycles_t;

static cycles_t get_cycles(void)
{
	uint32_t eax, edx;

	asm volatile("rdtsc" : "=a" (eax), "=d" (edx));
	return eax;
}

static int benchmark(void)
{
	cycles_t start, finish;
	int i;
	u32 seq_number = 0;
	__be32 saddr6[4] = { 1, 4, 182, 393 },
	       daddr6[4] = { 9192, 18288, 222, 0xff10 };
	__be32 saddr4 = 2, daddr4 = 182112;
	__be16 sport = 22, dport = 41992;
Re: [PATCH v7 3/6] random: use SipHash in place of MD5
On Thu, Dec 22, 2016 at 3:49 AM, Jason A. Donenfeld wrote:
> I did have two objections to this. The first was that my SipHash
> construction is faster. But in any case, they're both faster than the
> current MD5, so it's just extra rice. The second, and the more
> important one, was that batching entropy up like this means that 32
> calls will be really fast, and then the 33rd will be slow, since it
> has to do a whole ChaCha round, because get_random_bytes must be
> called to refill the batch. Since get_random_long is called for every
> process startup, I didn't really like there being inconsistent
> performance on process startup. And I'm pretty sure that one whole
> ChaCha block is slower than computing MD5, even though it lasts 32
> times as long, though I need to measure this. But maybe that's dumb in
> the end? Are these concerns that should point us toward the
> determinism (and speed) of SipHash? Are these concerns that don't
> matter and so we should roll with the simplicity of reusing ChaCha?

I ran some measurements in order to quantify what I'm talking about. Repeatedly running md5_transform is about 2.3 times faster than repeatedly running extract_crng. What does this mean?

One call to extract_crng gives us 32 times as many longs as one call to md5_transform. This means that spread over 32 process creations, chacha will be 13.9 times faster. However, every 32nd process will take 2.3 times as long to generate its ASLR value as it would with the old md5_transform code.

Personally, I don't think that 2.3 is a big deal. And I really like how much this simplifies the analysis. But if it's a big deal to you, then we can continue to discuss my SipHash construction, which gives faster and more consistent performance, at the cost of a more complicated and probably less impressive security analysis.
Jason
Re: [PATCH v7 3/6] random: use SipHash in place of MD5
Hi Andy & Hannes,

On Thu, Dec 22, 2016 at 3:07 AM, Hannes Frederic Sowa wrote:
> I wonder if Ted's proposal was analyzed further in terms of performance
> if get_random_int should provide cprng alike properties?
>
> For reference: https://lkml.org/lkml/2016/12/14/351
>
> The proposal made sense to me and would completely solve the above
> mentioned problem at the cost of repeatedly reseeding from the crng.

On Thu, Dec 22, 2016 at 3:09 AM, Andy Lutomirski wrote:
> Unless I've misunderstood it, Ted's proposal causes get_random_int()
> to return bytes straight from urandom (effectively), which should make
> it very strong. And if urandom is competitively fast now, I don't see
> the problem. ChaCha20 is designed for speed, after all.

Funny -- while you guys were sending this back & forth, I was writing my reply to Andy which essentially arrives at the same conclusion. Given that we're all arriving at the same thing, and that Ted shot in this direction long before we all did, I'm leaning toward abandoning SipHash for the de-MD5-ification of get_random_int/long, and working on polishing Ted's idea into something shiny for this patchset.

I did have two objections to this. The first was that my SipHash construction is faster. But in any case, they're both faster than the current MD5, so it's just extra rice. The second, and the more important one, was that batching entropy up like this means that 32 calls will be really fast, and then the 33rd will be slow, since it has to do a whole ChaCha round, because get_random_bytes must be called to refill the batch. Since get_random_long is called for every process startup, I didn't really like there being inconsistent performance on process startup. And I'm pretty sure that one whole ChaCha block is slower than computing MD5, even though it lasts 32 times as long, though I need to measure this.

But maybe that's dumb in the end? Are these concerns that should point us toward the determinism (and speed) of SipHash?
Are these concerns that don't matter and so we should roll with the simplicity of reusing ChaCha?

Jason
Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
> On Wed, Dec 21, 2016 at 5:13 PM, George Spelvin wrote:
>> After some thinking, I still like the "state-preserving" construct
>> that's equivalent to the current MD5 code. Yes, we could just do
>> siphash(current_cpu || per_cpu_counter, global_key), but it's nice to
>> preserve a bit more.
>>
>> It requires library support from the SipHash code to return the full
>> SipHash state, but I hope that's a fair thing to ask for.

This is not a good idea. If I understand correctly, the idea here is to just keep around SipHash's internal state variables, and chain them over to the next call, sort of like how md5_transform with the current code works on the same scratch space. There has been no security analysis in the literature on this use of the primitive, and I have no confidence that this is a secure use of the function. Unless somebody can point me toward a paper I missed or a comment from a real cryptographer about the specifics of SipHash, I think I'm right to admonish against this dangerous road.

Let's talk about constructions. And let's only decide on a construction that we're actually equipped to analyze. Let's definitely not talk about making our own primitives, or retrofitting a nice primitive's internals into our own Frankenstein.

Alternatively, if I'm wrong, please send me an eprint/arXiv link to a paper that discusses this use of SipHash.
Re: [PATCH v7 3/6] random: use SipHash in place of MD5
Hi Andy,

On Thu, Dec 22, 2016 at 12:42 AM, Andy Lutomirski wrote:
> So this is probably good enough, and making it better is hard. Changing it
> to:
>
> u64 entropy = (u64)random_get_entropy() + current->pid;
> result = siphash(..., entropy, ...);
> secret->chaining += result + entropy;
>
> would reduce this problem by forcing an attacker to brute-force the
> entropy on each iteration, which is probably an improvement.

Ahh, so that's the reasoning behind a similar suggestion of yours in a previous email. Makes sense to me. I'll include this in the next merge if we don't come up with a different idea before then. Your reasoning seems good for it.

Part of what makes this process a bit goofy is that it's not altogether clear what the design goals are. Right now we're going for "not worse than before", which we've nearly achieved. How good of an RNG do we want? I'm willing to examine and analyze the security and performance of all constructions we can come up with. One thing I don't want to do, however, is start tweaking the primitives themselves in ways not endorsed by the designers. So, I believe that precludes things like carrying over SipHash's internal state (like what was done with MD5), because there hasn't been a formal security analysis of this like there has been with other uses of SipHash. I also don't want to change any internals of how SipHash actually works. I mention that because of some of the suggestions on other threads, which make me rather uneasy.

So with that said, while writing this reply to you, I was simultaneously reading some other crypto code and was reminded that there's a variant of SipHash which outputs an additional 64 bits; it's part of the siphash reference code, which they call the "128-bit mode". It has the benefit that we can return 64 bits to the caller and save 64 bits for the chaining key.
That way there's no correlation between the returned secret and the chaining key, which I think would completely alleviate all of your concerns, and simplify the analysis a bit. Here's what it looks like:

https://git.zx2c4.com/linux-dev/commit/?h=siphash&id=46fbe5b408e66b2d16b4447860f8083480e1c08d

The downside is that it takes 4 extra Sip rounds. This puts the performance still better than MD5, though, and likely still better than the other batched entropy solution. We could optimize this, I suppose, by giving it only two parameters -- chaining, jiffies+entropy+pid -- instead of the current three -- chaining, jiffies, entropy+pid -- which would then shave off 2 Sip rounds. But I liked the idea of having a bit more spread in the entropy input field.

Anyway, with this in mind, we now have three possibilities:

1. result = siphash(chaining, entropy, key); chaining += result + entropy
2. result = siphash_extra_output(chaining, entropy, key, );
3. Ted's batched entropy idea using chacha20

The more I think about this, the more I suspect that we should just use chacha20. It will still be faster than MD5. I don't like the non-determinism of it (some processes will start slower than others, if the batched entropy has run out and ASLR demands more), but I guess I can live with that. But, most importantly, it greatly simplifies both the security analysis and what we can promise to callers about the function. Right now in the comment documentation, we're coy with callers about the security of the RNG. If we moved to a known construction like chacha20/get_random_bytes_batched, then we could just be straight up with a promise that the numbers it returns are high quality.

Thoughts on 2 and 3, and on 1 vs 2 vs 3?

Jason
Re: [PATCH v7 3/6] random: use SipHash in place of MD5
On Wed, Dec 21, 2016 at 6:07 PM, Hannes Frederic Sowa wrote:
> On 22.12.2016 00:42, Andy Lutomirski wrote:
>> On Wed, Dec 21, 2016 at 3:02 PM, Jason A. Donenfeld wrote:
>>> unsigned int get_random_int(void)
>>> {
>>> -	__u32 *hash;
>>> -	unsigned int ret;
>>> -
>>> -	if (arch_get_random_int(&ret))
>>> -		return ret;
>>> -
>>> -	hash = get_cpu_var(get_random_int_hash);
>>> -
>>> -	hash[0] += current->pid + jiffies + random_get_entropy();
>>> -	md5_transform(hash, random_int_secret);
>>> -	ret = hash[0];
>>> -	put_cpu_var(get_random_int_hash);
>>> -
>>> -	return ret;
>>> +	unsigned int arch_result;
>>> +	u64 result;
>>> +	struct random_int_secret *secret;
>>> +
>>> +	if (arch_get_random_int(&arch_result))
>>> +		return arch_result;
>>> +
>>> +	secret = get_random_int_secret();
>>> +	result = siphash_3u64(secret->chaining, jiffies,
>>> +			      (u64)random_get_entropy() + current->pid,
>>> +			      secret->secret);
>>> +	secret->chaining += result;
>>> +	put_cpu_var(secret);
>>> +	return result;
>>> }
>>> EXPORT_SYMBOL(get_random_int);
>>
>> Hmm. I haven't tried to prove anything for real. But here goes (in
>> the random oracle model):
>>
>> Suppose I'm an attacker and I don't know the secret or the chaining
>> value. Then, regardless of what the entropy is, I can't predict the
>> numbers.
>>
>> Now suppose I do know the secret and the chaining value due to some
>> leak. If I want to deduce prior outputs, I think I'm stuck: I'd need
>> to find a value "result" such that prev_chaining + result = chaining
>> and result = H(prev_chaining, ..., secret). I don't think this can
>> be done efficiently in the random oracle model regardless of what the
>> "..." is.
>>
>> But, if I know the secret and chaining value, I can predict the next
>> output assuming I can guess the entropy. What's worse is that, even
>> if I can't guess the entropy, if I *observe* the next output then I
>> can calculate the next chaining value.
>>
>> So this is probably good enough, and making it better is hard.
Changing it >> to: >> >> u64 entropy = (u64)random_get_entropy() + current->pid; >> result = siphash(..., entropy, ...); >> secret->chaining += result + entropy; >> >> would reduce this problem by forcing an attacker to brute-force the >> entropy on each iteration, which is probably an improvement. >> >> To fully fix it, something like "catastrophic reseeding" would be >> needed, but that's hard to get right. > > I wonder if Ted's proposal was analyzed further in terms of performance > if get_random_int should provide cprng alike properties? > > For reference: https://lkml.org/lkml/2016/12/14/351 > > The proposal made sense to me and would completely solve the above > mentioned problem on the cost of repeatedly reseeding from the crng. > Unless I've misunderstood it, Ted's proposal causes get_random_int() to return bytes straight from urandom (effectively), which should make it very strong. And if urandom is competitively fast now, I don't see the problem. ChaCha20 is designed for speed, after all. -- To unsubscribe from this list: send the line "unsubscribe linux-crypto" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
On Wed, Dec 21, 2016 at 5:13 PM, George Spelvin wrote: > As a separate message, to disentangle the threads, I'd like to > talk about get_random_long(). > > After some thinking, I still like the "state-preserving" construct > that's equivalent to the current MD5 code. Yes, we could just do > siphash(current_cpu || per_cpu_counter, global_key), but it's nice to > preserve a bit more. > > It requires library support from the SipHash code to return the full > SipHash state, but I hope that's a fair thing to ask for. I don't even think it needs that. This is just adding a non-destructive final operation, right? > > Here's my current straw man design for comment. It's very similar to > the current MD5-based design, but feeds all the seed material in the > "correct" way, as opposed to XORing directly into the MD5 state. > > * Each CPU has a (Half)SipHash state vector, > "unsigned long get_random_int_hash[4]". Unlike the current > MD5 code, we take care to initialize it to an asymmetric state. > > * There's a global 256-bit random_int_secret (which we could > reseed periodically). > > To generate a random number: > * If get_random_int_hash is all-zero, seed it with a fresh half-sized > SipHash key and the appropriate XOR constants. > * Generate three words of random_get_entropy(), jiffies, and current->pid. > (This is arbitrary seed material, copied from the current code.) > * Crank through that with (Half)SipHash-1-0. > * Crank through the random_int_secret with (Half)SipHash-1-0. > * Return v1 ^ v3. Just to clarify, if we replace SipHash with a black box, I think this effectively means, where "entropy" is random_get_entropy() || jiffies || current->pid: The first call returns H(random seed || entropy_0 || secret). The second call returns H(random seed || entropy_0 || secret || entropy_1 || secret). Etc. 
If not, then I have a fairly strong preference to keep whatever construction we come up with consistent with something that could actually happen with invocations of unmodified SipHash -- then all the security analysis on SipHash goes through. Anyway, I have mixed thoughts about the construction. It manages to have a wide state at essentially no cost, which buys us quite a bit of work factor to break it. Even with full knowledge of the state, an output doesn't reveal the entropy except to the extent that it can be brute-forced (this is just whatever the appropriate extended version of first preimage resistance gives us). The one thing I don't like is that I don't see how to prove that you can't run it backwards if you manage to acquire a memory dump. In fact, I think that there exist, at least in theory, hash functions that are secure in the random oracle model but that *can* be run backwards given the full state. From memory, SHA-3 has exactly that property, and it would be a bit sad for a CSPRNG to be reversible. We could also periodically mix in a big (128-bit?) chunk of fresh urandom output to keep the bad guys guessing. (P.S. This kind of resembles the duplex sponge construction. If hardware SHA-3 ever shows up, a duplex sponge RNG might be nice indeed.)
Re: [PATCH v7 3/6] random: use SipHash in place of MD5
On 22.12.2016 00:42, Andy Lutomirski wrote: > On Wed, Dec 21, 2016 at 3:02 PM, Jason A. Donenfeld wrote: >> unsigned int get_random_int(void) >> { >> - __u32 *hash; >> - unsigned int ret; >> - >> - if (arch_get_random_int(&ret)) >> - return ret; >> - >> - hash = get_cpu_var(get_random_int_hash); >> - >> - hash[0] += current->pid + jiffies + random_get_entropy(); >> - md5_transform(hash, random_int_secret); >> - ret = hash[0]; >> - put_cpu_var(get_random_int_hash); >> - >> - return ret; >> + unsigned int arch_result; >> + u64 result; >> + struct random_int_secret *secret; >> + >> + if (arch_get_random_int(&arch_result)) >> + return arch_result; >> + >> + secret = get_random_int_secret(); >> + result = siphash_3u64(secret->chaining, jiffies, >> + (u64)random_get_entropy() + current->pid, >> + secret->secret); >> + secret->chaining += result; >> + put_cpu_var(secret); >> + return result; >> } >> EXPORT_SYMBOL(get_random_int); > > Hmm. I haven't tried to prove anything for real. But here goes (in > the random oracle model): > > Suppose I'm an attacker and I don't know the secret or the chaining > value. Then, regardless of what the entropy is, I can't predict the > numbers. > > Now suppose I do know the secret and the chaining value due to some > leak. If I want to deduce prior outputs, I think I'm stuck: I'd need > to find a value "result" such that prev_chaining + result = chaining > and result = H(prev_chaining, ..., secret);. I don't think this can > be done efficiently in the random oracle model regardless of what the > "..." is. > > But, if I know the secret and chaining value, I can predict the next > output assuming I can guess the entropy. What's worse is that, even > if I can't guess the entropy, if I *observe* the next output then I > can calculate the next chaining value. > > So this is probably good enough, and making it better is hard. 
Changing it > to: > > u64 entropy = (u64)random_get_entropy() + current->pid; > result = siphash(..., entropy, ...); > secret->chaining += result + entropy; > > would reduce this problem by forcing an attacker to brute-force the > entropy on each iteration, which is probably an improvement. > > To fully fix it, something like "catastrophic reseeding" would be > needed, but that's hard to get right. I wonder if Ted's proposal was analyzed further in terms of performance if get_random_int should provide CPRNG-like properties? For reference: https://lkml.org/lkml/2016/12/14/351 The proposal made sense to me and would completely solve the above mentioned problem at the cost of repeatedly reseeding from the crng. Bye, Hannes
Re: HalfSipHash Acceptable Usage
On Wed, Dec 21, 2016 at 9:25 AM, Linus Torvalds wrote: > On Wed, Dec 21, 2016 at 7:55 AM, George Spelvin > wrote: >> >> How much does kernel_fpu_begin()/kernel_fpu_end() cost? > > It's now better than it used to be, but it's absolutely disastrous > still. We're talking easily many hundreds of cycles. Under some loads, > thousands. > > And I warn you already: it will _benchmark_ a hell of a lot better > than it will work in reality. In benchmarks, you'll hit all the > optimizations ("oh, I've already saved away all the FP registers, no > need to do it again"). > > In contrast, in reality, especially with things like "do it once or > twice per incoming packet", you'll easily hit the absolute worst > cases, where not only does it take a few hundred cycles to save the FP > state, you'll then return to user space in between packets, which > triggers the slow-path return code and reloads the FP state, which is > another few hundred cycles plus. Hah, you're thinking that the x86 code works the way that Rik and I want it to work, and you just made my day. :) What actually happens is that the state is saved in kernel_fpu_begin() and restored in kernel_fpu_end(), and it'll take a few hundred cycles best case. If you do it a bunch of times in a loop, you *might* trigger a CPU optimization that notices that the state being saved is the same state that was just restored, but you're still going to pay the full restore cost each round trip no matter what. The code is much clearer in 4.10 kernels now that I deleted the unused "lazy" branches. > > Similarly, in benchmarks you'll hit the "modern CPU's power on the AVX > unit and keep it powered up for a while afterwards", while in real > life you would quite easily hit the "oh, AVX is powered down because > we were idle, now it powers up at half speed which is another latency > hit _and_ the AVX unit won't run full out anyway". 
I *think* that was mostly fixed in Broadwell or thereabouts (in terms of latency -- throughput and power consumption still suffers).
Re: [PATCH v7 1/6] siphash: add cryptographically secure PRF
On Thu, Dec 22, 2016 at 2:40 AM, Stephen Hemminger wrote: > The networking tree (net-next) which is where you are submitting to is > technically > closed right now. That's okay. At some point in the future it will be open. By then v83 of this patch set will be shiny and done, just waiting for the merge window to open. There's a lot to discuss with this, so getting the feedback early is beneficial. Jason
Re: [PATCH v7 1/6] siphash: add cryptographically secure PRF
On Thu, 22 Dec 2016 00:02:11 +0100 "Jason A. Donenfeld" wrote: > SipHash is a 64-bit keyed hash function that is actually a > cryptographically secure PRF, like HMAC. Except SipHash is super fast, > and is meant to be used as a hashtable keyed lookup function, or as a > general PRF for short input use cases, such as sequence numbers or RNG > chaining. > > For the first usage: > > There are a variety of attacks known as "hashtable poisoning" in which an > attacker forms some data such that the hash of that data will be the > same, and then proceeds to fill up all entries of a hashbucket. This is > a realistic and well-known denial-of-service vector. Currently > hashtables use jhash, which is fast but not secure, and some kind of > rotating key scheme (or none at all, which isn't good). SipHash is meant > as a replacement for jhash in these cases. > > There are a number of places in the kernel that are vulnerable to > hashtable poisoning attacks, either via userspace vectors or network > vectors, and there's not a reliable mechanism inside the kernel at the > moment to fix it. The first step toward fixing these issues is actually > getting a secure primitive into the kernel for developers to use. Then > we can, bit by bit, port things over to it as deemed appropriate. > > While SipHash is extremely fast for a cryptographically secure function, > it is likely a bit slower than the insecure jhash, and so replacements > will be evaluated on a case-by-case basis based on whether or not the > difference in speed is negligible and whether or not the current jhash usage > poses a real security risk. > > For the second usage: > > A few places in the kernel are using MD5 or SHA1 for creating secure > sequence numbers, syn cookies, port numbers, or fast random numbers. > SipHash is a faster, more fitting, and more secure replacement for MD5 > in those situations. 
Replacing MD5 and SHA1 with SipHash for these uses is > obvious and straightforward, and so is submitted along with this patch > series. There shouldn't be much of a debate over its efficacy. > > Dozens of languages are already using this internally for their hash > tables and PRFs. Some of the BSDs already use this in their kernels. > SipHash is a widely known high-speed solution to a widely known set of > problems, and it's time we catch up. > > Signed-off-by: Jason A. Donenfeld > Cc: Jean-Philippe Aumasson > Cc: Linus Torvalds > Cc: Eric Biggers > Cc: David Laight > Cc: Eric Dumazet The networking tree (net-next) which is where you are submitting to is technically closed right now.
Re: HalfSipHash Acceptable Usage
As a separate message, to disentangle the threads, I'd like to talk about get_random_long(). After some thinking, I still like the "state-preserving" construct that's equivalent to the current MD5 code. Yes, we could just do siphash(current_cpu || per_cpu_counter, global_key), but it's nice to preserve a bit more. It requires library support from the SipHash code to return the full SipHash state, but I hope that's a fair thing to ask for. Here's my current straw man design for comment. It's very similar to the current MD5-based design, but feeds all the seed material in the "correct" way, as opposed to XORing directly into the MD5 state. * Each CPU has a (Half)SipHash state vector, "unsigned long get_random_int_hash[4]". Unlike the current MD5 code, we take care to initialize it to an asymmetric state. * There's a global 256-bit random_int_secret (which we could reseed periodically). To generate a random number: * If get_random_int_hash is all-zero, seed it with a fresh half-sized SipHash key and the appropriate XOR constants. * Generate three words of random_get_entropy(), jiffies, and current->pid. (This is arbitrary seed material, copied from the current code.) * Crank through that with (Half)SipHash-1-0. * Crank through the random_int_secret with (Half)SipHash-1-0. * Return v1 ^ v3. Here are the reasons: * The first step is just paranoia, but SipHash's security promise depends on starting with an asymmetric state, we want unique per-CPU states, and it's a one-time cost. * When the input words are themselves secret, there's no security advantage, and almost no speed advantage, to doing two rounds for one input word versus two words with one round each. Thus, SipHash-1. * The above is not exactly true, due to the before+after XOR pattern that SipHash uses, but I think it's true anyway. * Likewise, there's no benefit to unkeyed finalization rounds over keyed ones. That's why I just enlarged the global secret. 
* The per-call seed material is hashed first on general principles, because that's the novel part that might have fresh entropy. * To the extent the initial state is secret, the rounds processing the global secret are 4 finalization rounds for the initial state and the per-call entropy. * The final word(s) of the global secret might be vulnerable to analysis, due to incomplete mixing, but since the global secret is always hashed in the same order, and larger than the desired security level, the initial words should be secure. * By carrying forward the full internal state, we ensure that repeated calls return different results, and to the extent that the per-call seed material has entropy, it's preserved. * The final return is all that's needed, since the last steps in the SipRound are "v1 ^= v2" and "v3 ^= v0". It's no security loss, and a very minor speedup. * Also, this avoids directly "exposing" the final XOR with the last word of the global secret (which is made to v0). If I'm allowed to use full SipHash, some shortcuts can be taken, but I believe the above would be secure with HalfSipHash. If additional performance is required, I'd consider shrinking the global secret to 192 bits on 32-bit machines, but I want more than 128 bits of key material, and enough rounds to be equivalent to 4 finalization rounds.
Re: [PATCH v7 6/6] siphash: implement HalfSipHash1-3 for hash tables
> 64-bit x86_64: > [0.509409] test_siphash: SipHash2-4 cycles: 4049181 > [0.510650] test_siphash: SipHash1-3 cycles: 2512884 > [0.512205] test_siphash: HalfSipHash1-3 cycles: 3429920 > [0.512904] test_siphash: JenkinsHash cycles: 978267 I'm not sure what these numbers mean. Surely a single siphash2-4 does not take 4+ million cycles? If you run them in a loop, please divide by the number of iterations. But generally, running small code in a loop is often an unrealistic benchmark strategy because it hides cache misses, primes predictors, changes frequencies and changes memory costs, but also can overload pipelines and oversubscribe resources. [see also page 46+ in http://halobates.de/applicative-mental-models.pdf] So the numbers you get there are at least somewhat dubious. It would be good to have at least some test which is not just a tiny micro benchmark to compare before making conclusions. -Andi
Re: HalfSipHash Acceptable Usage
Theodore Ts'o wrote: > On Wed, Dec 21, 2016 at 01:37:51PM -0500, George Spelvin wrote: >> SipHash annihilates the competition on 64-bit superscalar hardware. >> SipHash dominates the field on 64-bit in-order hardware. >> SipHash wins easily on 32-bit hardware *with enough registers*. >> On register-starved 32-bit machines, it really struggles. > And "with enough registers" includes ARM and MIPS, right? Yes. As a matter of fact, 32-bit ARM does particularly well on 64-bit SipHash due to its shift+op instructions. There is a noticeable performance drop, but nothing catastrophic. The main thing I've been worried about is all the flow tracking and NAT done by small home routers, and that's addressed by using HalfSipHash for the hash tables. They don't *initiate* a lot of TCP sessions. > So the only > real problem is 32-bit x86, and you're right, at that point, only > people who might care are people who are using a space-radiation > hardened 386 --- and they're not likely to be doing high throughput > TCP connections. :-) The only requirement on performance is "don't make DaveM angry." :-) I was just trying to answer the question of why we *worried* about the performance, not specifically argue that we *should* use HalfSipHash.
Re: [PATCH v7 3/6] random: use SipHash in place of MD5
On Wed, Dec 21, 2016 at 3:02 PM, Jason A. Donenfeld wrote: > unsigned int get_random_int(void) > { > - __u32 *hash; > - unsigned int ret; > - > - if (arch_get_random_int(&ret)) > - return ret; > - > - hash = get_cpu_var(get_random_int_hash); > - > - hash[0] += current->pid + jiffies + random_get_entropy(); > - md5_transform(hash, random_int_secret); > - ret = hash[0]; > - put_cpu_var(get_random_int_hash); > - > - return ret; > + unsigned int arch_result; > + u64 result; > + struct random_int_secret *secret; > + > + if (arch_get_random_int(&arch_result)) > + return arch_result; > + > + secret = get_random_int_secret(); > + result = siphash_3u64(secret->chaining, jiffies, > + (u64)random_get_entropy() + current->pid, > + secret->secret); > + secret->chaining += result; > + put_cpu_var(secret); > + return result; > } > EXPORT_SYMBOL(get_random_int); Hmm. I haven't tried to prove anything for real. But here goes (in the random oracle model): Suppose I'm an attacker and I don't know the secret or the chaining value. Then, regardless of what the entropy is, I can't predict the numbers. Now suppose I do know the secret and the chaining value due to some leak. If I want to deduce prior outputs, I think I'm stuck: I'd need to find a value "result" such that prev_chaining + result = chaining and result = H(prev_chaining, ..., secret);. I don't think this can be done efficiently in the random oracle model regardless of what the "..." is. But, if I know the secret and chaining value, I can predict the next output assuming I can guess the entropy. What's worse is that, even if I can't guess the entropy, if I *observe* the next output then I can calculate the next chaining value. So this is probably good enough, and making it better is hard. 
Changing it to: u64 entropy = (u64)random_get_entropy() + current->pid; result = siphash(..., entropy, ...); secret->chaining += result + entropy; would reduce this problem by forcing an attacker to brute-force the entropy on each iteration, which is probably an improvement. To fully fix it, something like "catastrophic reseeding" would be needed, but that's hard to get right. (An aside: on x86 at least, using two percpu variables is faster because direct percpu access is essentially free, whereas getting the address of a percpu variable is not free.)
Re: [PATCH v7 3/6] random: use SipHash in place of MD5
Hi Ted, On Thu, Dec 22, 2016 at 12:02 AM, Jason A. Donenfeld wrote: > This duplicates the current algorithm for get_random_int/long I should have mentioned this directly in the commit message, which I forgot to update: this v7 adds the time-based key rotation, which, while not strictly necessary for ensuring the security of the RNG, might help alleviate some concerns, as we talked about. Performance is quite good on both 32-bit and 64-bit -- better than MD5 in both cases. If you like this, terrific. If not, I'm happy to take this in whatever direction you prefer, and implement whatever construction you think best. There's been a lot of noise on this list about it; we can continue to discuss more, or you can just tell me whatever you want to do, and I'll implement it and that'll be the end of it. As you said, we can always get something decent now and improve it later. Alternatively, if you've decided in the end you prefer your batched entropy approach using chacha, I'm happy to implement a polished version of that here in this patch series (so that we can keep the `rm lib/md5.c` commit.) Just let me know how you'd like to proceed. Thanks, Jason
[PATCH v7 6/6] siphash: implement HalfSipHash1-3 for hash tables
HalfSipHash, or hsiphash, is a shortened version of SipHash, which generates 32-bit outputs using a weaker 64-bit key. It has *much* lower security margins, and shouldn't be used for anything too sensitive, but it could be used as a hashtable key function replacement, if the output is never exposed, and if the security requirement is not too high. The goal is to make this something that performance-critical jhash users would be willing to use. On 64-bit machines, HalfSipHash1-3 is slower than SipHash1-3, so we alias hsiphash() to SipHash1-3 on those systems. 64-bit x86_64: [0.509409] test_siphash: SipHash2-4 cycles: 4049181 [0.510650] test_siphash: SipHash1-3 cycles: 2512884 [0.512205] test_siphash: HalfSipHash1-3 cycles: 3429920 [0.512904] test_siphash: JenkinsHash cycles: 978267 So, we map hsiphash() -> SipHash1-3 32-bit x86: [0.509868] test_siphash: SipHash2-4 cycles: 14812892 [0.513601] test_siphash: SipHash1-3 cycles: 9510710 [0.515263] test_siphash: HalfSipHash1-3 cycles: 3856157 [0.515952] test_siphash: JenkinsHash cycles: 1148567 So, we map hsiphash() -> HalfSipHash1-3 hsiphash() is roughly 3 times slower than jhash(), but comes with a considerable security improvement. Signed-off-by: Jason A. Donenfeld Cc: Jean-Philippe Aumasson --- Documentation/siphash.txt | 75 +++ include/linux/siphash.h | 56 +++- lib/siphash.c | 318 +- lib/test_siphash.c | 139 4 files changed, 561 insertions(+), 27 deletions(-) diff --git a/Documentation/siphash.txt b/Documentation/siphash.txt index 39ff7f0438e7..f93c1d7104c4 100644 --- a/Documentation/siphash.txt +++ b/Documentation/siphash.txt @@ -77,3 +77,78 @@ Linux implements the "2-4" variant of SipHash. Read the SipHash paper if you're interested in learning more: https://131002.net/siphash/siphash.pdf + + +~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~ + +HalfSipHash - SipHash's insecure younger cousin +--- +Written by Jason A. 
Donenfeld + +On the off-chance that SipHash is not fast enough for your needs, you might be +able to justify using HalfSipHash, a terrifying but potentially useful +possibility. HalfSipHash cuts SipHash's rounds down from "2-4" to "1-3" and, +even scarier, uses an easily brute-forceable 64-bit key (with a 32-bit output) +instead of SipHash's 128-bit key. However, this may appeal to some +high-performance `jhash` users. + +Danger! + +Do not ever use HalfSipHash except as a hashtable key function, and only +then when you can be absolutely certain that the outputs will never be +transmitted out of the kernel. This is only remotely useful over `jhash` as a +means of mitigating hashtable flooding denial of service attacks. + +1. Generating a key + +Keys should always be generated from a cryptographically secure source of +random numbers, either using get_random_bytes or get_random_once: + +hsiphash_key_t key; +get_random_bytes(key, sizeof(key)); + +If you're not deriving your key from here, you're doing it wrong. + +2. Using the functions + +There are two variants of the function, one that takes a list of integers, and +one that takes a buffer: + +u32 hsiphash(const void *data, size_t len, hsiphash_key_t key); + +And: + +u32 hsiphash_1u32(u32, hsiphash_key_t key); +u32 hsiphash_2u32(u32, u32, hsiphash_key_t key); +u32 hsiphash_3u32(u32, u32, u32, hsiphash_key_t key); +u32 hsiphash_4u32(u32, u32, u32, u32, hsiphash_key_t key); + +If you pass the generic hsiphash function something of a constant length, it +will constant fold at compile-time and automatically choose one of the +optimized functions. + +3. 
Hashtable key function usage: + +struct some_hashtable { + DECLARE_HASHTABLE(hashtable, 8); + hsiphash_key_t key; +}; + +void init_hashtable(struct some_hashtable *table) +{ + get_random_bytes(table->key, sizeof(table->key)); +} + +static inline struct hlist_head *some_hashtable_bucket(struct some_hashtable *table, struct interesting_input *input) +{ + return &table->hashtable[hsiphash(input, sizeof(*input), table->key) & (HASH_SIZE(table->hashtable) - 1)]; +} + +You may then iterate like usual over the returned hash bucket. + +4. Performance + +HalfSipHash is roughly 3 times slower than JenkinsHash. For many replacements, +this will not be a problem, as the hashtable lookup isn't the bottleneck. And +in general, this is probably a good sacrifice to make for the security and DoS +resistance of HalfSipHash. diff --git a/include/linux/siphash.h b/include/linux/siphash.h index 7aa666eb00d9..efab44c654f3 100644 --- a/include/linux/siphash.h +++ b/include/linux/siphash.h @@ -5,7 +5,9 @@ * SipHash: a fast short-input PRF * https://131002.net/siphash/ * - * This implementation is specifically for SipHash2-4. + * This
[PATCH v7 5/6] syncookies: use SipHash in place of SHA1
SHA1 is slower and less secure than SipHash, and so replacing syncookie generation with SipHash makes natural sense. Some BSDs have been doing this for several years in fact. The speedup should be similar -- and even more impressive -- to the speedup from the sequence number fix in this series. Signed-off-by: Jason A. Donenfeld Cc: Eric Dumazet Cc: David Miller --- net/ipv4/syncookies.c | 20 net/ipv6/syncookies.c | 37 - 2 files changed, 20 insertions(+), 37 deletions(-) diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c index 3e88467d70ee..03bb068f 100644 --- a/net/ipv4/syncookies.c +++ b/net/ipv4/syncookies.c @@ -13,13 +13,13 @@ #include #include #include -#include +#include #include #include #include #include -static u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS] __read_mostly; +static siphash_key_t syncookie_secret[2] __read_mostly; #define COOKIEBITS 24 /* Upper bits store count */ #define COOKIEMASK (((__u32)1 << COOKIEBITS) - 1) @@ -48,24 +48,12 @@ static u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS] __read_mostly; #define TSBITS 6 #define TSMASK (((__u32)1 << TSBITS) - 1) -static DEFINE_PER_CPU(__u32 [16 + 5 + SHA_WORKSPACE_WORDS], ipv4_cookie_scratch); - static u32 cookie_hash(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport, u32 count, int c) { - __u32 *tmp; - net_get_random_once(syncookie_secret, sizeof(syncookie_secret)); - - tmp = this_cpu_ptr(ipv4_cookie_scratch); - memcpy(tmp + 4, syncookie_secret[c], sizeof(syncookie_secret[c])); - tmp[0] = (__force u32)saddr; - tmp[1] = (__force u32)daddr; - tmp[2] = ((__force u32)sport << 16) + (__force u32)dport; - tmp[3] = count; - sha_transform(tmp + 16, (__u8 *)tmp, tmp + 16 + 5); - - return tmp[17]; + return siphash_4u32(saddr, daddr, (u32)sport << 16 | dport, count, + syncookie_secret[c]); } diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c index a4d49760bf43..be51fc0d99ad 100644 --- a/net/ipv6/syncookies.c +++ b/net/ipv6/syncookies.c @@ -16,7 +16,7 @@ #include #include -#include 
+#include #include #include #include @@ -24,7 +24,7 @@ #define COOKIEBITS 24 /* Upper bits store count */ #define COOKIEMASK (((__u32)1 << COOKIEBITS) - 1) -static u32 syncookie6_secret[2][16-4+SHA_DIGEST_WORDS] __read_mostly; +static siphash_key_t syncookie6_secret[2] __read_mostly; /* RFC 2460, Section 8.3: * [ipv6 tcp] MSS must be computed as the maximum packet size minus 60 [..] @@ -41,30 +41,25 @@ static __u16 const msstab[] = { 9000 - 60, }; -static DEFINE_PER_CPU(__u32 [16 + 5 + SHA_WORKSPACE_WORDS], ipv6_cookie_scratch); - static u32 cookie_hash(const struct in6_addr *saddr, const struct in6_addr *daddr, __be16 sport, __be16 dport, u32 count, int c) { - __u32 *tmp; + const struct { + struct in6_addr saddr; + struct in6_addr daddr; + u32 count; + u16 sport; + u16 dport; + } __aligned(SIPHASH_ALIGNMENT) combined = { + .saddr = *saddr, + .daddr = *daddr, + .count = count, + .sport = sport, + .dport = dport + }; net_get_random_once(syncookie6_secret, sizeof(syncookie6_secret)); - - tmp = this_cpu_ptr(ipv6_cookie_scratch); - - /* -* we have 320 bits of information to hash, copy in the remaining -* 192 bits required for sha_transform, from the syncookie6_secret -* and overwrite the digest with the secret -*/ - memcpy(tmp + 10, syncookie6_secret[c], 44); - memcpy(tmp, saddr, 16); - memcpy(tmp + 4, daddr, 16); - tmp[8] = ((__force u32)sport << 16) + (__force u32)dport; - tmp[9] = count; - sha_transform(tmp + 16, (__u8 *)tmp, tmp + 16 + 5); - - return tmp[17]; + return siphash(&combined, offsetofend(typeof(combined), dport), syncookie6_secret[c]); } static __u32 secure_tcp_syn_cookie(const struct in6_addr *saddr, -- 2.11.0
[PATCH v7 2/6] secure_seq: use SipHash in place of MD5
This gives a clear speed and security improvement. SipHash is both faster and more solid crypto than the aging MD5. Rather than manually filling MD5 buffers, for IPv6, we simply create the layout with a simple anonymous struct, for which gcc generates rather efficient code. For IPv4, we pass the values directly to the short input convenience functions. 64-bit x86_64: [1.683628] secure_tcpv6_sequence_number_md5# cycles: 99563527 [1.717350] secure_tcp_sequence_number_md5# cycles: 92890502 [1.741968] secure_tcpv6_sequence_number_siphash# cycles: 67825362 [1.762048] secure_tcp_sequence_number_siphash# cycles: 67485526 32-bit x86: [1.600012] secure_tcpv6_sequence_number_md5# cycles: 103227892 [1.634219] secure_tcp_sequence_number_md5# cycles: 94732544 [1.669102] secure_tcpv6_sequence_number_siphash# cycles: 96299384 [1.700165] secure_tcp_sequence_number_siphash# cycles: 86015473 Signed-off-by: Jason A. Donenfeld Cc: Andi Kleen Cc: David Miller Cc: David Laight Cc: Tom Herbert Cc: Hannes Frederic Sowa Cc: Eric Dumazet --- net/core/secure_seq.c | 135 -- 1 file changed, 54 insertions(+), 81 deletions(-) diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c index 88a8e429fc3e..3dc2689bcc64 100644 --- a/net/core/secure_seq.c +++ b/net/core/secure_seq.c @@ -1,3 +1,5 @@ +/* Copyright (C) 2016 Jason A. Donenfeld . All Rights Reserved. 
+ */
+
 #include
 #include
 #include
@@ -8,14 +10,14 @@
 #include
 #include
 #include
-
+#include
 #include

 #if IS_ENABLED(CONFIG_IPV6) || IS_ENABLED(CONFIG_INET)
+#include
 #include

-#define NET_SECRET_SIZE (MD5_MESSAGE_BYTES / 4)
-static u32 net_secret[NET_SECRET_SIZE] ____cacheline_aligned;
+static siphash_key_t net_secret;

 static __always_inline void net_secret_init(void)
 {
@@ -44,80 +46,65 @@ static u32 seq_scale(u32 seq)
 u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr,
				 __be16 sport, __be16 dport, u32 *tsoff)
 {
-	u32 secret[MD5_MESSAGE_BYTES / 4];
-	u32 hash[MD5_DIGEST_WORDS];
-	u32 i;
-
+	const struct {
+		struct in6_addr saddr;
+		struct in6_addr daddr;
+		__be16 sport;
+		__be16 dport;
+	} __aligned(SIPHASH_ALIGNMENT) combined = {
+		.saddr = *(struct in6_addr *)saddr,
+		.daddr = *(struct in6_addr *)daddr,
+		.sport = sport,
+		.dport = dport
+	};
+	u64 hash;
 	net_secret_init();
-	memcpy(hash, saddr, 16);
-	for (i = 0; i < 4; i++)
-		secret[i] = net_secret[i] + (__force u32)daddr[i];
-	secret[4] = net_secret[4] +
-		(((__force u16)sport << 16) + (__force u16)dport);
-	for (i = 5; i < MD5_MESSAGE_BYTES / 4; i++)
-		secret[i] = net_secret[i];
-
-	md5_transform(hash, secret);
-
-	*tsoff = sysctl_tcp_timestamps == 1 ? hash[1] : 0;
-	return seq_scale(hash[0]);
+	hash = siphash(&combined, offsetofend(typeof(combined), dport),
+		       net_secret);
+	*tsoff = sysctl_tcp_timestamps == 1 ? (hash >> 32) : 0;
+	return seq_scale(hash);
 }
 EXPORT_SYMBOL(secure_tcpv6_sequence_number);

 u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr,
			       __be16 dport)
 {
-	u32 secret[MD5_MESSAGE_BYTES / 4];
-	u32 hash[MD5_DIGEST_WORDS];
-	u32 i;
-
+	const struct {
+		struct in6_addr saddr;
+		struct in6_addr daddr;
+		__be16 dport;
+	} __aligned(SIPHASH_ALIGNMENT) combined = {
+		.saddr = *(struct in6_addr *)saddr,
+		.daddr = *(struct in6_addr *)daddr,
+		.dport = dport
+	};
 	net_secret_init();
-	memcpy(hash, saddr, 16);
-	for (i = 0; i < 4; i++)
-		secret[i] = net_secret[i] + (__force u32) daddr[i];
-	secret[4] = net_secret[4] + (__force u32)dport;
-	for (i = 5; i < MD5_MESSAGE_BYTES / 4; i++)
-		secret[i] = net_secret[i];
-
-	md5_transform(hash, secret);
-
-	return hash[0];
+	return siphash(&combined, offsetofend(typeof(combined), dport),
+		       net_secret);
 }
 EXPORT_SYMBOL(secure_ipv6_port_ephemeral);
 #endif

 #ifdef CONFIG_INET

+/* secure_tcp_sequence_number(a, b, 0, d) == secure_ipv4_port_ephemeral(a, b, d),
+ * but fortunately, `sport' cannot be 0 in any circumstances. If this changes,
+ * it would be easy enough to have the former function use siphash_4u32, passing
+ * the arguments as separate u32.
+ */
+
 u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
			       __be16 sport, __be16 dport, u32 *tsoff)
 {
-	u32 hash[MD5_DIGEST_WORDS];
-
+	u64 hash;
[PATCH v7 3/6] random: use SipHash in place of MD5
This duplicates the current algorithm for get_random_int/long, but uses
siphash instead. This comes with several benefits. It's certainly
faster and more cryptographically secure than MD5. This patch also
separates hashed fields into three values instead of one, in order to
increase diffusion.

The previous MD5 algorithm used a per-cpu MD5 state, which caused
successive calls to the function to chain upon each other. While it's
not entirely clear that this kind of chaining is absolutely necessary
when using a secure PRF like siphash, it can't hurt, and the timing of
the call chain does add a degree of natural entropy. So, in keeping
with this design, instead of the massive per-cpu 64-byte MD5 state,
there is instead a per-cpu previously returned value for chaining.

The speed benefits are substantial:

                | siphash | md5    | speedup |
 ---------------------------------------------
 get_random_long| 137130  | 415983 | 3.03x   |
 get_random_int | 86384   | 343323 | 3.97x   |

Signed-off-by: Jason A. Donenfeld
Cc: Jean-Philippe Aumasson
Cc: Ted Tso
---
 drivers/char/random.c  | 84 ++++++++++++++++++++++++++----------------
 include/linux/random.h |  1 -
 init/main.c            |  1 -
 3 files changed, 49 insertions(+), 37 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d6876d506220..ea9858d9d8b4 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -262,6 +262,7 @@
 #include
 #include
 #include
+#include
 #include
 #include

@@ -2042,17 +2043,31 @@ struct ctl_table random_table[] = {
 };
 #endif 	/* CONFIG_SYSCTL */

-static u32 random_int_secret[MD5_MESSAGE_BYTES / 4] ____cacheline_aligned;
-int random_int_secret_init(void)
+struct random_int_secret {
+	siphash_key_t secret;
+	u64 chaining;
+	unsigned long birthdate;
+	bool initialized;
+};
+static DEFINE_PER_CPU(struct random_int_secret, random_int_secret);
+
+enum {
+	SECRET_ROTATION_TIME = HZ * 60 * 5
+};
+
+static struct random_int_secret *get_random_int_secret(void)
 {
-	get_random_bytes(random_int_secret, sizeof(random_int_secret));
-	return 0;
+	struct random_int_secret *secret = &get_cpu_var(random_int_secret);
+	if (unlikely(!secret->initialized ||
+		     !time_is_after_jiffies(secret->birthdate + SECRET_ROTATION_TIME))) {
+		secret->initialized = true;
+		secret->birthdate = jiffies;
+		get_random_bytes(secret->secret, sizeof(secret->secret));
+	}
+	return secret;
 }

-static DEFINE_PER_CPU(__u32 [MD5_DIGEST_WORDS], get_random_int_hash)
-		__aligned(sizeof(unsigned long));
-
 /*
  * Get a random word for internal kernel use only. Similar to urandom but
  * with the goal of minimal entropy pool depletion. As a result, the random
@@ -2061,20 +2076,20 @@ static DEFINE_PER_CPU(__u32 [MD5_DIGEST_WORDS], get_random_int_hash)
  */
 unsigned int get_random_int(void)
 {
-	__u32 *hash;
-	unsigned int ret;
-
-	if (arch_get_random_int(&ret))
-		return ret;
-
-	hash = get_cpu_var(get_random_int_hash);
-
-	hash[0] += current->pid + jiffies + random_get_entropy();
-	md5_transform(hash, random_int_secret);
-	ret = hash[0];
-	put_cpu_var(get_random_int_hash);
-
-	return ret;
+	unsigned int arch_result;
+	u64 result;
+	struct random_int_secret *secret;
+
+	if (arch_get_random_int(&arch_result))
+		return arch_result;
+
+	secret = get_random_int_secret();
+	result = siphash_3u64(secret->chaining, jiffies,
+			      (u64)random_get_entropy() + current->pid,
+			      secret->secret);
+	secret->chaining += result;
+	put_cpu_var(secret);
+	return result;
 }
 EXPORT_SYMBOL(get_random_int);

@@ -2083,20 +2098,19 @@ EXPORT_SYMBOL(get_random_int);
  */
 unsigned long get_random_long(void)
 {
-	__u32 *hash;
-	unsigned long ret;
-
-	if (arch_get_random_long(&ret))
-		return ret;
-
-	hash = get_cpu_var(get_random_int_hash);
-
-	hash[0] += current->pid + jiffies + random_get_entropy();
-	md5_transform(hash, random_int_secret);
-	ret = *(unsigned long *)hash;
-	put_cpu_var(get_random_int_hash);
-
-	return ret;
+	unsigned long arch_result;
+	u64 result;
+	struct random_int_secret *secret;
+
+	if (arch_get_random_long(&arch_result))
+		return arch_result;
+
+	secret = get_random_int_secret();
+	result = siphash_3u64(secret->chaining, jiffies, random_get_entropy() +
+			      current->pid, secret->secret);
+	secret->chaining += result;
+	put_cpu_var(secret);
+	return result;
 }
 EXPORT_SYMBOL(get_random_long);

diff --git a/include/linux/random.h b/include/linux/random.h
index
[PATCH v7 4/6] md5: remove from lib and only live in crypto
The md5_transform function is no longer used anywhere in the tree,
except for the crypto api's actual implementation of md5, so we can drop
the function from lib and make it a static function of the crypto file,
where it belongs. There should be no new users of md5_transform, anyway,
since there are more modern ways of doing what it once achieved.

Signed-off-by: Jason A. Donenfeld
---
 crypto/md5.c | 95 +++++++++++++++++++++++++++++++++++++++++++++-
 lib/Makefile |  2 +-
 lib/md5.c    | 95 ----------------------------------------------
 3 files changed, 95 insertions(+), 97 deletions(-)
 delete mode 100644 lib/md5.c

diff --git a/crypto/md5.c b/crypto/md5.c
index 2355a7c25c45..f7ae1a48225b 100644
--- a/crypto/md5.c
+++ b/crypto/md5.c
@@ -21,9 +21,11 @@
 #include
 #include
 #include
-#include
 #include

+#define MD5_DIGEST_WORDS 4
+#define MD5_MESSAGE_BYTES 64
+
 const u8 md5_zero_message_hash[MD5_DIGEST_SIZE] = {
	0xd4, 0x1d, 0x8c, 0xd9, 0x8f, 0x00, 0xb2, 0x04,
	0xe9, 0x80, 0x09, 0x98, 0xec, 0xf8, 0x42, 0x7e,
@@ -47,6 +49,97 @@ static inline void cpu_to_le32_array(u32 *buf, unsigned int words)
	}
 }

+#define F1(x, y, z)	(z ^ (x & (y ^ z)))
+#define F2(x, y, z)	F1(z, x, y)
+#define F3(x, y, z)	(x ^ y ^ z)
+#define F4(x, y, z)	(y ^ (x | ~z))
+
+#define MD5STEP(f, w, x, y, z, in, s) \
+	(w += f(x, y, z) + in, w = (w<<s | w>>(32-s)) + x)
+
+static void md5_transform(__u32 *hash, __u32 const *in)
+{
+	u32 a, b, c, d;
+
+	a = hash[0];
+	b = hash[1];
+	c = hash[2];
+	d = hash[3];
+
+	MD5STEP(F1, a, b, c, d, in[0] + 0xd76aa478, 7);
+	MD5STEP(F1, d, a, b, c, in[1] + 0xe8c7b756, 12);
+	MD5STEP(F1, c, d, a, b, in[2] + 0x242070db, 17);
+	MD5STEP(F1, b, c, d, a, in[3] + 0xc1bdceee, 22);
+	MD5STEP(F1, a, b, c, d, in[4] + 0xf57c0faf, 7);
+	MD5STEP(F1, d, a, b, c, in[5] + 0x4787c62a, 12);
+	MD5STEP(F1, c, d, a, b, in[6] + 0xa8304613, 17);
+	MD5STEP(F1, b, c, d, a, in[7] + 0xfd469501, 22);
+	MD5STEP(F1, a, b, c, d, in[8] + 0x698098d8, 7);
+	MD5STEP(F1, d, a, b, c, in[9] + 0x8b44f7af, 12);
+	MD5STEP(F1, c, d, a, b, in[10] + 0xffff5bb1, 17);
+	MD5STEP(F1, b, c, d, a, in[11] + 0x895cd7be, 22);
+	MD5STEP(F1, a, b, c, d, in[12] + 0x6b901122, 7);
+	MD5STEP(F1, d, a, b, c, in[13] + 0xfd987193, 12);
+	MD5STEP(F1, c, d, a, b, in[14] + 0xa679438e, 17);
+	MD5STEP(F1, b, c, d, a, in[15] + 0x49b40821, 22);
+
+	MD5STEP(F2, a, b, c, d, in[1] + 0xf61e2562, 5);
+	MD5STEP(F2, d, a, b, c, in[6] + 0xc040b340, 9);
+	MD5STEP(F2, c, d, a, b, in[11] + 0x265e5a51, 14);
+	MD5STEP(F2, b, c, d, a, in[0] + 0xe9b6c7aa, 20);
+	MD5STEP(F2, a, b, c, d, in[5] + 0xd62f105d, 5);
+	MD5STEP(F2, d, a, b, c, in[10] + 0x02441453, 9);
+	MD5STEP(F2, c, d, a, b, in[15] + 0xd8a1e681, 14);
+	MD5STEP(F2, b, c, d, a, in[4] + 0xe7d3fbc8, 20);
+	MD5STEP(F2, a, b, c, d, in[9] + 0x21e1cde6, 5);
+	MD5STEP(F2, d, a, b, c, in[14] + 0xc33707d6, 9);
+	MD5STEP(F2, c, d, a, b, in[3] + 0xf4d50d87, 14);
+	MD5STEP(F2, b, c, d, a, in[8] + 0x455a14ed, 20);
+	MD5STEP(F2, a, b, c, d, in[13] + 0xa9e3e905, 5);
+	MD5STEP(F2, d, a, b, c, in[2] + 0xfcefa3f8, 9);
+	MD5STEP(F2, c, d, a, b, in[7] + 0x676f02d9, 14);
+	MD5STEP(F2, b, c, d, a, in[12] + 0x8d2a4c8a, 20);
+
+	MD5STEP(F3, a, b, c, d, in[5] + 0xfffa3942, 4);
+	MD5STEP(F3, d, a, b, c, in[8] + 0x8771f681, 11);
+	MD5STEP(F3, c, d, a, b, in[11] + 0x6d9d6122, 16);
+	MD5STEP(F3, b, c, d, a, in[14] + 0xfde5380c, 23);
+	MD5STEP(F3, a, b, c, d, in[1] + 0xa4beea44, 4);
+	MD5STEP(F3, d, a, b, c, in[4] + 0x4bdecfa9, 11);
+	MD5STEP(F3, c, d, a, b, in[7] + 0xf6bb4b60, 16);
+	MD5STEP(F3, b, c, d, a, in[10] + 0xbebfbc70, 23);
+	MD5STEP(F3, a, b, c, d, in[13] + 0x289b7ec6, 4);
+	MD5STEP(F3, d, a, b, c, in[0] + 0xeaa127fa, 11);
+	MD5STEP(F3, c, d, a, b, in[3] + 0xd4ef3085, 16);
+	MD5STEP(F3, b, c, d, a, in[6] + 0x04881d05, 23);
+	MD5STEP(F3, a, b, c, d, in[9] + 0xd9d4d039, 4);
+	MD5STEP(F3, d, a, b, c, in[12] + 0xe6db99e5, 11);
+	MD5STEP(F3, c, d, a, b, in[15] + 0x1fa27cf8, 16);
+	MD5STEP(F3, b, c, d, a, in[2] + 0xc4ac5665, 23);
+
+	MD5STEP(F4, a, b, c, d, in[0] + 0xf4292244, 6);
+	MD5STEP(F4, d, a, b, c, in[7] + 0x432aff97, 10);
+	MD5STEP(F4, c, d, a, b, in[14] + 0xab9423a7, 15);
+	MD5STEP(F4, b, c, d, a, in[5] + 0xfc93a039, 21);
+	MD5STEP(F4, a, b, c, d, in[12] + 0x655b59c3, 6);
+	MD5STEP(F4, d, a, b, c, in[3] + 0x8f0ccc92, 10);
+	MD5STEP(F4, c, d, a, b, in[10] + 0xffeff47d, 15);
+	MD5STEP(F4, b, c, d, a, in[1] + 0x85845dd1, 21);
+	MD5STEP(F4, a, b, c, d, in[8] + 0x6fa87e4f, 6);
+	MD5STEP(F4, d, a, b, c, in[15] + 0xfe2ce6e0, 10);
+	MD5STEP(F4, c, d, a, b, in[6] + 0xa3014314, 15);
+
[PATCH v7 1/6] siphash: add cryptographically secure PRF
SipHash is a 64-bit keyed hash function that is actually a
cryptographically secure PRF, like HMAC. Except SipHash is super fast,
and is meant to be used as a hashtable keyed lookup function, or as a
general PRF for short input use cases, such as sequence numbers or RNG
chaining.

For the first usage:

There are a variety of attacks known as "hashtable poisoning" in which
an attacker forms some data such that the hash of that data will be the
same, and then proceeds to fill up all entries of a hashbucket. This is
a realistic and well-known denial-of-service vector. Currently
hashtables use jhash, which is fast but not secure, and some kind of
rotating key scheme (or none at all, which isn't good). SipHash is meant
as a replacement for jhash in these cases.

There are a number of places in the kernel that are vulnerable to
hashtable poisoning attacks, either via userspace vectors or network
vectors, and there's no reliable mechanism inside the kernel at the
moment to fix it. The first step toward fixing these issues is actually
getting a secure primitive into the kernel for developers to use. Then
we can, bit by bit, port things over to it as deemed appropriate.

While SipHash is extremely fast for a cryptographically secure function,
it is likely a bit slower than the insecure jhash, and so replacements
will be evaluated on a case-by-case basis based on whether or not the
difference in speed is negligible and whether or not the current jhash
usage poses a real security risk.

For the second usage:

A few places in the kernel are using MD5 or SHA1 for creating secure
sequence numbers, syn cookies, port numbers, or fast random numbers.
SipHash is a faster, more fitting, and more secure replacement for MD5
in those situations. Replacing MD5 and SHA1 with SipHash for these uses
is obvious and straightforward, and so is submitted along with this
patch series. There shouldn't be much of a debate over its efficacy.
Dozens of languages are already using this internally for their hash
tables and PRFs. Some of the BSDs already use this in their kernels.
SipHash is a widely known high-speed solution to a widely known set of
problems, and it's time we catch up.

Signed-off-by: Jason A. Donenfeld
Cc: Jean-Philippe Aumasson
Cc: Linus Torvalds
Cc: Eric Biggers
Cc: David Laight
Cc: Eric Dumazet
---
 Documentation/siphash.txt |  79 ++++++++
 MAINTAINERS               |   7 ++
 include/linux/siphash.h   |  79 ++++++++
 lib/Kconfig.debug         |   6 +-
 lib/Makefile              |   5 +-
 lib/siphash.c             | 232 ++++++++++++++++++++++
 lib/test_siphash.c        | 119 ++++++++++++
 7 files changed, 522 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/siphash.txt
 create mode 100644 include/linux/siphash.h
 create mode 100644 lib/siphash.c
 create mode 100644 lib/test_siphash.c

diff --git a/Documentation/siphash.txt b/Documentation/siphash.txt
new file mode 100644
index ..39ff7f0438e7
--- /dev/null
+++ b/Documentation/siphash.txt
@@ -0,0 +1,79 @@
+         SipHash - a short input PRF
+---------------------------------------
+Written by Jason A. Donenfeld
+
+SipHash is a cryptographically secure PRF -- a keyed hash function -- that
+performs very well for short inputs, hence the name. It was designed by
+cryptographers Daniel J. Bernstein and Jean-Philippe Aumasson. It is intended
+as a replacement for some uses of: `jhash`, `md5_transform`, `sha_transform`,
+and so forth.
+
+SipHash takes a secret key filled with randomly generated numbers and either
+an input buffer or several input integers. It spits out an integer that is
+indistinguishable from random. You may then use that integer as part of secure
+sequence numbers, secure cookies, or mask it off for use in a hash table.
+
+1. Generating a key
+
+Keys should always be generated from a cryptographically secure source of
+random numbers, either using get_random_bytes or get_random_once:
+
+siphash_key_t key;
+get_random_bytes(key, sizeof(key));
+
+If you're not deriving your key from here, you're doing it wrong.
+
+2. Using the functions
+
+There are two variants of the function, one that takes a list of integers, and
+one that takes a buffer:
+
+u64 siphash(const void *data, size_t len, siphash_key_t key);
+
+And:
+
+u64 siphash_1u64(u64, siphash_key_t key);
+u64 siphash_2u64(u64, u64, siphash_key_t key);
+u64 siphash_3u64(u64, u64, u64, siphash_key_t key);
+u64 siphash_4u64(u64, u64, u64, u64, siphash_key_t key);
+u64 siphash_1u32(u32, siphash_key_t key);
+u64 siphash_2u32(u32, u32, siphash_key_t key);
+u64 siphash_3u32(u32, u32, u32, siphash_key_t key);
+u64 siphash_4u32(u32, u32, u32, u32, siphash_key_t key);
+
+If you pass the generic siphash function something of a constant length, it
[PATCH v7 0/6] The SipHash Patchset
Hey folks,

Again we've made huge progress, with this latest version now shipping
Jean-Philippe Aumasson's HalfSipHash, which should be much more
competitive with jhash (in addition to being more secure, of course).

There are dozens of little cleanups and improvements right and left
throughout this series, so I urge you to take a look at the whole thing.
I've tried to take into consideration lots of concerns and suggestions
from many of you over the last week.

There is also now documentation! And the test suite now has full
coverage of all functions. Finally, there's been some significant
benchmarking, and now a few commit messages have some performance
numbers.

Please send your Reviewed-by lines as you see fit.

Regards,
Jason

Jason A. Donenfeld (6):
  siphash: add cryptographically secure PRF
  secure_seq: use SipHash in place of MD5
  random: use SipHash in place of MD5
  md5: remove from lib and only live in crypto
  syncookies: use SipHash in place of SHA1
  siphash: implement HalfSipHash1-3 for hash tables

 Documentation/siphash.txt | 154 +++++
 MAINTAINERS               |   7 +
 crypto/md5.c              |  95 +++-
 drivers/char/random.c     |  84 ---
 include/linux/random.h    |   1 -
 include/linux/siphash.h   | 133 +++
 init/main.c               |   1 -
 lib/Kconfig.debug         |   6 +-
 lib/Makefile              |   7 +-
 lib/md5.c                 |  95 ----
 lib/siphash.c             | 548 ++++++++++++
 lib/test_siphash.c        | 208 ++++++
 net/core/secure_seq.c     | 135 +---
 net/ipv4/syncookies.c     |  20 +-
 net/ipv6/syncookies.c     |  37 ++--
 15 files changed, 1274 insertions(+), 257 deletions(-)
 create mode 100644 Documentation/siphash.txt
 create mode 100644 include/linux/siphash.h
 delete mode 100644 lib/md5.c
 create mode 100644 lib/siphash.c
 create mode 100644 lib/test_siphash.c

-- 
2.11.0
Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
On Wed, Dec 21, 2016 at 11:27 PM, Theodore Ts'o wrote:
> And "with enough registers" includes ARM and MIPS, right? So the only
> real problem is 32-bit x86, and you're right, at that point, only
> people who might care are people who are using a space-radiation
> hardened 386 --- and they're not likely to be doing high throughput
> TCP connections. :-)

Plus the benchmark was bogus anyway, and when I built a more specific
harness -- actually comparing the TCP sequence number functions --
SipHash was faster than MD5, even on register starved x86. So I think
we're fine and this chapter of the discussion can come to a close, in
order to move on to more interesting things.
Re: HalfSipHash Acceptable Usage
On Wed, Dec 21, 2016 at 01:37:51PM -0500, George Spelvin wrote:
> SipHash annihilates the competition on 64-bit superscalar hardware.
> SipHash dominates the field on 64-bit in-order hardware.
> SipHash wins easily on 32-bit hardware *with enough registers*.
> On register-starved 32-bit machines, it really struggles.

And "with enough registers" includes ARM and MIPS, right? So the only
real problem is 32-bit x86, and you're right, at that point, only
people who might care are people who are using a space-radiation
hardened 386 --- and they're not likely to be doing high throughput
TCP connections. :-)

					- Ted
[PATCHv2] crypto: testmgr: Use heap buffer for acomp test input
Christopher Covington reported a crash on aarch64 on recent Fedora
kernels:

kernel BUG at ./include/linux/scatterlist.h:140!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
Modules linked in:
CPU: 2 PID: 752 Comm: cryptomgr_test Not tainted 4.9.0-11815-ge93b1cc #162
Hardware name: linux,dummy-virt (DT)
task: 80007c650080 task.stack: 8891
PC is at sg_init_one+0xa0/0xb8
LR is at sg_init_one+0x24/0xb8
...
[] sg_init_one+0xa0/0xb8
[] test_acomp+0x10c/0x438
[] alg_test_comp+0xb0/0x118
[] alg_test+0x17c/0x2f0
[] cryptomgr_test+0x44/0x50
[] kthread+0xf8/0x128
[] ret_from_fork+0x10/0x50

The test vectors used for input are part of the kernel image. These
inputs are passed as a buffer to sg_init_one which eventually blows up
with BUG_ON(!virt_addr_valid(buf)). On arm64, virt_addr_valid returns
false for the kernel image since virt_to_page will not return the
correct page. Fix this by copying the input vectors to a heap buffer
before setting up the scatterlist.

Reported-by: Christopher Covington
Fixes: d7db7a882deb ("crypto: acomp - update testmgr with support for acomp")
Signed-off-by: Laura Abbott
---
 crypto/testmgr.c | 30 --
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index f616ad7..44e888b 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -1461,16 +1461,25 @@ static int test_acomp(struct crypto_acomp *tfm, struct comp_testvec *ctemplate,
 	for (i = 0; i < ctcount; i++) {
 		unsigned int dlen = COMP_BUF_SIZE;
 		int ilen = ctemplate[i].inlen;
+		void *input_vec;

+		input_vec = kmalloc(ilen, GFP_KERNEL);
+		if (!input_vec) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		memcpy(input_vec, ctemplate[i].input, ilen);
 		memset(output, 0, dlen);
 		init_completion(&result.completion);
-		sg_init_one(&src, ctemplate[i].input, ilen);
+		sg_init_one(&src, input_vec, ilen);
 		sg_init_one(&dst, output, dlen);

 		req = acomp_request_alloc(tfm);
 		if (!req) {
 			pr_err("alg: acomp: request alloc failed for %s\n",
 			       algo);
+			kfree(input_vec);
 			ret = -ENOMEM;
 			goto out;
 		}
@@ -1483,6 +1492,7 @@ static int test_acomp(struct crypto_acomp *tfm, struct comp_testvec *ctemplate,
 		if (ret) {
 			pr_err("alg: acomp: compression failed on test %d for %s: ret=%d\n",
 			       i + 1, algo, -ret);
+			kfree(input_vec);
 			acomp_request_free(req);
 			goto out;
 		}
@@ -1491,6 +1501,7 @@ static int test_acomp(struct crypto_acomp *tfm, struct comp_testvec *ctemplate,
 			pr_err("alg: acomp: Compression test %d failed for %s: output len = %d\n",
 			       i + 1, algo, req->dlen);
 			ret = -EINVAL;
+			kfree(input_vec);
 			acomp_request_free(req);
 			goto out;
 		}
@@ -1500,26 +1511,37 @@ static int test_acomp(struct crypto_acomp *tfm, struct comp_testvec *ctemplate,
 			       i + 1, algo);
 			hexdump(output, req->dlen);
 			ret = -EINVAL;
+			kfree(input_vec);
 			acomp_request_free(req);
 			goto out;
 		}

+		kfree(input_vec);
 		acomp_request_free(req);
 	}

 	for (i = 0; i < dtcount; i++) {
 		unsigned int dlen = COMP_BUF_SIZE;
 		int ilen = dtemplate[i].inlen;
+		void *input_vec;
+
+		input_vec = kmalloc(ilen, GFP_KERNEL);
+		if (!input_vec) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		memcpy(input_vec, dtemplate[i].input, ilen);
 		memset(output, 0, dlen);
 		init_completion(&result.completion);
-		sg_init_one(&src, dtemplate[i].input, ilen);
+		sg_init_one(&src, input_vec, ilen);
 		sg_init_one(&dst, output, dlen);

 		req = acomp_request_alloc(tfm);
 		if (!req) {
 			pr_err("alg: acomp: request alloc failed for %s\n",
 			       algo);
+			kfree(input_vec);
 			ret = -ENOMEM;
 			goto out;
 		}
@@ -1532,6 +1554,7 @@ static int test_acomp(struct crypto_acomp *tfm, struct comp_testvec *ctemplate,
 		if (ret) {
 			pr_err("alg: acomp: decompression failed on test %d for %s: ret=%d\n",
Re: algif for compression?
On Fri, Dec 16, 2016 at 07:41:23PM +0530, abed mohammad kamaluddin wrote:
> Thanks Herbert. Are there timelines or ongoing efforts for moving
> IPcomp/Ipsec to use acomp? Or any proposals that have been or need to
> be taken up in this regard.

Someone needs to write the patches :)
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: HalfSipHash Acceptable Usage
On Wed, Dec 21, 2016 at 7:37 PM, George Spelvin wrote:
> SipHash annihilates the competition on 64-bit superscalar hardware.
> SipHash dominates the field on 64-bit in-order hardware.
> SipHash wins easily on 32-bit hardware *with enough registers*.
> On register-starved 32-bit machines, it really struggles.
>
> As I explained, in that last case, SipHash barely wins at all.
> (On a P4, it actually *loses* to MD5, not that anyone cares. Running
> on a P4 and caring about performance are mutually exclusive.)

From the discussion off list which examined your benchmark code, it
looks like we're going to move ahead with SipHash.
Re: HalfSipHash Acceptable Usage
Eric Dumazet wrote:
> Now I am quite confused.
>
> George said :
>> Cycles per byte on 1024 bytes of data:
>>                  Pentium  Core 2  Ivy
>>                  4        Duo     Bridge
>> SipHash-2-4      38.9     8.3     5.8
>> HalfSipHash-2-4  12.7     4.5     3.2
>> MD5              8.3      5.7     4.7
>
> That really was for 1024 bytes blocks, so pretty much useless for our
> discussion ?

No, they're actually quite relevant, but you have to interpret them
correctly. I thought I explained in the text following that table, but
let me make it clearer:

To find the time to compute the SipHash of N bytes, round (N+17) up to
the next multiple of 8 bytes and multiply by the numbers above.

To find the time to compute the HalfSipHash of N bytes, round (N+9) up
to the next multiple of 4 bytes and multiply by the numbers above.

To find the time to compute the MD5 of N bytes, round (N+9) up to the
next multiple of 64 bytes and multiply by the numbers above.

It's the different rounding rules that make all the difference. For
small input blocks, SipHash can be slower per byte yet still faster
because it hashes fewer bytes.

> Reading your numbers last week, I thought SipHash was faster, but George
> numbers are giving the opposite impression.

SipHash annihilates the competition on 64-bit superscalar hardware.
SipHash dominates the field on 64-bit in-order hardware.
SipHash wins easily on 32-bit hardware *with enough registers*.
On register-starved 32-bit machines, it really struggles.

As I explained, in that last case, SipHash barely wins at all.
(On a P4, it actually *loses* to MD5, not that anyone cares. Running
on a P4 and caring about performance are mutually exclusive.)
Re: HalfSipHash Acceptable Usage
Linus wrote:
>> How much does kernel_fpu_begin()/kernel_fpu_end() cost?
>
> It's now better than it used to be, but it's absolutely disastrous
> still. We're talking easily many hundreds of cycles. Under some loads,
> thousands.

I think I've been thoroughly dissuaded, but just to clarify one thing
that resembles a misunderstanding:

> In contrast, in reality, especially with things like "do it once or
> twice per incoming packet", you'll easily hit the absolute worst
> cases, where not only does it take a few hundred cycles to save the FP
> state, you'll then return to user space in between packets, which
> triggers the slow-path return code and reloads the FP state, which is
> another few hundred cycles plus.

Everything being discussed is per-TCP-connection overhead, *not* per
packet. (Twice for outgoing connections, because one is to generate
the ephemeral port number.) I know you know this, but I don't want
anyone spectating to be confused about it.
Re: HalfSipHash Acceptable Usage
On Wed, Dec 21, 2016 at 7:55 AM, George Spelvin wrote:
>
> How much does kernel_fpu_begin()/kernel_fpu_end() cost?

It's now better than it used to be, but it's absolutely disastrous
still. We're talking easily many hundreds of cycles. Under some loads,
thousands.

And I warn you already: it will _benchmark_ a hell of a lot better than
it will work in reality. In benchmarks, you'll hit all the
optimizations ("oh, I've already saved away all the FP registers, no
need to do it again").

In contrast, in reality, especially with things like "do it once or
twice per incoming packet", you'll easily hit the absolute worst cases,
where not only does it take a few hundred cycles to save the FP state,
you'll then return to user space in between packets, which triggers the
slow-path return code and reloads the FP state, which is another few
hundred cycles plus.

Similarly, in benchmarks you'll hit the "modern CPU's power on the AVX
unit and keep it powered up for a while afterwards", while in real life
you would quite easily hit the "oh, AVX is powered down because we were
idle, now it powers up at half speed which is another latency hit _and_
the AVX unit won't run full out anyway".

Don't do it. There are basically no real situations where the AVX state
optimizations help for the kernel. We just don't have the loop counts
to make up for the problems it causes.

The one exception is likely if you're doing things like high-throughput
disk IO encryption, and then you'd be much better off using SHA256
instead (which often has hw encryption on modern CPU's - both x86 and
ARM).

(I'm sure that you could see it on some high-throughput network
benchmark too when the benchmark entirely saturates the CPU. And then
in real life it would suck horribly for all the reasons above).

               Linus
Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
On Wed, 2016-12-21 at 11:39 -0500, Rik van Riel wrote:
> Does anybody still have a P4?
>
> If they do, they're probably better off replacing
> it with an Atom. The reduced power bills will pay
> for replacing that P4 within a year or two.

Well, maybe they have millions of units to replace.

> In short, I am not sure how important the P4
> performance numbers are, especially if we can
> improve security for everybody else...

Worth adding that the ISN or syncookie generation are less than 10% of
the actual cost of handling a problematic (having to generate ISN or
syncookie) TCP packet anyway.

So we are talking of minor potential impact for '2000-era' cpus.

Definitely I vote for using SipHash in TCP ASAP.
Re: [PATCH] crypto: testmgr: Use linear alias for test input
On Mon, Dec 19, 2016 at 03:37:26PM -0800, Laura Abbott wrote:
> Christopher Covington reported a crash on aarch64 on recent Fedora
> kernels:
>
> kernel BUG at ./include/linux/scatterlist.h:140!
> Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> Modules linked in:
> CPU: 2 PID: 752 Comm: cryptomgr_test Not tainted 4.9.0-11815-ge93b1cc #162
> Hardware name: linux,dummy-virt (DT)
> task: 80007c650080 task.stack: 8891
> PC is at sg_init_one+0xa0/0xb8
> LR is at sg_init_one+0x24/0xb8
> ...
> [] sg_init_one+0xa0/0xb8
> [] test_acomp+0x10c/0x438
> [] alg_test_comp+0xb0/0x118
> [] alg_test+0x17c/0x2f0
> [] cryptomgr_test+0x44/0x50
> [] kthread+0xf8/0x128
> [] ret_from_fork+0x10/0x50
>
> The test vectors used for input are part of the kernel image. These
> inputs are passed as a buffer to sg_init_one which eventually blows up
> with BUG_ON(!virt_addr_valid(buf)). On arm64, virt_addr_valid returns
> false for the kernel image since virt_to_page will not return the
> correct page. The kernel image is also aliased to the linear map so get
> the linear alias and pass that to the scatterlist instead.
>
> Reported-by: Christopher Covington
> Fixes: d7db7a882deb ("crypto: acomp - update testmgr with support for acomp")
> Signed-off-by: Laura Abbott
> ---
> x86 supports virt_addr_valid working on kernel image addresses but arm64 is
> more strict. This is the direction things have been moving with my
> CONFIG_DEBUG_VIRTUAL series for arm64 which is tightening the definition of
> __pa/__pa_symbol.

Please fix this by copying the templates to kmalloced memory like the
other test_* functions.

Thanks,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: Test AEAD/authenc algorithms from userspace
On Mon, Dec 19, 2016 at 04:08:11PM +0530, Harsh Jain wrote: > Hi Herbert, > > TLS default mode of operation is MAC-then-Encrypt for Authenc algos. > Currently framework only supports EtM used in IPSec. User space > programs like openssl cannot use af-alg interface to encrypt/decrypt > in TLS mode. > Are we going to support Mac-then-Encrypt mode in future kernel releases? If someone finally adds TLS to the kernel then we'll likely do something about it. Otherwise you can just separate it out into two operations via af-alg. Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
On Wed, 2016-12-21 at 10:55 -0500, George Spelvin wrote: > Actually, DJB just made a very relevant suggestion. > > As I've mentioned, the 32-bit performance problems are an x86-specific > problem. ARM does very well, and other processors aren't bad at all. > > SipHash fits very nicely (and runs very fast) in the MMX registers. > > They're 64 bits, and there are 8 of them, so the integer registers > can > be reserved for pointers and loop counters and all that. And there's > reference code available. > > How much does kernel_fpu_begin()/kernel_fpu_end() cost? Those can be very expensive. Almost certainly not worth it for small amounts of data. -- All Rights Reversed.
Re: HalfSipHash Acceptable Usage
Hi Eric, On Wed, Dec 21, 2016 at 4:56 PM, Eric Dumazet wrote: > That really was for 1024 bytes blocks, so pretty much useless for our > discussion ? > > Reading your numbers last week, I thought SipHash was faster, but George's > numbers are giving the opposite impression. > > I do not have a P4 to make tests, so I can only trust you or George. I'm not sure how George came up with those numbers, but the ones I sent are output from that benchmark function in the last email. I'd be interested in learning this too. As mentioned in the last email, it looks like potential 32-bit issues are really just specific to old Intel chips. Other 32-bit architectures do fine. So, for new kernels, even if somehow there is a tiny performance regression (though I couldn't see one) on old architectures, I really doubt it will affect anybody in practice. Jason
Re: HalfSipHash Acceptable Usage
Hi George, On Wed, Dec 21, 2016 at 4:55 PM, George Spelvin wrote: > Actually, DJB just made a very relevant suggestion. > > As I've mentioned, the 32-bit performance problems are an x86-specific > problem. ARM does very well, and other processors aren't bad at all. > > SipHash fits very nicely (and runs very fast) in the MMX registers. > > They're 64 bits, and there are 8 of them, so the integer registers can > be reserved for pointers and loop counters and all that. And there's > reference code available. > > How much does kernel_fpu_begin()/kernel_fpu_end() cost? In my experience, these functions are only worth calling when processing more significant amounts of data. I don't have any benchmarks, but when I _remove_ all of these calls in a kernel, accelerated crypto gets noticeably faster (until the system crashes). We can measure it, though. By the way, if somehow SipHash becomes acceptably fast on x86, would you consider HalfSipHash for hash tables to be no longer needed? Or do you suspect that HalfSipHash will always be faster even on, say, 32-bit ARM? Jason
Re: HalfSipHash Acceptable Usage
On Wed, 2016-12-21 at 15:42 +0100, Jason A. Donenfeld wrote: > Hi Eric, > > I computed performance numbers for both 32-bit and 64-bit using the > actual functions in which talking about replacing MD5 with SipHash. > The basic harness is here [1] if you're curious. SipHash was a pretty > clear winner for both cases. > > x86_64: > [1.714302] secure_tcpv6_sequence_number_md5# cycles: 102373398 > [1.747685] secure_tcp_sequence_number_md5# cycles: 92042258 > [1.773522] secure_tcpv6_sequence_number_siphash# cycles: 70786533 > [1.798701] secure_tcp_sequence_number_siphash# cycles: 68941043 > > x86: > [1.635749] secure_tcpv6_sequence_number_md5# cycles: 106016335 > [1.670259] secure_tcp_sequence_number_md5# cycles: 95670512 > [1.708387] secure_tcpv6_sequence_number_siphash# cycles: 105988635 > [1.740264] secure_tcp_sequence_number_siphash# cycles: 88225395 > > >>> 102373398 > 70786533 > True > >>> 92042258 > 68941043 > True > >>> 106016335 > 105988635 > True > >>> 95670512 > 88225395 > True > > While MD5 is probably faster for some kind of large-data > cycles-per-byte, due to its 64-byte internal state, SipHash -- the > "Sip" part standing "Short Input PRF" -- is fast for shorter inputs. > In practice with the functions we're talking about replacing, there's > no need to hash 64-bytes. So, SipHash comes out faster and more > secure. > > I also haven't begun to look focusedly at the assembly my SipHash > implemention is generating, which means there's still window for even > more performance improvements. > > Jason > > > [1] > https://git.zx2c4.com/linux-dev/tree/net/core/secure_seq.c?h=siphash-bench#n194 Now I am quite confused. George said : > Cycles per byte on 1024 bytes of data: > Pentium 4 Core 2 Duo Ivy Bridge > SipHash-2-4 38.9 8.3 5.8 > HalfSipHash-2-4 12.7 4.5 3.2 > MD5 8.3 5.7 4.7 That really was for 1024 bytes blocks, so pretty much useless for our discussion ? 
Reading your numbers last week, I thought SipHash was faster, but George's numbers are giving the opposite impression. I do not have a P4 to make tests, so I can only trust you or George. Thanks.
Re: HalfSipHash Acceptable Usage
Actually, DJB just made a very relevant suggestion. As I've mentioned, the 32-bit performance problems are an x86-specific problem. ARM does very well, and other processors aren't bad at all. SipHash fits very nicely (and runs very fast) in the MMX registers. They're 64 bits, and there are 8 of them, so the integer registers can be reserved for pointers and loop counters and all that. And there's reference code available. How much does kernel_fpu_begin()/kernel_fpu_end() cost? Although there are a lot of pre-MMX x86es in embedded control applications, I don't think anyone is worried about their networking performance. (Specifically, all of this affects only connection setup, not throughput on established connections.)
Re: HalfSipHash Acceptable Usage
Hi Eric, I computed performance numbers for both 32-bit and 64-bit using the actual functions in which we're talking about replacing MD5 with SipHash. The basic harness is here [1] if you're curious. SipHash was a pretty clear winner for both cases. x86_64: [1.714302] secure_tcpv6_sequence_number_md5# cycles: 102373398 [1.747685] secure_tcp_sequence_number_md5# cycles: 92042258 [1.773522] secure_tcpv6_sequence_number_siphash# cycles: 70786533 [1.798701] secure_tcp_sequence_number_siphash# cycles: 68941043 x86: [1.635749] secure_tcpv6_sequence_number_md5# cycles: 106016335 [1.670259] secure_tcp_sequence_number_md5# cycles: 95670512 [1.708387] secure_tcpv6_sequence_number_siphash# cycles: 105988635 [1.740264] secure_tcp_sequence_number_siphash# cycles: 88225395 >>> 102373398 > 70786533 True >>> 92042258 > 68941043 True >>> 106016335 > 105988635 True >>> 95670512 > 88225395 True While MD5 is probably faster for some kind of large-data cycles-per-byte workload, due to its 64-byte internal state, SipHash -- the "Sip" part standing for "Short Input PRF" -- is fast for shorter inputs. In practice with the functions we're talking about replacing, there's no need to hash 64 bytes. So, SipHash comes out faster and more secure. I also haven't begun to look closely at the assembly my SipHash implementation is generating, which means there's still room for even more performance improvements. Jason [1] https://git.zx2c4.com/linux-dev/tree/net/core/secure_seq.c?h=siphash-bench#n194
Re: HalfSipHash Acceptable Usage
Hi George, On Wed, Dec 21, 2016 at 7:34 AM, George Spelvin wrote: > In fact, I have an idea. Allow me to make the following concrete > suggestion for using HalfSipHash with 128 bits of key material: > > - 64 bits are used as the key. > - The other 64 bits are used as an IV which is prepended to > the message to be hashed. > > As a matter of practical implementation, we precompute the effect > of hashing the IV and store the 128-bit HalfSipHash state, which > is used just like a 128-bit key. > > Because of the way it is constructed, it is obviously no weaker than > standard HalfSipHash's 64-bit security claim. > > I don't know the security of this, and it's almost certainly weaker than > 128 bits, but I *hope* it's at least a few bits stronger than 64 bits. > 80 would be enough to dissuade any attacker without a six-figure budget > (that's per attack, not a one-time capital investment). 96 would be > ample for our purposes. > > What I do know is that it makes a brute-force attack without > significant cryptanalytic effort impossible. Depends who's doing the cryptanalytic effort, I guess. Please don't roll your own crypto. It's a dangerous road. Putting homebrew crypto into the kernel would be an error. Let's stick with the constructions and security margins that the cryptographers give us. JP made that fairly clear, I thought. There are already people working on this problem who undergo peer review and have careers devoted to solving these problems. One result for small systems that need 128-bit security is Chaskey, which you can go read about if you're curious. Jason
Re: [PATCH v3 2/3] drivers: crypto: Add the Virtual Function driver for CPT
Hello, I have some comments inline. On Wed, Dec 21, 2016 at 11:56:12AM +, george.cher...@cavium.com wrote: > From: George Cherian> > Enable the CPT VF driver. CPT is the cryptographic Accelaration Unit typo: acceleration [...] > +static inline void update_input_data(struct cpt_request_info *req_info, > + struct scatterlist *inp_sg, > + u32 nbytes, u32 *argcnt) > +{ > + req_info->req.dlen += nbytes; > + > + while (nbytes) { > + u32 len = min(nbytes, inp_sg->length); > + u8 *ptr = page_address(sg_page(inp_sg)) + inp_sg->offset; You could use sg_virt() instead. But have you tested your accelerator with user space data (via cryptodev/AF_ALG)? As far as I remember, you had better use kmap() instead of this direct memory address. [...] > +static inline u32 cvm_enc_dec(struct ablkcipher_request *req, u32 enc, > + u32 cipher_type) > +{ > + struct crypto_ablkcipher *tfm = crypto_ablkcipher_reqtfm(req); > + struct cvm_enc_ctx *ctx = crypto_ablkcipher_ctx(tfm); > + u32 key_type = AES_128_BIT; > + struct cvm_req_ctx *rctx = ablkcipher_request_ctx(req); > + u32 enc_iv_len = crypto_ablkcipher_ivsize(tfm); > + struct fc_context *fctx = &rctx->fctx; > + struct cpt_request_info *req_info = &rctx->cpt_req; > + void *cdev = NULL; > + u32 status = -1; Doable but dangerous. Furthermore, cptvf_do_request() returns int, so why use u32? [...] > +void cvm_enc_dec_exit(struct crypto_tfm *tfm) > +{ > + return; > +} So you could remove all references to this function. [...] > +static inline int cav_register_algs(void) > +{ > + int err = 0; > + > + err = crypto_register_algs(algs, ARRAY_SIZE(algs)); > + if (err) { > + pr_err("Error in aes module init %d\n", err); > + return -1; This is not a standard error code. [...] > diff --git a/drivers/crypto/cavium/cpt/cptvf_algs.h > b/drivers/crypto/cavium/cpt/cptvf_algs.h > new file mode 100644 > index 000..fcb287b > --- /dev/null > +++ b/drivers/crypto/cavium/cpt/cptvf_algs.h [...] 
> + > +u32 cptvf_do_request(void *cptvf, struct cpt_request_info *req); later this function is defined to return int [...] > +static int cptvf_probe(struct pci_dev *pdev, const struct pci_device_id *ent) > +{ > + struct device *dev = &pdev->dev; > + struct cpt_vf *cptvf; > + int err; > + > + cptvf = devm_kzalloc(dev, sizeof(struct cpt_vf), GFP_KERNEL); use sizeof(*cptvf) and run checkpatch [...] > +static int setup_sgio_components(struct cpt_vf *cptvf, struct buf_ptr *list, > + int buf_count, u8 *buffer) > +{ > + int ret = 0, i, j; > + int components; > + struct sglist_component *sg_ptr = NULL; > + struct pci_dev *pdev = cptvf->pdev; > + > + if (unlikely(!list)) { > + pr_err("Input List pointer is NULL\n"); > + ret = -EFAULT; > + return ret; You could directly return -EFAULT and use dev_err() > + } > + > + for (i = 0; i < buf_count; i++) { > + if (likely(list[i].vptr)) { > + list[i].dma_addr = dma_map_single(&pdev->dev, > + list[i].vptr, > + list[i].size, > + DMA_BIDIRECTIONAL); > + if (unlikely(dma_mapping_error(&pdev->dev, > +list[i].dma_addr))) { > + pr_err("DMA map kernel buffer failed for > component: %d\n", > +i); Use dev_err() [...] 
> + u16 g_sz_bytes = 0, s_sz_bytes = 0; > + int ret = 0; > + struct pci_dev *pdev = cptvf->pdev; > + > + if (req->incnt > MAX_SG_IN_CNT || req->outcnt > MAX_SG_OUT_CNT) { > + pr_err("Requestes SG components are higher than supported\n"); typo: "Requested", and use dev_err. In all the files you have some pr_* calls that would be better as dev_* > + ret = -EINVAL; > + goto scatter_gather_clean; > + } > + > + /* Setup gather (input) components */ > + g_sz_bytes = ((req->incnt + 3) / 4) * sizeof(struct sglist_component); > + info->gather_components = kzalloc((g_sz_bytes), GFP_KERNEL); unnecessary parentheses > + if (!info->gather_components) { > + ret = -ENOMEM; > + goto scatter_gather_clean; > + } > + > + ret = setup_sgio_components(cptvf, req->in, > + req->incnt, > + info->gather_components); > + if (ret) { > + pr_err("Failed to setup gather list\n"); > + ret = -EFAULT; > + goto scatter_gather_clean; > + } > + > + /* Setup scatter (output) components */ > + s_sz_bytes = ((req->outcnt + 3) / 4) * sizeof(struct sglist_component); > +
Re: dm-crypt optimization
On 12/20/2016 10:41 AM, Binoy Jayan wrote: > At a high level the goal is to maximize the size of data blocks that get > passed > to hardware accelerators, minimizing the overhead from setting up and tearing > down operations in the hardware. Currently dm-crypt itself is a big blocker as > it manually implements ESSIV and similar algorithms which allow per-block > encryption of the data so the low level operations from the crypto API can > only operate on a single block. This is done because currently the crypto API > doesn't have software implementations of these algorithms itself so dm-crypt > can't rely on it being able to provide the functionality. The plan to address > this was to provide some software implementations in the crypto API, then > update dm-crypt to rely on those. Even for a pure software implementation > with no hardware acceleration that should hopefully provide a small > optimization as we need to call into the crypto API less often but it's likely > to be marginal given the overhead of crypto, the real win would be on a system > that has an accelerator that can replace the software implementation. > > Currently dm-crypt handles data only in single blocks. This means that it > can't > make good use of hardware cryptography engines since there is an overhead to > each transaction with the engine but transfers must be split into block sized > chunks. Allowing the transfer of larger blocks e.g. 'struct bio', could > mitigate against these costs and could improve performance in operating > systems > with encrypted filesystems. Although qualcomm chipsets support another variant > of the device-mapper dm-req-crypt, it is not something generic and in > mainline-able state. Also, it only supports 'XTS-AES' mode of encryption and > is not compatible with other modes supported by dm-crypt. So the core problem is that your crypto accelerator can operate efficiently only with bigger batch sizes. 
How big do the blocks need to be for your crypto hw to operate more efficiently? What about 4k blocks (no batching), could that be a usable trade-off? With some (backward incompatible) changes in the LUKS format I would like to see support for encryption blocks equal to the sector size, which for a 4k drive basically means a 4k encryption block. (This should decrease overhead; right now everything is processed in 512-byte blocks only.) Support for bigger block sizes would be unsafe without an additional mechanism that provides atomic writes of multiple sectors. (Maybe that applies to 4k as well on some devices, though...) The above is not going against your proposal, I am just curious whether this is enough to provide better performance on your hw accelerator or not. Milan > However, there are some challenges and a few possibilities to address this. I > request you to provide your suggestions on whether the points mentioned below > make sense and if it could be done differently. > > 1. Move the 'real' IV generation algorithms to the crypto layer (e.g. essiv) > 2. Increase the 'length' of the scatterlist nodes used in the crypto api. It >can be made equal to the size of a main memory segment (as defined in >'struct bio') as they are physically contiguous. > 3. Multiple segments in 'struct bio' can be represented as a scatterlist of all >segments in a 'struct bio'. > > 4. Move algorithms 'lmk' and 'tcw' (which are IVs combined with hacks to the >cbc mode) to create a customized cbc algorithm, implemented in a separate >file (e.g. cbc_lmk.c/cbc_tcw.c). As Milan suggested, these can't be treated >as real IVs as these include hacks to the cbc mode (and directly manipulate >encrypted data). > > 5. Move key selection logic to user space or always assume keycount as '1' >(as mentioned in the dm-crypt param format below) so that the key selection >logic does not have to be dependent on the sector number. 
This is necessary >as the key is selected otherwise based on sector number: > >key_index = sector & (key_count - 1) > >If block size for scatterlist nodes are increased beyond sector boundary >(which is what we plan to achieve, for performance), the key set for every >cipher operation cannot be changed at the sector level. > >dm-crypt param format : cipher[:keycount]-mode-iv:ivopts >Example : aes:2-cbc-essiv:sha256 > >Also as Milan suggested, it is not wise to move the key selection logic to >the crypto layer as it will prevent any changes to the key structure later. > > The following is a reference to an earlier patchset. It had the cipher mode > 'cbc' mixed up with the IV algorithms and is usually not the preferred way. > > Reference: > https://lkml.org/lkml/2016/12/13/65 > https://lkml.org/lkml/2016/12/13/66
[PATCH v3 3/3] drivers: crypto: Enable CPT options crypto for build
From: George Cherian. Add the CPT options in crypto Kconfig and update the crypto Makefile. Signed-off-by: George Cherian Reviewed-by: David Daney --- drivers/crypto/Kconfig | 1 + drivers/crypto/Makefile | 1 + 2 files changed, 2 insertions(+) diff --git a/drivers/crypto/Kconfig b/drivers/crypto/Kconfig index 4d2b81f..15f9040 100644 --- a/drivers/crypto/Kconfig +++ b/drivers/crypto/Kconfig @@ -484,6 +484,7 @@ config CRYPTO_DEV_MXS_DCP will be called mxs-dcp. source "drivers/crypto/qat/Kconfig" +source "drivers/crypto/cavium/cpt/Kconfig" config CRYPTO_DEV_QCE tristate "Qualcomm crypto engine accelerator" diff --git a/drivers/crypto/Makefile b/drivers/crypto/Makefile index ad7250f..dd33290 100644 --- a/drivers/crypto/Makefile +++ b/drivers/crypto/Makefile @@ -32,3 +32,4 @@ obj-$(CONFIG_CRYPTO_DEV_VMX) += vmx/ obj-$(CONFIG_CRYPTO_DEV_SUN4I_SS) += sunxi-ss/ obj-$(CONFIG_CRYPTO_DEV_ROCKCHIP) += rockchip/ obj-$(CONFIG_CRYPTO_DEV_CHELSIO) += chelsio/ +obj-$(CONFIG_CRYPTO_DEV_CPT) += cavium/cpt/ -- 2.1.4
[PATCH v3 2/3] drivers: crypto: Add the Virtual Function driver for CPT
From: George CherianEnable the CPT VF driver. CPT is the cryptographic Accelaration Unit in Octeon-tx series of processors. Signed-off-by: George Cherian Reviewed-by: David Daney --- drivers/crypto/cavium/cpt/Makefile | 3 +- drivers/crypto/cavium/cpt/cptvf.h| 135 drivers/crypto/cavium/cpt/cptvf_algs.c | 424 drivers/crypto/cavium/cpt/cptvf_algs.h | 110 drivers/crypto/cavium/cpt/cptvf_main.c | 945 +++ drivers/crypto/cavium/cpt/cptvf_mbox.c | 205 ++ drivers/crypto/cavium/cpt/cptvf_reqmanager.c | 581 drivers/crypto/cavium/cpt/request_manager.h | 147 + 8 files changed, 2549 insertions(+), 1 deletion(-) create mode 100644 drivers/crypto/cavium/cpt/cptvf.h create mode 100644 drivers/crypto/cavium/cpt/cptvf_algs.c create mode 100644 drivers/crypto/cavium/cpt/cptvf_algs.h create mode 100644 drivers/crypto/cavium/cpt/cptvf_main.c create mode 100644 drivers/crypto/cavium/cpt/cptvf_mbox.c create mode 100644 drivers/crypto/cavium/cpt/cptvf_reqmanager.c create mode 100644 drivers/crypto/cavium/cpt/request_manager.h diff --git a/drivers/crypto/cavium/cpt/Makefile b/drivers/crypto/cavium/cpt/Makefile index fe3d454..dbf055e 100644 --- a/drivers/crypto/cavium/cpt/Makefile +++ b/drivers/crypto/cavium/cpt/Makefile @@ -1,2 +1,3 @@ -obj-$(CONFIG_CAVIUM_CPT) += cptpf.o +obj-$(CONFIG_CAVIUM_CPT) += cptpf.o cptvf.o cptpf-objs := cptpf_main.o cptpf_mbox.o +cptvf-objs := cptvf_main.o cptvf_reqmanager.o cptvf_mbox.o cptvf_algs.o diff --git a/drivers/crypto/cavium/cpt/cptvf.h b/drivers/crypto/cavium/cpt/cptvf.h new file mode 100644 index 000..1cc04aa --- /dev/null +++ b/drivers/crypto/cavium/cpt/cptvf.h @@ -0,0 +1,135 @@ +/* + * Copyright (C) 2016 Cavium, Inc. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2 of the GNU General Public License + * as published by the Free Software Foundation. 
+ */ + +#ifndef __CPTVF_H +#define __CPTVF_H + +#include +#include "cpt_common.h" + +/* Default command queue length */ +#define CPT_CMD_QLEN 2046 +#define CPT_CMD_QCHUNK_SIZE 1023 + +/* Default command timeout in seconds */ +#define CPT_COMMAND_TIMEOUT 4 +#define CPT_TIMER_THOLD0x +#define CPT_NUM_QS_PER_VF 1 +#define CPT_INST_SIZE 64 +#define CPT_NEXT_CHUNK_PTR_SIZE 8 + +#defineCPT_VF_MSIX_VECTORS 2 +#define CPT_VF_INTR_MBOX_MASK BIT(0) +#define CPT_VF_INTR_DOVF_MASK BIT(1) +#define CPT_VF_INTR_IRDE_MASK BIT(2) +#define CPT_VF_INTR_NWRP_MASK BIT(3) +#define CPT_VF_INTR_SERR_MASK BIT(4) +#define DMA_DIRECT_DIRECT 0 /* Input DIRECT, Output DIRECT */ +#define DMA_GATHER_SCATTER 1 +#define FROM_DPTR 1 + +/** + * Enumeration cpt_vf_int_vec_e + * + * CPT VF MSI-X Vector Enumeration + * Enumerates the MSI-X interrupt vectors. + */ +enum cpt_vf_int_vec_e { + CPT_VF_INT_VEC_E_MISC = 0x00, + CPT_VF_INT_VEC_E_DONE = 0x01 +}; + +struct command_chunk { + u8 *head; + dma_addr_t dma_addr; + u32 size; /* Chunk size, max CPT_INST_CHUNK_MAX_SIZE */ + struct hlist_node nextchunk; +}; + +struct command_queue { + spinlock_t lock; /* command queue lock */ + u32 idx; /* Command queue host write idx */ + u32 nchunks; /* Number of command chunks */ + struct command_chunk *qhead;/* Command queue head, instructions +* are inserted here +*/ + struct hlist_head chead; +}; + +struct command_qinfo { + u32 cmd_size; + u32 qchunksize; /* Command queue chunk size */ + struct command_queue queue[CPT_NUM_QS_PER_VF]; +}; + +struct pending_entry { + u8 busy; /* Entry status (free/busy) */ + + volatile u64 *completion_addr; /* Completion address */ + void *post_arg; + void (*callback)(int, void *); /* Kernel ASYNC request callabck */ + void *callback_arg; /* Kernel ASYNC request callabck arg */ +}; + +struct pending_queue { + struct pending_entry *head; /* head of the queue */ + u32 front; /* Process work from here */ + u32 rear; /* Append new work here */ + atomic64_t pending_count; + spinlock_t 
lock; /* Queue lock */ +}; + +struct pending_qinfo { + u32 nr_queues; /* Number of queues supported */ + u32 qlen; /* Queue length */ + struct pending_queue queue[CPT_NUM_QS_PER_VF]; +}; + +#define for_each_pending_queue(qinfo, q, i)\ + for (i = 0, q = >queue[i]; i < qinfo->nr_queues; i++, \ +q = >queue[i]) + +struct cpt_vf { + u16 flags; /* Flags to hold device status bits */ + u8 vfid; /* Device Index 0...CPT_MAX_VF_NUM */ + u8 vftype; /* VF type of SE_TYPE(1) or AE_TYPE(1) */ + u8 vfgrp; /* VF group (0 - 8) */ + u8 node; /* Operating node: Bits (46:44) in BAR0 address */ + u8
[PATCH v3 1/3] drivers: crypto: Add Support for Octeon-tx CPT Engine
From: George CherianEnable the Physical Function diver for the Cavium Crypto Engine (CPT) found in Octeon-tx series of SoC's. CPT is the Cryptographic Acceleration Unit. CPT includes microcoded GigaCypher symmetric engines (SEs) and asymmetric engines (AEs). Signed-off-by: George Cherian Reviewed-by: David Daney --- drivers/crypto/cavium/cpt/Kconfig| 16 + drivers/crypto/cavium/cpt/Makefile | 2 + drivers/crypto/cavium/cpt/cpt_common.h | 158 +++ drivers/crypto/cavium/cpt/cpt_hw_types.h | 658 + drivers/crypto/cavium/cpt/cptpf.h| 69 +++ drivers/crypto/cavium/cpt/cptpf_main.c | 703 +++ drivers/crypto/cavium/cpt/cptpf_mbox.c | 163 +++ 7 files changed, 1769 insertions(+) create mode 100644 drivers/crypto/cavium/cpt/Kconfig create mode 100644 drivers/crypto/cavium/cpt/Makefile create mode 100644 drivers/crypto/cavium/cpt/cpt_common.h create mode 100644 drivers/crypto/cavium/cpt/cpt_hw_types.h create mode 100644 drivers/crypto/cavium/cpt/cptpf.h create mode 100644 drivers/crypto/cavium/cpt/cptpf_main.c create mode 100644 drivers/crypto/cavium/cpt/cptpf_mbox.c diff --git a/drivers/crypto/cavium/cpt/Kconfig b/drivers/crypto/cavium/cpt/Kconfig new file mode 100644 index 000..247f1cb --- /dev/null +++ b/drivers/crypto/cavium/cpt/Kconfig @@ -0,0 +1,16 @@ +# +# Cavium crypto device configuration +# + +config CRYPTO_DEV_CPT + tristate + +config CAVIUM_CPT + tristate "Cavium Cryptographic Accelerator driver" + depends on ARCH_THUNDER + select CRYPTO_DEV_CPT + help + Support for Cavium CPT block found in octeon-tx series of + processors. + + To compile this as a module, choose M here. 
diff --git a/drivers/crypto/cavium/cpt/Makefile b/drivers/crypto/cavium/cpt/Makefile new file mode 100644 index 000..fe3d454 --- /dev/null +++ b/drivers/crypto/cavium/cpt/Makefile @@ -0,0 +1,2 @@ +obj-$(CONFIG_CAVIUM_CPT) += cptpf.o +cptpf-objs := cptpf_main.o cptpf_mbox.o diff --git a/drivers/crypto/cavium/cpt/cpt_common.h b/drivers/crypto/cavium/cpt/cpt_common.h new file mode 100644 index 000..ede612f --- /dev/null +++ b/drivers/crypto/cavium/cpt/cpt_common.h @@ -0,0 +1,158 @@ +/* + * Copyright (C) 2016 Cavium, Inc. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2 of the GNU General Public License + * as published by the Free Software Foundation. + */ + +#ifndef __CPT_COMMON_H +#define __CPT_COMMON_H + +#include +#include +#include + +#include "cpt_hw_types.h" + +/* Device ID */ +#define CPT_81XX_PCI_PF_DEVICE_ID 0xa040 +#define CPT_81XX_PCI_VF_DEVICE_ID 0xa041 + +/* flags to indicate the features supported */ +#define CPT_FLAG_MSIX_ENABLED BIT(0) +#define CPT_FLAG_SRIOV_ENABLED BIT(1) +#define CPT_FLAG_VF_DRIVER BIT(2) +#define CPT_FLAG_DEVICE_READY BIT(3) + +#define cpt_msix_enabled(cpt) ((cpt)->flags & CPT_FLAG_MSIX_ENABLED) +#define cpt_sriov_enabled(cpt) ((cpt)->flags & CPT_FLAG_SRIOV_ENABLED) +#define cpt_vf_driver(cpt) ((cpt)->flags & CPT_FLAG_VF_DRIVER) +#define cpt_device_ready(cpt) ((cpt)->flags & CPT_FLAG_DEVICE_READY) + +#define CPT_MBOX_MSG_TYPE_ACK 1 +#define CPT_MBOX_MSG_TYPE_NACK 2 +#define CPT_MBOX_MSG_TIMEOUT 2000 +#define VF_STATE_DOWN 0 +#define VF_STATE_UP 1 + +/* + * CPT Registers map for 81xx + */ + +/* PF registers */ +#define CPTX_PF_CONSTANTS(a) (0x0ll + ((u64)(a) << 36)) +#define CPTX_PF_RESET(a) (0x100ll + ((u64)(a) << 36)) +#define CPTX_PF_DIAG(a) (0x120ll + ((u64)(a) << 36)) +#define CPTX_PF_BIST_STATUS(a) (0x160ll + ((u64)(a) << 36)) +#define CPTX_PF_ECC0_CTL(a) (0x200ll + ((u64)(a) << 36)) +#define CPTX_PF_ECC0_FLIP(a) (0x210ll + ((u64)(a) << 36)) +#define 
CPTX_PF_ECC0_INT(a) (0x220ll + ((u64)(a) << 36)) +#define CPTX_PF_ECC0_INT_W1S(a) (0x230ll + ((u64)(a) << 36)) +#define CPTX_PF_ECC0_ENA_W1S(a)(0x240ll + ((u64)(a) << 36)) +#define CPTX_PF_ECC0_ENA_W1C(a)(0x250ll + ((u64)(a) << 36)) +#define CPTX_PF_MBOX_INTX(a, b)\ + (0x400ll + ((u64)(a) << 36) + ((b) << 3)) +#define CPTX_PF_MBOX_INT_W1SX(a, b) \ + (0x420ll + ((u64)(a) << 36) + ((b) << 3)) +#define CPTX_PF_MBOX_ENA_W1CX(a, b) \ + (0x440ll + ((u64)(a) << 36) + ((b) << 3)) +#define CPTX_PF_MBOX_ENA_W1SX(a, b) \ + (0x460ll + ((u64)(a) << 36) + ((b) << 3)) +#define CPTX_PF_EXEC_INT(a) (0x500ll + 0x10ll * ((a) & 0x1)) +#define CPTX_PF_EXEC_INT_W1S(a)(0x520ll + ((u64)(a) << 36)) +#define CPTX_PF_EXEC_ENA_W1C(a)(0x540ll + ((u64)(a) << 36)) +#define CPTX_PF_EXEC_ENA_W1S(a)(0x560ll + ((u64)(a) << 36)) +#define CPTX_PF_GX_EN(a, b) \ + (0x600ll + ((u64)(a) << 36) + ((b) << 3)) +#define CPTX_PF_EXEC_INFO(a) (0x700ll + ((u64)(a) << 36)) +#define CPTX_PF_EXEC_BUSY(a) (0x800ll + ((u64)(a) << 36)) +#define CPTX_PF_EXEC_INFO0(a) (0x900ll + ((u64)(a) << 36))
[PATCH v3 0/3] Add Support for Cavium Cryptographic Acceleration Unit
From: George Cherian. This series adds support for the Cavium Cryptographic Acceleration Unit (CPT). CPT is available in Cavium's Octeon-Tx SoC series. The series was tested with ecryptfs and dm-crypt for in-kernel cryptographic offload operations. This driver needs firmware to work; I will be sending the firmware to linux-firmware once the driver is accepted. Changes v2 -> v3 -- Addressed David Daney's comments - There is not much difference in performance between readq/writeq and readq_relaxed/writeq_relaxed, so switching to the readq/writeq variant. - Removed the useless bitfield definitions. - Use GENMASK and dev_to_node() instead of custom functions. - Use module_pci_driver instead of module_init/exit. Changes v1 -> v2 -- Addressed a crash issue when more gather components are passed. -- Redo the cptvf request manager. - Get rid of the unnecessary buffer copies. -- s/uint*_t/u* -- Remove unwanted macro definitions -- Remove the redundant ROUNDUP* macros and use the kernel function -- Select the proper config option in the Kconfig file. 
-- Removed some of the unwanted header file inclusions -- Miscellaneous cleanup George Cherian (3): drivers: crypto: Add Support for Octeon-tx CPT Engine drivers: crypto: Add the Virtual Function driver for CPT drivers: crypto: Enable CPT options crypto for build drivers/crypto/Kconfig | 1 + drivers/crypto/Makefile | 1 + drivers/crypto/cavium/cpt/Kconfig| 16 + drivers/crypto/cavium/cpt/Makefile | 3 + drivers/crypto/cavium/cpt/cpt_common.h | 158 + drivers/crypto/cavium/cpt/cpt_hw_types.h | 658 +++ drivers/crypto/cavium/cpt/cptpf.h| 69 ++ drivers/crypto/cavium/cpt/cptpf_main.c | 703 drivers/crypto/cavium/cpt/cptpf_mbox.c | 163 + drivers/crypto/cavium/cpt/cptvf.h| 135 drivers/crypto/cavium/cpt/cptvf_algs.c | 424 drivers/crypto/cavium/cpt/cptvf_algs.h | 110 drivers/crypto/cavium/cpt/cptvf_main.c | 945 +++ drivers/crypto/cavium/cpt/cptvf_mbox.c | 205 ++ drivers/crypto/cavium/cpt/cptvf_reqmanager.c | 581 drivers/crypto/cavium/cpt/request_manager.h | 147 + 16 files changed, 4319 insertions(+) create mode 100644 drivers/crypto/cavium/cpt/Kconfig create mode 100644 drivers/crypto/cavium/cpt/Makefile create mode 100644 drivers/crypto/cavium/cpt/cpt_common.h create mode 100644 drivers/crypto/cavium/cpt/cpt_hw_types.h create mode 100644 drivers/crypto/cavium/cpt/cptpf.h create mode 100644 drivers/crypto/cavium/cpt/cptpf_main.c create mode 100644 drivers/crypto/cavium/cpt/cptpf_mbox.c create mode 100644 drivers/crypto/cavium/cpt/cptvf.h create mode 100644 drivers/crypto/cavium/cpt/cptvf_algs.c create mode 100644 drivers/crypto/cavium/cpt/cptvf_algs.h create mode 100644 drivers/crypto/cavium/cpt/cptvf_main.c create mode 100644 drivers/crypto/cavium/cpt/cptvf_mbox.c create mode 100644 drivers/crypto/cavium/cpt/cptvf_reqmanager.c create mode 100644 drivers/crypto/cavium/cpt/request_manager.h -- 2.1.4