Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5

2016-12-21 Thread Jason A. Donenfeld
Hi Ted,

On Thu, Dec 22, 2016 at 6:41 AM, Theodore Ts'o  wrote:
> The bottom line is that I think we're really "pixel peeping" at this
> point --- which is what obsessed digital photographers will do when
> debating the quality of a Canon vs Nikon DSLR by blowing up a photo by
> a thousand times, and then trying to claim that this is visible to the
> human eye.  Or people who obsess over the frequency response curves
> of TH-X00 headphones with Mahogany vs Purpleheart wood, when it's
> likely that in a blind head-to-head comparison, most people wouldn't
> be able to tell the difference.

This is hilarious, thanks for the laugh. I believe you're right about this...

>
> I think the main argument for using the batched getrandom approach is
> that it is, I would argue, simpler than introducing siphash into the
> picture.  On 64-bit platforms it is faster and more consistent, so
> it's basically that versus the complexity of adding siphash to the
> list of things that people have to analyze when considering random
> number security on Linux.  But it's a close call either way, I think.

I find this compelling. We'll have one csprng for both
get_random_int/long and for /dev/urandom, and we'll be able to update
that silly warning in the comment on get_random_int/long to read
"gives output of either rdrand quality or of /dev/urandom quality",
which makes it more useful for other things. It introduces less
error-prone code, and it lets the RNG analysis be spent on just one
RNG, not two.

So, with your blessing, I'm going to move ahead with implementing a
pretty version of this for v8.

Regards,
Jason


Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)

2016-12-21 Thread Andy Lutomirski
On Wed, Dec 21, 2016 at 9:01 PM, George Spelvin
 wrote:
> Andy Lutomirski wrote:
>> I don't even think it needs that.  This is just adding a
>> non-destructive final operation, right?
>
> It is, but the problem is that SipHash is intended for *small* inputs,
> so the standard implementations aren't broken into init/update/final
> functions.
>
> There's just one big function that keeps the state variables in
> registers and never stores them anywhere.
>
> If we *had* init/update/final functions, then it would be trivial.
>
>> Just to clarify, if we replace SipHash with a black box, I think this
>> effectively means, where "entropy" is random_get_entropy() || jiffies
>> || current->pid:
>
>> The first call returns H(random seed || entropy_0 || secret).  The
>> second call returns H(random seed || entropy_0 || secret || entropy_1
>> || secret).  Etc.
>
> Basically, yes.  I was skipping the padding byte and keying the
> finalization rounds on the grounds of "can't hurt and might help",
> but we could do it a more standard way.
>
>> If not, then I have a fairly strong preference to keep whatever
>> construction we come up with consistent with something that could
>> actually happen with invocations of unmodified SipHash -- then all the
>> security analysis on SipHash goes through.
>
> Okay.  I don't think it makes a difference, but it's not a *big* waste
> of time.  If we have finalization rounds, we can reduce the secret
> to 128 bits.
>
> If we include the padding byte, we can do one of two things:
> 1) Make the secret 184 bits, to fill up the final partial word as
>much as possible, or
> 2) Make the entropy 1 byte smaller and conceptually misalign the
>secret.  What we'd actually do is remove the last byte of
>the secret and include it in the entropy words, but that's
>just a rotation of the secret between storage and hashing.
>
> Also, I assume you'd like SipHash-2-4, since you want to rely
> on a security analysis.

I haven't looked, but I assume that the analysis at least thought
about reduced rounds, so maybe other variants are okay.

>> The one thing I don't like is
>> that I don't see how to prove that you can't run it backwards if you
>> manage to acquire a memory dump.  In fact, I think that there exist, at
>> least in theory, hash functions that are secure in the random oracle
>> model but that *can* be run backwards given the full state.  From
>> memory, SHA-3 has exactly that property, and it would be a bit sad for
>> a CSPRNG to be reversible.
>
> Er...  get_random_int() is specifically *not* designed to be resistant
> to state capture, and I didn't try.  Remember, what it's used for
> is ASLR, what we're worried about is someone learning the layouts
> of still-running processes, and if you get a memory dump, you have
> the memory layout!

True, but it's called get_random_int(), and it seems like making it
stronger, especially if the performance cost is low to zero, is a good
thing.

>
> If you want anti-backtracking, though, it's easy to add.  What we
> hash is:
>
> entropy_0 || secret || output_0 || entropy_1 || secret || output_1 || ...
>
> You mix the output word right back in to the (unfinalized) state after
> generating it.  This is still equivalent to unmodified black-box SipHash,
> you're just using a (conceptually independent) SipHash invocation to
> produce some of its input.

Ah, cute.  This could probably be sped up by doing something like:

entropy_0 || secret || output_0 ^ entropy_1 || secret || ...

It's a little weak because the output is only 64 bits, so you could
plausibly backtrack it on a GPU or FPGA cluster or on an ASIC if the
old entropy is guessable.  I suspect there are sneaky ways around it
like using output_n-1 ^ output_n-2 or similar.  I'll sleep on it.

>
> The only remaining issues are:
> 1) How many rounds, and
> 2) May we use HalfSipHash?

I haven't looked closely enough to have a real opinion here.  I don't
know what the security margin is believed to be.

>
> I'd *like* to persuade you that skipping the padding byte wouldn't
> invalidate any security proofs, because it's true and would simplify
> the code.  But if you want 100% stock, I'm willing to cater to that.

I lean toward stock in the absence of a particularly good reason.  At
the very least I'd want to read that paper carefully.

>
> Ted, what do you think?



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5

2016-12-21 Thread Theodore Ts'o
On Thu, Dec 22, 2016 at 03:49:39AM +0100, Jason A. Donenfeld wrote:
> 
> Funny -- while you guys were sending this back & forth, I was writing
> my reply to Andy which essentially arrives at the same conclusion.
> Given that we're all arriving to the same thing, and that Ted shot in
> this direction long before we all did, I'm leaning toward abandoning
> SipHash for the de-MD5-ification of get_random_int/long, and working
> on polishing Ted's idea into something shiny for this patchset.

Here are my numbers comparing siphash (using the first three patches
of the v7 siphash patches) with my batched chacha20 implementation.
The results are taken by running get_random_* in a loop (NUM_CYCLES
iterations in the benchmark code below) and then dividing by the
iteration count to get the average number of cycles per call.  I
compiled 32-bit and 64-bit kernels, and ran the tests using kvm:

                    siphash                     batched chacha20
        get_random_int  get_random_long  get_random_int  get_random_long

32-bit       270             278              114             146
64-bit        75              75              106             186

> I did have two objections to this. The first was that my SipHash
> construction is faster.

Well, it's faster on everything except 32-bit x86.  :-P

> The second, and the more
> important one, was that batching entropy up like this means that 32
> calls will be really fast, and then the 33rd will be slow, since it
> has to do a whole ChaCha round, because get_random_bytes must be
> called to refill the batch.

... and this will take 2121 cycles on 64-bit x86, and 2315 cycles on a
32-bit x86 --- which, on a 2.3 GHz processor, is just under a
microsecond.  As far as being inconsistent on process startup, I very
much doubt a microsecond is really going to be visible to the user.  :-)

The bottom line is that I think we're really "pixel peeping" at this
point --- which is what obsessed digital photographers will do when
debating the quality of a Canon vs Nikon DSLR by blowing up a photo by
a thousand times, and then trying to claim that this is visible to the
human eye.  Or people who obsess over the frequency response curves
of TH-X00 headphones with Mahogany vs Purpleheart wood, when it's
likely that in a blind head-to-head comparison, most people wouldn't
be able to tell the difference.

I think the main argument for using the batched getrandom approach is
that it is, I would argue, simpler than introducing siphash into the
picture.  On 64-bit platforms it is faster and more consistent, so
it's basically that versus the complexity of adding siphash to the
list of things that people have to analyze when considering random
number security on Linux.  But it's a close call either way, I think.

  - Ted

P.S.  My benchmarking code

diff --git a/drivers/char/random.c b/drivers/char/random.c
index a51f0ff43f00..41860864b775 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1682,6 +1682,55 @@ static int rand_initialize(void)
 }
 early_initcall(rand_initialize);
 
+static unsigned int get_random_int_new(void);
+static unsigned long get_random_long_new(void);
+
+#define NUM_CYCLES 1
+#define AVG(finish, start) ((unsigned int)(finish - start + NUM_CYCLES/2) / NUM_CYCLES)
+
+static int rand_benchmark(void)
+{
+   cycles_t start, finish;
+   int i, out;
+
+   pr_crit("random benchmark!!\n");
+   start = get_cycles();
+   for (i = 0; i < NUM_CYCLES; i++) {
+   get_random_int();
+   }
+   finish = get_cycles();
+   pr_err("get_random_int # cycles: %u\n", AVG(finish, start));
+
+   start = get_cycles();
+   for (i = 0; i < NUM_CYCLES; i++) {
+   get_random_int_new();
+   }
+   finish = get_cycles();
+   pr_err("get_random_int_new (batched chacha20) # cycles: %u\n", 
AVG(finish, start));
+
+   start = get_cycles();
+   for (i = 0; i < NUM_CYCLES; i++) {
+   get_random_long();
+   }
+   finish = get_cycles();
+   pr_err("get_random_long # cycles: %u\n", AVG(finish, start));
+
+   start = get_cycles();
+   for (i = 0; i < NUM_CYCLES; i++) {
+   get_random_long_new();
+   }
+   finish = get_cycles();
+   pr_err("get_random_long_new (batched chacha20) # cycles: %u\n", 
AVG(finish, start));
+
+   start = get_cycles();
+   for (i = 0; i < NUM_CYCLES; i++) {
+   get_random_bytes(&out, sizeof(out));
+   }
+   finish = get_cycles();
+   pr_err("get_random_bytes # cycles: %u\n", AVG(finish, start));
+   return 0;
+}
+device_initcall(rand_benchmark);
+
 #ifdef CONFIG_BLOCK
 void rand_initialize_disk(struct gendisk *disk)
 {
@@ -2064,8 +2113,10 @@ unsigned int get_random_int(void)
unsigned int ret;
u64 *chaining;
 
+#if 0  // force slow path
if (arch_get_random_int(&ret))
return ret;
+#endif
 
chaining = &get_cpu_var(get_random_int_chaining);
ret = *chaining = 

Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)

2016-12-21 Thread George Spelvin
Andy Lutomirski wrote:
> I don't even think it needs that.  This is just adding a
> non-destructive final operation, right?

It is, but the problem is that SipHash is intended for *small* inputs,
so the standard implementations aren't broken into init/update/final
functions.

There's just one big function that keeps the state variables in
registers and never stores them anywhere.

If we *had* init/update/final functions, then it would be trivial.

> Just to clarify, if we replace SipHash with a black box, I think this
> effectively means, where "entropy" is random_get_entropy() || jiffies
> || current->pid:

> The first call returns H(random seed || entropy_0 || secret).  The
> second call returns H(random seed || entropy_0 || secret || entropy_1
> || secret).  Etc.

Basically, yes.  I was skipping the padding byte and keying the
finalization rounds on the grounds of "can't hurt and might help",
but we could do it a more standard way.

> If not, then I have a fairly strong preference to keep whatever
> construction we come up with consistent with something that could
> actually happen with invocations of unmodified SipHash -- then all the
> security analysis on SipHash goes through.

Okay.  I don't think it makes a difference, but it's not a *big* waste
of time.  If we have finalization rounds, we can reduce the secret
to 128 bits.

If we include the padding byte, we can do one of two things:
1) Make the secret 184 bits, to fill up the final partial word as
   much as possible, or
2) Make the entropy 1 byte smaller and conceptually misalign the
   secret.  What we'd actually do is remove the last byte of
   the secret and include it in the entropy words, but that's
   just a rotation of the secret between storage and hashing.

Also, I assume you'd like SipHash-2-4, since you want to rely
on a security analysis.

(Regarding the padding byte, getting it right might be annoying
to do exactly.  All of the security analysis depends *only* on
its low 3 bits indicating how much of the final block is used.
As it says in the SipHash paper, they included 8 bits just because
it was easy.  But if you want it exact, it's just one more byte of
state.)
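
For illustration, here's how that byte enters the final message word,
per the SipHash paper (this helper is a sketch, not from any patch in
this series):

static u64 siphash_last_word(const u8 *tail, size_t left, size_t total_len)
{
	u64 b = (u64)(total_len & 0xff) << 56;	/* the padding byte */
	size_t i;

	/* left < 8: whatever bytes remain of the final block */
	for (i = 0; i < left; i++)
		b |= (u64)tail[i] << (8 * i);
	return b;
}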

> The one thing I don't like is
> that I don't see how to prove that you can't run it backwards if you
> manage to acquire a memory dump.  In fact, I think that there exist, at
> least in theory, hash functions that are secure in the random oracle
> model but that *can* be run backwards given the full state.  From
> memory, SHA-3 has exactly that property, and it would be a bit sad for
> a CSPRNG to be reversible.

Er...  get_random_int() is specifically *not* designed to be resistant
to state capture, and I didn't try.  Remember, what it's used for
is ASLR, what we're worried about is someone learning the layouts
of still-running processes, and if you get a memory dump, you have
the memory layout!

If you want anti-backtracking, though, it's easy to add.  What we
hash is:

entropy_0 || secret || output_0 || entropy_1 || secret || output_1 || ...

You mix the output word right back in to the (unfinalized) state after
generating it.  This is still equivalent to unmodified black-box SipHash,
you're just using a (conceptually independent) SipHash invocation to
produce some of its input.

Each output is produced by copying the state, padding & finalizing after the
secret.


In fact, to make our lives easier, let's define the secret to end with
a counter byte that happens to be equal to the padding byte.  The input
stream will be:

Previous output: 8 (or 4 for HalfSipHash) bytes
Entropy: 15 bytes (8 bytes timer, 4 bytes jiffies, 3 bytes pid)
Secret: 16 bytes
Counter: 1 byte
...repeat...
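
A sketch of that per-call flow, assuming hypothetical incremental
siphash_init()/siphash_update()/siphash_final() helpers (which, as
noted above, the standard implementations don't provide):

struct rng_state {
	struct siphash_state hash;	/* hypothetical running state */
	u64 last_output;
};

static u64 get_random_u64_sketch(struct rng_state *s,
				 const u8 secret[17],	/* 16-byte secret + counter byte */
				 const u8 entropy[15])	/* timer + jiffies + pid */
{
	struct siphash_state copy;
	u64 out;

	/* absorb: previous output || entropy || secret || counter */
	siphash_update(&s->hash, &s->last_output, sizeof(s->last_output));
	siphash_update(&s->hash, entropy, 15);
	siphash_update(&s->hash, secret, 17);

	/* pad and finalize a *copy*, so the running state is preserved */
	copy = s->hash;
	out = siphash_final(&copy);

	s->last_output = out;	/* mixed right back in on the next call */
	return out;
}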

> We could also periodically mix in a big (128-bit?) chunk of fresh
> urandom output to keep the bad guys guessing.

Simpler and faster to just update the global master secret.
The state is per-CPU, so mixing in has to be repeated per CPU.


With these changes, I'm satisfied that it's secure, cheap, has a
sufficiently wide state size, *and* all standard SipHash analysis applies.

The only remaining issues are:
1) How many rounds, and
2) May we use HalfSipHash?

I'd *like* to persuade you that skipping the padding byte wouldn't
invalidate any security proofs, because it's true and would simplify
the code.  But if you want 100% stock, I'm willing to cater to that.

Ted, what do you think?


Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage

2016-12-21 Thread Jason A. Donenfeld
Hi George,

On Thu, Dec 22, 2016 at 4:55 AM, George Spelvin
 wrote:
> Do we have to go through this?  No, the benchmark was *not* bogus.
> Then I replaced the kernel #includes with the necessary typedefs
> and #defines to make it compile in user-space.
> * I didn't iterate 100K times, I timed the functions *once*.
> * I saved the times in a buffer and printed them all at the end
>   so printf() wouldn't pollute the caches.
> * Before every even-numbered iteration, I flushed the I-cache
>   of everything from _init to _fini (i.e. all the non-library code).
>   This cold-cache case is what is going to happen in the kernel.

Wow! Great. Thanks for the pointers on the right way to do this. Very
helpful, and enlightening results indeed. Think you could send me the
whole .c of what you finally came up with? I'd like to run this on
some other architectures; I've got a few odd boxes laying around here.

> The P4 results were:
> SipHash actually wins slightly in the cold-cache case, because
> it iterates more.  In the hot-cache case, it loses horribly.
> Core 2 duo:
> Pretty much a tie, honestly.
> Ivy Bridge:
> Modern processors *hate* cold caches.  But notice how md5 is *faster*
> than SipHash on hot-cache IPv6.
> Ivy Bridge, -m64:
> Of course, when you compile -m64, SipHash is unbeatable.

Okay, so I think these results are consistent with some of the
assessments from before -- that SipHash is really just fine as a
replacement for MD5. Not great on older 32-bit x86, but not too
horrible, and the performance improvements on every other architecture
and the security improvements everywhere are a net good.

> Here's the modified benchmark() code.  The entire package is
> a bit voluminous for the mailing list, but anyone is welcome to it.

Please do send! I'm sure I'll learn from reading it. Thanks again for
doing the hardwork of putting something proper together.

Thanks,
Jason


Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage

2016-12-21 Thread George Spelvin
> Plus the benchmark was bogus anyway, and when I built a more specific
> harness -- actually comparing the TCP sequence number functions --
> SipHash was faster than MD5, even on register starved x86. So I think
> we're fine and this chapter of the discussion can come to a close, in
> order to move on to more interesting things.

Do we have to go through this?  No, the benchmark was *not* bogus.

Here's my results from *your* benchmark.  I can't reboot some of my test
machines, so I took net/core/secure_seq.c, lib/siphash.c, lib/md5.c and
include/linux/siphash.h straight out of your test tree.

Then I replaced the kernel #includes with the necessary typedefs
and #defines to make it compile in user-space.  (Voluminous but
straightforward.)  E.g.

#define __aligned(x) __attribute__((__aligned__(x)))
#define cacheline_aligned __aligned(64)
#define CONFIG_INET 1
#define IS_ENABLED(x) 1
#define ktime_get_real_ns() 0
#define sysctl_tcp_timestamps 0

... etc.

Then I modified your benchmark code into the appended code.  The
differences are:
* I didn't iterate 100K times, I timed the functions *once*.
* I saved the times in a buffer and printed them all at the end
  so printf() wouldn't pollute the caches.
* Before every even-numbered iteration, I flushed the I-cache
  of everything from _init to _fini (i.e. all the non-library code).
  This cold-cache case is what is going to happen in the kernel.

In the results below, note that I did *not* re-flush between phases
of the test.  The effects of caching are clearly apparent in the tcpv4
results, where the tcpv6 code loaded the cache.

You can also see that the SipHash code benefits more from caching when
entered with a cold cache, as it iterates over the input words, while
the MD5 code is one big unrolled blob.

Order of computation is down the columns first, across second.

The P4 results were:
tcpv6 md5 cold:     4084 3488 3584 3584 3568
tcpv4 md5 cold:     1052  996  996 1060  996
tcpv6 siphash cold: 4080 3296 3312 3296 3312
tcpv4 siphash cold: 2968 2748 2972 2716 2716
tcpv6 md5 hot:       900  712  712  712  712
tcpv4 md5 hot:       632  672  672  672  672
tcpv6 siphash hot:  2484 2292 2340 2340 2340
tcpv4 siphash hot:  1660 1560 1564 2340 1564

SipHash actually wins slightly in the cold-cache case, because
it iterates more.  In the hot-cache case, it loses horribly.

Core 2 duo:
tcpv6 md5 cold:     3396 2868 2964 3012 2832
tcpv4 md5 cold:     1368 1044 1320 1332 1308
tcpv6 siphash cold: 2940 2952 2916 2448 2604
tcpv4 siphash cold: 3192 2988 3576 3504 3624
tcpv6 md5 hot:      1116 1032  996 1008 1008
tcpv4 md5 hot:       936  936  936  936  936
tcpv6 siphash hot:  1200 1236 1236 1188 1188
tcpv4 siphash hot:   936  804  804  804  804

Pretty much a tie, honestly.

Ivy Bridge:
tcpv6 md5 cold:     6086 6136 6962 6358 6060
tcpv4 md5 cold:      816  732 1046 1054 1012
tcpv6 siphash cold: 3756 1886 2152 2390 2566
tcpv4 siphash cold: 3264 2108 3026 3120 3526
tcpv6 md5 hot:      1062  808  824  824  832
tcpv4 md5 hot:       730  730  740  748  748
tcpv6 siphash hot:   960  952  936 1112  926
tcpv4 siphash hot:   638  544  562  552  560

Modern processors *hate* cold caches.  But notice how md5 is *faster*
than SipHash on hot-cache IPv6.

Ivy Bridge, -m64:
tcpv6 md5 cold:     4680 3672 3956 3616 3525
tcpv4 md5 cold:     1066 1416 1179 1179 1134
tcpv6 siphash cold:  940 1258 1995 1609 2255
tcpv4 siphash cold: 1440 1269 1292 1870 1621
tcpv6 md5 hot:      1372 1122 1088 1088
tcpv4 md5 hot:       997  997  997  997  998
tcpv6 siphash hot:   340  340  340  352  340
tcpv4 siphash hot:   227  238  238  238  238

Of course, when you compile -m64, SipHash is unbeatable.


Here's the modified benchmark() code.  The entire package is
a bit voluminous for the mailing list, but anyone is welcome to it.

static void clflush(void)
{
extern char const _init, _fini;
char const *p = &_init;

while (p < &_fini) {
asm("clflush %0" : : "m" (*p));
p += 64;
}
}

typedef uint32_t cycles_t;
static cycles_t get_cycles(void)
{
uint32_t eax, edx;
asm volatile("rdtsc" : "=a" (eax), "=d" (edx));
return eax;
}

static int benchmark(void)
{
cycles_t start, finish;
int i;
u32 seq_number = 0;
__be32 saddr6[4] = { 1, 4, 182, 393 }, daddr6[4] = { 9192, 18288, 222, 0xff10 };
__be32 saddr4 = 2, daddr4 = 182112;
__be16 sport = 22, dport = 41992;

Re: [PATCH v7 3/6] random: use SipHash in place of MD5

2016-12-21 Thread Jason A. Donenfeld
On Thu, Dec 22, 2016 at 3:49 AM, Jason A. Donenfeld  wrote:
> I did have two objections to this. The first was that my SipHash
> construction is faster. But in any case, they're both faster than the
> current MD5, so it's just extra rice. The second, and the more
> important one, was that batching entropy up like this means that 32
> calls will be really fast, and then the 33rd will be slow, since it
> has to do a whole ChaCha round, because get_random_bytes must be
> called to refill the batch. Since get_random_long is called for every
> process startup, I didn't really like there being inconsistent
> performance on process startup. And I'm pretty sure that one ChaCha
> whole block is slower than computing MD5, even though it lasts 32
> times as long, though I need to measure this. But maybe that's dumb in
> the end? Are these concerns that should point us toward the
> determinism (and speed) of SipHash? Are these concerns that don't
> matter and so we should roll with the simplicity of reusing ChaCha?

I ran some measurements in order to quantify what I'm talking about.
Repeatedly running md5_transform is about 2.3 times faster than
repeatedly running extract_crng. What does this mean?

One call to extract_crng gives us 32 times as many longs as one call
to md5_transform. This means that, spread over 32 process creations,
chacha will be 13.9 times faster (32 / 2.3 ≈ 13.9). However, every
32nd process will take 2.3 times as long to generate its ASLR value
as it would with the old md5_transform code.

Personally, I don't think that factor of 2.3 is a big deal. And I
really like how much this simplifies the analysis.
But if it's a big deal to you, then we can continue to discuss my
SipHash construction, which gives faster and more consistent
performance, at the cost of a more complicated and probably less
impressive security analysis.

Jason


Re: [PATCH v7 3/6] random: use SipHash in place of MD5

2016-12-21 Thread Jason A. Donenfeld
Hi Andy & Hannes,

On Thu, Dec 22, 2016 at 3:07 AM, Hannes Frederic Sowa
 wrote:
> I wonder if Ted's proposal was analyzed further in terms of performance,
> should get_random_int provide CPRNG-like properties?
>
> For reference: https://lkml.org/lkml/2016/12/14/351
>
> The proposal made sense to me and would completely solve the
> above-mentioned problem at the cost of repeatedly reseeding from the crng.

On Thu, Dec 22, 2016 at 3:09 AM, Andy Lutomirski  wrote:
> Unless I've misunderstood it, Ted's proposal causes get_random_int()
> to return bytes straight from urandom (effectively), which should make
> it very strong.  And if urandom is competitively fast now, I don't see
> the problem.  ChaCha20 is designed for speed, after all.

Funny -- while you guys were sending this back & forth, I was writing
my reply to Andy which essentially arrives at the same conclusion.
Given that we're all arriving to the same thing, and that Ted shot in
this direction long before we all did, I'm leaning toward abandoning
SipHash for the de-MD5-ification of get_random_int/long, and working
on polishing Ted's idea into something shiny for this patchset.

I did have two objections to this. The first was that my SipHash
construction is faster. But in any case, they're both faster than the
current MD5, so it's just extra rice. The second, and the more
important one, was that batching entropy up like this means that 32
calls will be really fast, and then the 33rd will be slow, since it
has to do a whole ChaCha round, because get_random_bytes must be
called to refill the batch. Since get_random_long is called for every
process startup, I didn't really like there being inconsistent
performance on process startup. And I'm pretty sure that one whole
ChaCha block is slower than computing MD5, even though it lasts 32
times as long, though I need to measure this. But maybe that's dumb in
the end? Are these concerns that should point us toward the
determinism (and speed) of SipHash? Are these concerns that don't
matter and so we should roll with the simplicity of reusing ChaCha?

Jason


Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)

2016-12-21 Thread Jason A. Donenfeld
> On Wed, Dec 21, 2016 at 5:13 PM, George Spelvin
>> After some thinking, I still like the "state-preserving" construct
>> that's equivalent to the current MD5 code.  Yes, we could just do
>> siphash(current_cpu || per_cpu_counter, global_key), but it's nice to
>> preserve a bit more.
>>
>> It requires library support from the SipHash code to return the full
>> SipHash state, but I hope that's a fair thing to ask for.

This is not a good idea. If I understand correctly, the idea here is
to just keep around SipHash's internal state variables, and chain them
over to the next call, sort of like how md5_transform with the current
code works on the same scratch space. There has been no security
analysis in the literature on this use of the primitive, and I have no
confidence that this is a secure use of the function. Unless somebody
can point me toward a paper I missed or a comment from a real
cryptographer about the specifics of SipHash, I think I'm right to
admonish against this dangerous road.

Let's talk about constructions. And let's only decide on a
construction that we're actually equipped to analyze. Let's definitely
not talk about making our own primitives, or retrofitting a nice
primitive's internals into our own Frankenstein.

Alternatively, if I'm wrong, please send me an eprint/arXiv link to a
paper that discusses this use of SipHash.


Re: [PATCH v7 3/6] random: use SipHash in place of MD5

2016-12-21 Thread Jason A. Donenfeld
Hi Andy,

On Thu, Dec 22, 2016 at 12:42 AM, Andy Lutomirski  wrote:
> So this is probably good enough, and making it better is hard.  Changing it 
> to:
>
> u64 entropy = (u64)random_get_entropy() + current->pid;
> result = siphash(..., entropy, ...);
> secret->chaining += result + entropy;
>
> would reduce this problem by forcing an attacker to brute-force the
> entropy on each iteration, which is probably an improvement.

Ahh, so that's the reasoning behind a similar suggestion of yours in a
previous email. Makes sense to me. I'll include this in the next merge
if we don't come up with a different idea before then. Your reasoning
seems good for it.

Part of what makes this process a bit goofy is that it's not
altogether clear what the design goals are. Right now we're going for
"not worse than before", which we've nearly achieved. How good of an
RNG do we want? I'm willing to examine and analyze the security and
performance of all constructions we can come up with. One thing I
don't want to do, however, is start tweaking the primitives themselves
in ways not endorsed by the designers. So, I believe that precludes
things like carrying over SipHash's internal state (like what was done
with MD5), because there hasn't been a formal security analysis of
this like there has with other uses of SipHash. I also don't want to
change any internals of how SipHash actually works. I mention that
because of some of the suggestions on other threads, which make me
rather uneasy.

So with that said, while writing this reply to you, I was
simultaneously reading some other crypto code and was reminded that
there's a variant of SipHash which outputs an additional 64 bits; it's
part of the siphash reference code, which they call the "128-bit
mode". It has the benefit that we can return 64-bits to the caller and
save 64-bits for the chaining key. That way there's no correlation
between the returned secret and the chaining key, which I think would
completely alleviate all of your concerns, and simplify the analysis a
bit.

Here's what it looks like:
https://git.zx2c4.com/linux-dev/commit/?h=siphash=46fbe5b408e66b2d16b4447860f8083480e1c08d

The downside is that it takes 4 extra Sip rounds. This puts the
performance still better than MD5, though, and likely still better
than the other batched entropy solution. We could optimize this, I
suppose, by giving it only two parameters -- chaining,
jiffies+entropy+pid -- instead of the current three -- chaining,
jiffies, entropy+pid -- which would then shave off 2 Sip rounds. But I
liked the idea of having a bit more spread in the entropy input field.
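
Sketching the difference (siphash_3u64() as used in the v7 patch;
siphash_2u64() assumed to be the analogous two-word helper):

/* current shape: three 64-bit inputs, more spread, more SipRounds */
result = siphash_3u64(secret->chaining, jiffies,
		      (u64)random_get_entropy() + current->pid,
		      secret->secret);

/* two-input shape: two SipRounds fewer, less spread in the entropy */
result = siphash_2u64(secret->chaining,
		      jiffies + (u64)random_get_entropy() + current->pid,
		      secret->secret);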

Anyway, with this in mind, we now have three possibilities:

1. result = siphash(chaining, entropy, key); chaining += result + entropy
2. result = siphash_extra_output(chaining, entropy, key, );
3. Ted's batched entropy idea using chacha20

The more I think about this, the more I suspect that we should just
use chacha20. It will still be faster than MD5. I don't like the
non-determinism of it (some processes will start slower than others,
if the batched entropy has run out and ASLR demands more), but I guess
I can live with that. But, most importantly, it greatly simplifies
both the security analysis and what we can promise to callers about
the function. Right now in the comment documentation, we're coy with
callers about the security of the RNG. If we moved to a known
construction like chacha20/get_random_bytes_batched, then we could
just be straight up with a promise that the numbers it returns are
high quality.

Thoughts on 2 and 3, and on 1 vs 2 vs 3?

Jason


Re: [PATCH v7 3/6] random: use SipHash in place of MD5

2016-12-21 Thread Andy Lutomirski
On Wed, Dec 21, 2016 at 6:07 PM, Hannes Frederic Sowa
 wrote:
> On 22.12.2016 00:42, Andy Lutomirski wrote:
>> On Wed, Dec 21, 2016 at 3:02 PM, Jason A. Donenfeld  wrote:
>>>  unsigned int get_random_int(void)
>>>  {
>>> -   __u32 *hash;
>>> -   unsigned int ret;
>>> -
>>> -   if (arch_get_random_int(&ret))
>>> -   return ret;
>>> -
>>> -   hash = get_cpu_var(get_random_int_hash);
>>> -
>>> -   hash[0] += current->pid + jiffies + random_get_entropy();
>>> -   md5_transform(hash, random_int_secret);
>>> -   ret = hash[0];
>>> -   put_cpu_var(get_random_int_hash);
>>> -
>>> -   return ret;
>>> +   unsigned int arch_result;
>>> +   u64 result;
>>> +   struct random_int_secret *secret;
>>> +
>>> +   if (arch_get_random_int(&arch_result))
>>> +   return arch_result;
>>> +
>>> +   secret = get_random_int_secret();
>>> +   result = siphash_3u64(secret->chaining, jiffies,
>>> + (u64)random_get_entropy() + current->pid,
>>> + secret->secret);
>>> +   secret->chaining += result;
>>> +   put_cpu_var(secret);
>>> +   return result;
>>>  }
>>>  EXPORT_SYMBOL(get_random_int);
>>
>> Hmm.  I haven't tried to prove anything for real.  But here goes (in
>> the random oracle model):
>>
>> Suppose I'm an attacker and I don't know the secret or the chaining
>> value.  Then, regardless of what the entropy is, I can't predict the
>> numbers.
>>
>> Now suppose I do know the secret and the chaining value due to some
>> leak.  If I want to deduce prior outputs, I think I'm stuck: I'd need
>> to find a value "result" such that prev_chaining + result = chaining
>> and result = H(prev_chaining, ..., secret);.  I don't think this can
>> be done efficiently in the random oracle model regardless of what the
>> "..." is.
>>
>> But, if I know the secret and chaining value, I can predict the next
>> output assuming I can guess the entropy.  What's worse is that, even
>> if I can't guess the entropy, if I *observe* the next output then I
>> can calculate the next chaining value.
>>
>> So this is probably good enough, and making it better is hard.  Changing it 
>> to:
>>
>> u64 entropy = (u64)random_get_entropy() + current->pid;
>> result = siphash(..., entropy, ...);
>> secret->chaining += result + entropy;
>>
>> would reduce this problem by forcing an attacker to brute-force the
>> entropy on each iteration, which is probably an improvement.
>>
>> To fully fix it, something like "catastrophic reseeding" would be
>> needed, but that's hard to get right.
>
> I wonder if Ted's proposal was analyzed further in terms of performance,
> should get_random_int provide CPRNG-like properties?
>
> For reference: https://lkml.org/lkml/2016/12/14/351
>
> The proposal made sense to me and would completely solve the
> above-mentioned problem at the cost of repeatedly reseeding from the crng.
>

Unless I've misunderstood it, Ted's proposal causes get_random_int()
to return bytes straight from urandom (effectively), which should make
it very strong.  And if urandom is competitively fast now, I don't see
the problem.  ChaCha20 is designed for speed, after all.
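
Concretely, the batching being described looks something like this
sketch (not Ted's actual patch; names are illustrative):

#define BATCH_LONGS 32

struct batched_entropy {
	u64 entropy[BATCH_LONGS];
	unsigned int position;
};
static DEFINE_PER_CPU(struct batched_entropy, batched_entropy);

u64 get_random_long_batched(void)
{
	u64 ret;
	struct batched_entropy *batch = &get_cpu_var(batched_entropy);

	/* position 0 means "empty": refill from the chacha20-backed pool */
	if (batch->position == 0)
		get_random_bytes(batch->entropy, sizeof(batch->entropy));
	ret = batch->entropy[batch->position];
	batch->position = (batch->position + 1) % BATCH_LONGS;
	put_cpu_var(batched_entropy);
	return ret;
}

That refill is what makes 32 calls fast and the 33rd slow: only the
33rd touches the full chacha20 machinery.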


George's crazy full state idea (Re: HalfSipHash Acceptable Usage)

2016-12-21 Thread Andy Lutomirski
On Wed, Dec 21, 2016 at 5:13 PM, George Spelvin
 wrote:
> As a separate message, to disentangle the threads, I'd like to
> talk about get_random_long().
>
> After some thinking, I still like the "state-preserving" construct
> that's equivalent to the current MD5 code.  Yes, we could just do
> siphash(current_cpu || per_cpu_counter, global_key), but it's nice to
> preserve a bit more.
>
> It requires library support from the SipHash code to return the full
> SipHash state, but I hope that's a fair thing to ask for.

I don't even think it needs that.  This is just adding a
non-destructive final operation, right?

>
> Here's my current straw man design for comment.  It's very similar to
> the current MD5-based design, but feeds all the seed material in the
> "correct" way, as opposed to Xring directly into the MD5 state.
>
> * Each CPU has a (Half)SipHash state vector,
>   "unsigned long get_random_int_hash[4]".  Unlike the current
>   MD5 code, we take care to initialize it to an asymmetric state.
>
> * There's a global 256-bit random_int_secret (which we could
>   reseed periodically).
>
> To generate a random number:
> * If get_random_int_hash is all-zero, seed it with a fresh half-sized
>   SipHash key and the appropriate XOR constants.
> * Generate three words of random_get_entropy(), jiffies, and current->pid.
>   (This is arbitrary seed material, copied from the current code.)
> * Crank through that with (Half)SipHash-1-0.
> * Crank through the random_int_secret with (Half)SipHash-1-0.
> * Return v1 ^ v3.

Just to clarify, if we replace SipHash with a black box, I think this
effectively means, where "entropy" is random_get_entropy() || jiffies
|| current->pid:

The first call returns H(random seed || entropy_0 || secret).  The
second call returns H(random seed || entropy_0 || secret || entropy_1
|| secret).  Etc.

If not, then I have a fairly strong preference to keep whatever
construction we come up with consistent with something that could
actually happen with invocations of unmodified SipHash -- then all the
security analysis on SipHash goes through.

Anyway, I have mixed thoughts about the construction.  It manages to
have a wide state at essentially no cost, which buys us quite a bit of
work factor to break it.  Even with full knowledge of the state, an
output doesn't reveal the entropy except to the extent that it can be
brute-forced (this is just whatever the appropriate extended version of
first preimage resistance gives us).  The one thing I don't like is
that I don't see how to prove that you can't run it backwards if you
manage to acquire a memory dump.  In fact, I think that there exist, at
least in theory, hash functions that are secure in the random oracle
model but that *can* be run backwards given the full state.  From
memory, SHA-3 has exactly that property, and it would be a bit sad for
a CSPRNG to be reversible.

We could also periodically mix in a big (128-bit?) chunk of fresh
urandom output to keep the bad guys guessing.

(P.S.  This kind of resembles the duplex sponge construction.  If
hardware SHA-3 ever shows up, a duplex sponge RNG might be nice indeed.)


Re: [PATCH v7 3/6] random: use SipHash in place of MD5

2016-12-21 Thread Hannes Frederic Sowa
On 22.12.2016 00:42, Andy Lutomirski wrote:
> On Wed, Dec 21, 2016 at 3:02 PM, Jason A. Donenfeld  wrote:
>>  unsigned int get_random_int(void)
>>  {
>> -   __u32 *hash;
>> -   unsigned int ret;
>> -
>> -   if (arch_get_random_int(&ret))
>> -   return ret;
>> -
>> -   hash = get_cpu_var(get_random_int_hash);
>> -
>> -   hash[0] += current->pid + jiffies + random_get_entropy();
>> -   md5_transform(hash, random_int_secret);
>> -   ret = hash[0];
>> -   put_cpu_var(get_random_int_hash);
>> -
>> -   return ret;
>> +   unsigned int arch_result;
>> +   u64 result;
>> +   struct random_int_secret *secret;
>> +
>> +   if (arch_get_random_int(&arch_result))
>> +   return arch_result;
>> +
>> +   secret = get_random_int_secret();
>> +   result = siphash_3u64(secret->chaining, jiffies,
>> + (u64)random_get_entropy() + current->pid,
>> + secret->secret);
>> +   secret->chaining += result;
>> +   put_cpu_var(secret);
>> +   return result;
>>  }
>>  EXPORT_SYMBOL(get_random_int);
> 
> Hmm.  I haven't tried to prove anything for real.  But here goes (in
> the random oracle model):
> 
> Suppose I'm an attacker and I don't know the secret or the chaining
> value.  Then, regardless of what the entropy is, I can't predict the
> numbers.
> 
> Now suppose I do know the secret and the chaining value due to some
> leak.  If I want to deduce prior outputs, I think I'm stuck: I'd need
> to find a value "result" such that prev_chaining + result = chaining
> and result = H(prev_chaining, ..., secret);.  I don't think this can
> be done efficiently in the random oracle model regardless of what the
> "..." is.
> 
> But, if I know the secret and chaining value, I can predict the next
> output assuming I can guess the entropy.  What's worse is that, even
> if I can't guess the entropy, if I *observe* the next output then I
> can calculate the next chaining value.
> 
> So this is probably good enough, and making it better is hard.  Changing it 
> to:
> 
> u64 entropy = (u64)random_get_entropy() + current->pid;
> result = siphash(..., entropy, ...);
> secret->chaining += result + entropy;
> 
> would reduce this problem by forcing an attacker to brute-force the
> entropy on each iteration, which is probably an improvement.
> 
> To fully fix it, something like "catastrophic reseeding" would be
> needed, but that's hard to get right.

I wonder if Ted's proposal was analyzed further in terms of performance,
should get_random_int provide CPRNG-like properties?

For reference: https://lkml.org/lkml/2016/12/14/351

The proposal made sense to me and would completely solve the
above-mentioned problem at the cost of repeatedly reseeding from the crng.

Bye,
Hannes




Re: HalfSipHash Acceptable Usage

2016-12-21 Thread Andy Lutomirski
On Wed, Dec 21, 2016 at 9:25 AM, Linus Torvalds
 wrote:
> On Wed, Dec 21, 2016 at 7:55 AM, George Spelvin
>  wrote:
>>
>> How much does kernel_fpu_begin()/kernel_fpu_end() cost?
>
> It's now better than it used to be, but it's absolutely disastrous
> still. We're talking easily many hundreds of cycles. Under some loads,
> thousands.
>
> And I warn you already: it will _benchmark_ a hell of a lot better
> than it will work in reality. In benchmarks, you'll hit all the
> optimizations ("oh, I've already saved away all the FP registers, no
> need to do it again").
>
> In contrast, in reality, especially with things like "do it once or
> twice per incoming packet", you'll easily hit the absolute worst
> cases, where not only does it take a few hundred cycles to save the FP
> state, you'll then return to user space in between packets, which
> triggers the slow-path return code and reloads the FP state, which is
> another few hundred cycles plus.

Hah, you're thinking that the x86 code works the way that Rik and I
want it to work, and you just made my day. :)  What actually happens
is that the state is saved in kernel_fpu_begin() and restored in
kernel_fpu_end(), and it'll take a few hundred cycles best case.  If
you do it a bunch of times in a loop, you *might* trigger a CPU
optimization that notices that the state being saved is the same state
that was just restored, but you're still going to pay for the full
restore code path on each round trip no matter what.

The code is much clearer in 4.10 kernels now that I deleted the unused
"lazy" branches.

>
> Similarly, in benchmarks you'll hit the "modern CPU's power on the AVX
> unit and keep it powered up for a while afterwards", while in real
> life you would quite easily hit the "oh, AVX is powered down because
> we were idle, now it powers up at half speed which is another latency
> hit _and_ the AVX unit won't run full out anyway".

I *think* that was mostly fixed in Broadwell or thereabouts (in terms
of latency -- throughput and power consumption still suffers).


Re: [PATCH v7 1/6] siphash: add cryptographically secure PRF

2016-12-21 Thread Jason A. Donenfeld
On Thu, Dec 22, 2016 at 2:40 AM, Stephen Hemminger
 wrote:
> The networking tree (net-next), which is where you are submitting to,
> is technically closed right now.

That's okay. At some point in the future it will be open. By then v83
of this patch set will be shiny and done, just waiting for the merge
window to open. There's a lot to discuss with this, so getting the
feedback early is beneficial.

Jason


Re: [PATCH v7 1/6] siphash: add cryptographically secure PRF

2016-12-21 Thread Stephen Hemminger
On Thu, 22 Dec 2016 00:02:11 +0100
"Jason A. Donenfeld"  wrote:

> SipHash is a 64-bit keyed hash function that is actually a
> cryptographically secure PRF, like HMAC. Except SipHash is super fast,
> and is meant to be used as a hashtable keyed lookup function, or as a
> general PRF for short input use cases, such as sequence numbers or RNG
> chaining.
> 
> For the first usage:
> 
> There are a variety of attacks known as "hashtable poisoning" in which an
> attacker forms some data such that the hash of that data will be the
> same, and then proceeds to fill up all entries of a hashbucket. This is
> a realistic and well-known denial-of-service vector. Currently
> hashtables use jhash, which is fast but not secure, and some kind of
> rotating key scheme (or none at all, which isn't good). SipHash is meant
> as a replacement for jhash in these cases.
> 
> There are a number of places in the kernel that are vulnerable to
> hashtable poisoning attacks, either via userspace vectors or network
> vectors, and there's not a reliable mechanism inside the kernel at the
> moment to fix it. The first step toward fixing these issues is actually
> getting a secure primitive into the kernel for developers to use. Then
> we can, bit by bit, port things over to it as deemed appropriate.
> 
> While SipHash is extremely fast for a cryptographically secure function,
> it is likely a bit slower than the insecure jhash, and so replacements
> will be evaluated on a case-by-case basis based on whether or not the
> difference in speed is negligible and whether or not the current jhash usage
> poses a real security risk.
> 
> For the second usage:
> 
> A few places in the kernel are using MD5 or SHA1 for creating secure
> sequence numbers, syn cookies, port numbers, or fast random numbers.
> SipHash is a faster, more fitting, and more secure replacement for MD5
> in those situations. Replacing MD5 and SHA1 with SipHash for these uses is
> obvious and straightforward, and so is submitted along with this patch
> series. There shouldn't be much of a debate over its efficacy.
> 
> Dozens of languages are already using this internally for their hash
> tables and PRFs. Some of the BSDs already use this in their kernels.
> SipHash is a widely known high-speed solution to a widely known set of
> problems, and it's time we catch up.
> 
> Signed-off-by: Jason A. Donenfeld 
> Cc: Jean-Philippe Aumasson 
> Cc: Linus Torvalds 
> Cc: Eric Biggers 
> Cc: David Laight 
> Cc: Eric Dumazet 

The networking tree (net-next), which is where you are submitting to,
is technically closed right now.



Re: HalfSipHash Acceptable Usage

2016-12-21 Thread George Spelvin
As a separate message, to disentangle the threads, I'd like to
talk about get_random_long().

After some thinking, I still like the "state-preserving" construct
that's equivalent to the current MD5 code.  Yes, we could just do
siphash(current_cpu || per_cpu_counter, global_key), but it's nice to
preserve a bit more.

It requires library support from the SipHash code to return the full
SipHash state, but I hope that's a fair thing to ask for.

Here's my current straw man design for comment.  It's very similar to
the current MD5-based design, but feeds all the seed material in the
"correct" way, as opposed to Xring directly into the MD5 state.

* Each CPU has a (Half)SipHash state vector,
  "unsigned long get_random_int_hash[4]".  Unlike the current
  MD5 code, we take care to initialize it to an asymmetric state.

* There's a global 256-bit random_int_secret (which we could
  reseed periodically).

To generate a random number:
* If get_random_int_hash is all-zero, seed it with a fresh half-sized
  SipHash key and the appropriate XOR constants.
* Generate three words of random_get_entropy(), jiffies, and current->pid.
  (This is arbitrary seed material, copied from the current code.)
* Crank through that with (Half)SipHash-1-0.
* Crank through the random_int_secret with (Half)SipHash-1-0.
* Return v1 ^ v3.
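
A hedged sketch of the recipe above, on u32 lanes for a 32-bit machine
(HSIPROUND is the HalfSipHash round from the reference design; the
function and variable names here are illustrative, not from any patch):

#define HSIPROUND(v0, v1, v2, v3) do {					\
	v0 += v1; v1 = rol32(v1, 5);  v1 ^= v0; v0 = rol32(v0, 16);	\
	v2 += v3; v3 = rol32(v3, 8);  v3 ^= v2;				\
	v0 += v3; v3 = rol32(v3, 7);  v3 ^= v0;				\
	v2 += v1; v1 = rol32(v1, 13); v1 ^= v2; v2 = rol32(v2, 16);	\
} while (0)

static DEFINE_PER_CPU(u32 [4], get_random_int_hash);
static u32 random_int_secret[8];	/* the 256-bit global secret */

static u32 get_random_int_sketch(void)
{
	u32 *h = get_cpu_var(get_random_int_hash);
	u32 seed[3] = { random_get_entropy(), jiffies, current->pid };
	u32 ret;
	int i;

	/* (first-use seeding to an asymmetric state omitted for brevity) */

	/* (Half)SipHash-1-0: one round per input word, no finalization */
	for (i = 0; i < 3; i++) {
		h[3] ^= seed[i];
		HSIPROUND(h[0], h[1], h[2], h[3]);
		h[0] ^= seed[i];
	}
	for (i = 0; i < 8; i++) {
		h[3] ^= random_int_secret[i];
		HSIPROUND(h[0], h[1], h[2], h[3]);
		h[0] ^= random_int_secret[i];
	}
	ret = h[1] ^ h[3];
	put_cpu_var(get_random_int_hash);
	return ret;
}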

Here are the reasons:
* The first step is just paranoia, but SipHash's security promise depends
  on starting with an asymmetric state, we want unique per-CPU states,
  and it's a one-time cost.
* When the input words are themselves secret, there's no security
  advantage, and almost no speed advantage, to doing two rounds for one
  input word versus two words with one round each.  Thus, SipHash-1.
* The above is not exactly true, due to the before+after XOR pattern
  that SipHash uses, but I think it's true anyway.
* Likewise, there's no benefit to unkeyed finalization rounds over keyed
  ones.  That's why I just enlarged the global secret.
* The per-call seed material is hashed first on general principles,
  because that's the novel part that might have fresh entropy.
* To the extent the initial state is secret, the rounds processing the
  global secret are 4 finalization rounds for the initial state and
  the per-call entropy.
* The final word(s) of the global secret might be vulnerable to analysis,
  due to incomplete mixing, but since the global secret is always hashed
  in the same order, and is larger than the desired security level, the
  initial words should be secure.
* By carrying forward the full internal state, we ensure that repeated
  calls return different results, and to the extent that the per-call
  seed material has entropy, it's preserved.
* The final return is all that's needed, since the last steps in the 
  SipRound are "v1 ^= v2" and "v3 ^= v0".  It's no security loss,
  and a very minor speedup.
* Also, this avoids directly "exposing" the final XOR with the last
  word of the global secret (which is made to v0).

If I'm allowed to use full SipHash, some shortcuts can be taken,
but I believe the above would be secure with HalfSipHash.

If additional performance is required, I'd consider shrinking the
global secret to 192 bits on 32-bit machines, but I want more than
128 bits of key material, and enough rounds to be equivalent to 4
finalization rounds.


Re: [PATCH v7 6/6] siphash: implement HalfSipHash1-3 for hash tables

2016-12-21 Thread Andi Kleen
> 64-bit x86_64:
> [0.509409] test_siphash: SipHash2-4 cycles: 4049181
> [0.510650] test_siphash: SipHash1-3 cycles: 2512884
> [0.512205] test_siphash: HalfSipHash1-3 cycles: 3429920
> [0.512904] test_siphash: JenkinsHash cycles:  978267

I'm not sure what these numbers mean. Surely a single siphash2-4
does not take 4+ million cycles? 

If you run them in a loop, please divide by the number of iterations.

But generally running small code in a loop is often an unrealistic
benchmark strategy because it hides cache misses, primes
predictors, changes frequencies and changes memory costs,
but also can overload pipelines and oversubscribe
resources.

[see also page 46+ in http://halobates.de/applicative-mental-models.pdf]

So the numbers you get there are at least somewhat
dubious. It would be good to have at least some test which
is not just a tiny micro benchmark to compare before making
conclusions.

-Andi


Re: HalfSipHash Acceptable Usage

2016-12-21 Thread George Spelvin
Theodore Ts'o wrote:
> On Wed, Dec 21, 2016 at 01:37:51PM -0500, George Spelvin wrote:
>> SipHash annihilates the competition on 64-bit superscalar hardware.
>> SipHash dominates the field on 64-bit in-order hardware.
>> SipHash wins easily on 32-bit hardware *with enough registers*.
>> On register-starved 32-bit machines, it really struggles.

> And "with enough registers" includes ARM and MIPS, right?

Yes.  As a matter of fact, 32-bit ARM does particularly well
on 64-bit SipHash due to its shift+op instructions.

There is a noticeable performance drop, but nothing catastrophic.

The main thing I've been worried about is all the flow tracking
and NAT done by small home routers, and that's addressed by using
HalfSipHash for the hash tables.  They don't *initiate* a lot of
TCP sessions.
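
For the router case, the hash-table usage looks something like this
sketch (hsiphash()/hsiphash_key_t per the API proposed in this series;
the flow structure is illustrative):

struct flow_tuple {
	__be32 saddr, daddr;
	__be16 sport, dport;
};

static hsiphash_key_t flow_hash_key __read_mostly;

static u32 flow_hash(const struct flow_tuple *t)
{
	/* keyed, so remote hosts can't aim traffic at one hash bucket */
	return hsiphash(t, sizeof(*t), &flow_hash_key);
}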

> So the only
> real problem is 32-bit x86, and you're right, at that point, only
> people who might care are people who are using a space-radiation
> hardened 386 --- and they're not likely to be doing high throughput
> TCP connections.  :-)

The only requirement on performance is "don't make DaveM angry." :-)

I was just trying to answer the question of why we *worried* about the
performance, not specifically argue that we *should* use HalfSipHash.


Re: [PATCH v7 3/6] random: use SipHash in place of MD5

2016-12-21 Thread Andy Lutomirski
On Wed, Dec 21, 2016 at 3:02 PM, Jason A. Donenfeld  wrote:
>  unsigned int get_random_int(void)
>  {
> -   __u32 *hash;
> -   unsigned int ret;
> -
> -   if (arch_get_random_int(&ret))
> -   return ret;
> -
> -   hash = get_cpu_var(get_random_int_hash);
> -
> -   hash[0] += current->pid + jiffies + random_get_entropy();
> -   md5_transform(hash, random_int_secret);
> -   ret = hash[0];
> -   put_cpu_var(get_random_int_hash);
> -
> -   return ret;
> +   unsigned int arch_result;
> +   u64 result;
> +   struct random_int_secret *secret;
> +
> +   if (arch_get_random_int(&arch_result))
> +   return arch_result;
> +
> +   secret = get_random_int_secret();
> +   result = siphash_3u64(secret->chaining, jiffies,
> + (u64)random_get_entropy() + current->pid,
> + secret->secret);
> +   secret->chaining += result;
> +   put_cpu_var(secret);
> +   return result;
>  }
>  EXPORT_SYMBOL(get_random_int);

Hmm.  I haven't tried to prove anything for real.  But here goes (in
the random oracle model):

Suppose I'm an attacker and I don't know the secret or the chaining
value.  Then, regardless of what the entropy is, I can't predict the
numbers.

Now suppose I do know the secret and the chaining value due to some
leak.  If I want to deduce prior outputs, I think I'm stuck: I'd need
to find a value "result" such that prev_chaining + result = chaining
and result = H(prev_chaining, ..., secret);.  I don't think this can
be done efficiently in the random oracle model regardless of what the
"..." is.

But, if I know the secret and chaining value, I can predict the next
output assuming I can guess the entropy.  What's worse is that, even
if I can't guess the entropy, if I *observe* the next output then I
can calculate the next chaining value.

So this is probably good enough, and making it better is hard.  Changing it to:

u64 entropy = (u64)random_get_entropy() + current->pid;
result = siphash(..., entropy, ...);
secret->chaining += result + entropy;

would reduce this problem by forcing an attacker to brute-force the
entropy on each iteration, which is probably an improvement.

To fully fix it, something like "catastrophic reseeding" would be
needed, but that's hard to get right.
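
Filled in the same way as the patch hunk quoted above, that tweak would
look roughly like this (illustrative only, reusing the v7 helpers):

unsigned int get_random_int_tweaked(void)
{
	struct random_int_secret *secret = get_random_int_secret();
	u64 entropy = (u64)random_get_entropy() + current->pid;
	u64 result = siphash_3u64(secret->chaining, jiffies, entropy,
				  secret->secret);

	/* folding the entropy into the chain forces an attacker who
	 * holds the state to brute-force it at every step */
	secret->chaining += result + entropy;
	put_cpu_var(secret);
	return result;
}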

(An aside: on x86 at least, using two percpu variables is faster
because direct percpu access is essentially free, whereas getting
the address of a percpu variable is not free.)
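
(Illustratively, with made-up names:

static DEFINE_PER_CPU(u64, random_int_chaining);

	/* this_cpu_read()/this_cpu_write() compile to direct gs-relative
	 * accesses on x86 -- no address computation needed */
	chain = this_cpu_read(random_int_chaining) + result + entropy;
	this_cpu_write(random_int_chaining, chain);

as opposed to taking the address with get_cpu_var() as in the patch.)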
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 3/6] random: use SipHash in place of MD5

2016-12-21 Thread Jason A. Donenfeld
Hi Ted,

On Thu, Dec 22, 2016 at 12:02 AM, Jason A. Donenfeld  wrote:
> This duplicates the current algorithm for get_random_int/long

I should have mentioned this directly in the commit message, which I
forgot to update: this v7 adds the time-based key rotation, which,
while not strictly necessary for ensuring the security of the RNG,
might help alleviate some concerns, as we talked about. Performance is
quite good on both 32-bit and 64-bit -- better than MD5 in both cases.

If you like this, terrific. If not, I'm happy to take this in whatever
direction you prefer, and implement whatever construction you think
best. There's been a lot of noise on this list about it; we can
continue to discuss more, or you can just tell me whatever you want to
do, and I'll implement it and that'll be the end of it. As you said,
we can always get something decent now and improve it later.

Alternatively, if you've decided in the end you prefer your batched
entropy approach using chacha, I'm happy to implement a polished
version of that here in this patch series (so that we can keep the `rm
lib/md5.c` commit.)

Just let me know how you'd like to proceed.

Thanks,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v7 6/6] siphash: implement HalfSipHash1-3 for hash tables

2016-12-21 Thread Jason A. Donenfeld
HalfSipHash, or hsiphash, is a shortened version of SipHash, which
generates 32-bit outputs using a weaker 64-bit key. It has *much* lower
security margins, and shouldn't be used for anything too sensitive, but
it could be used as a hashtable key function replacement, if the output
is never exposed, and if the security requirement is not too high.

The goal is to make this something that performance-critical jhash users
would be willing to use.

On 64-bit machines, HalfSipHash1-3 is slower than SipHash1-3, so on those
systems we alias hsiphash() to SipHash1-3.

64-bit x86_64:
[0.509409] test_siphash: SipHash2-4 cycles: 4049181
[0.510650] test_siphash: SipHash1-3 cycles: 2512884
[0.512205] test_siphash: HalfSipHash1-3 cycles: 3429920
[0.512904] test_siphash: JenkinsHash cycles:  978267
So, we map hsiphash() -> SipHash1-3

32-bit x86:
[0.509868] test_siphash: SipHash2-4 cycles: 14812892
[0.513601] test_siphash: SipHash1-3 cycles:  9510710
[0.515263] test_siphash: HalfSipHash1-3 cycles:  3856157
[0.515952] test_siphash: JenkinsHash cycles:  1148567
So, we map hsiphash() -> HalfSipHash1-3

hsiphash() is roughly 3 times slower than jhash(), but comes with a
considerable security improvement.

Signed-off-by: Jason A. Donenfeld 
Cc: Jean-Philippe Aumasson 
---
 Documentation/siphash.txt |  75 +++
 include/linux/siphash.h   |  56 +++-
 lib/siphash.c             | 318 +-
 lib/test_siphash.c        | 139 
 4 files changed, 561 insertions(+), 27 deletions(-)

diff --git a/Documentation/siphash.txt b/Documentation/siphash.txt
index 39ff7f0438e7..f93c1d7104c4 100644
--- a/Documentation/siphash.txt
+++ b/Documentation/siphash.txt
@@ -77,3 +77,78 @@ Linux implements the "2-4" variant of SipHash.
 
 Read the SipHash paper if you're interested in learning more:
 https://131002.net/siphash/siphash.pdf
+
+
+~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~
+
+HalfSipHash - SipHash's insecure younger cousin
+-----------------------------------------------
+Written by Jason A. Donenfeld 
+
+On the off-chance that SipHash is not fast enough for your needs, you might be
+able to justify using HalfSipHash, a terrifying but potentially useful
+possibility. HalfSipHash cuts SipHash's rounds down from "2-4" to "1-3" and,
+even scarier, uses an easily brute-forcable 64-bit key (with a 32-bit output)
+instead of SipHash's 128-bit key. However, this may appeal to some
+high-performance `jhash` users.
+
+Danger!
+
+Do not ever use HalfSipHash except for as a hashtable key function, and only
+then when you can be absolutely certain that the outputs will never be
+transmitted out of the kernel. This is only remotely useful over `jhash` as a
+means of mitigating hashtable flooding denial of service attacks.
+
+1. Generating a key
+
+Keys should always be generated from a cryptographically secure source of
+random numbers, either using get_random_bytes or get_random_once:
+
+hsiphash_key_t key;
+get_random_bytes(key, sizeof(key));
+
+If you're not deriving your key from here, you're doing it wrong.
+
+2. Using the functions
+
+There are two variants of the function, one that takes a list of integers, and
+one that takes a buffer:
+
+u32 hsiphash(const void *data, size_t len, hsiphash_key_t key);
+
+And:
+
+u32 hsiphash_1u32(u32, hsiphash_key_t key);
+u32 hsiphash_2u32(u32, u32, hsiphash_key_t key);
+u32 hsiphash_3u32(u32, u32, u32, hsiphash_key_t key);
+u32 hsiphash_4u32(u32, u32, u32, u32, hsiphash_key_t key);
+
+If you pass the generic hsiphash function something of a constant length, it
+will constant fold at compile-time and automatically choose one of the
+optimized functions.
+
+3. Hashtable key function usage:
+
+struct some_hashtable {
+   DECLARE_HASHTABLE(hashtable, 8);
+   hsiphash_key_t key;
+};
+
+void init_hashtable(struct some_hashtable *table)
+{
+   get_random_bytes(table->key, sizeof(table->key));
+}
+
+static inline struct hlist_head *some_hashtable_bucket(struct some_hashtable *table, struct interesting_input *input)
+{
+   return &table->hashtable[hsiphash(input, sizeof(*input), table->key) & (HASH_SIZE(table->hashtable) - 1)];
+}
+
+You may then iterate like usual over the returned hash bucket.
+
+4. Performance
+
+HalfSipHash is roughly 3 times slower than JenkinsHash. For many replacements,
+this will not be a problem, as the hashtable lookup isn't the bottleneck. And
+in general, this is probably a good sacrifice to make for the security and DoS
+resistance of HalfSipHash.
diff --git a/include/linux/siphash.h b/include/linux/siphash.h
index 7aa666eb00d9..efab44c654f3 100644
--- a/include/linux/siphash.h
+++ b/include/linux/siphash.h
@@ -5,7 +5,9 @@
  * SipHash: a fast short-input PRF
  * https://131002.net/siphash/
  *
- * This implementation is specifically for SipHash2-4.
+ * This 

[PATCH v7 5/6] syncookies: use SipHash in place of SHA1

2016-12-21 Thread Jason A. Donenfeld
SHA1 is slower and less secure than SipHash, and so replacing syncookie
generation with SipHash makes natural sense. Some BSDs have been doing
this for several years in fact.

The speedup should be similar -- and even more impressive -- to the
speedup from the sequence number fix in this series.

Signed-off-by: Jason A. Donenfeld 
Cc: Eric Dumazet 
Cc: David Miller 
---
 net/ipv4/syncookies.c | 20 
 net/ipv6/syncookies.c | 37 -
 2 files changed, 20 insertions(+), 37 deletions(-)

diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 3e88467d70ee..03bb068f 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -13,13 +13,13 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
 #include 
 
-static u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS] __read_mostly;
+static siphash_key_t syncookie_secret[2] __read_mostly;
 
 #define COOKIEBITS 24  /* Upper bits store count */
 #define COOKIEMASK (((__u32)1 << COOKIEBITS) - 1)
@@ -48,24 +48,12 @@ static u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS] 
__read_mostly;
 #define TSBITS 6
 #define TSMASK (((__u32)1 << TSBITS) - 1)
 
-static DEFINE_PER_CPU(__u32 [16 + 5 + SHA_WORKSPACE_WORDS], ipv4_cookie_scratch);
-
 static u32 cookie_hash(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport,
   u32 count, int c)
 {
-   __u32 *tmp;
-
net_get_random_once(syncookie_secret, sizeof(syncookie_secret));
-
-   tmp  = this_cpu_ptr(&ipv4_cookie_scratch);
-   memcpy(tmp + 4, syncookie_secret[c], sizeof(syncookie_secret[c]));
-   tmp[0] = (__force u32)saddr;
-   tmp[1] = (__force u32)daddr;
-   tmp[2] = ((__force u32)sport << 16) + (__force u32)dport;
-   tmp[3] = count;
-   sha_transform(tmp + 16, (__u8 *)tmp, tmp + 16 + 5);
-
-   return tmp[17];
+   return siphash_4u32(saddr, daddr, (u32)sport << 16 | dport, count,
+   syncookie_secret[c]);
 }
 
 
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index a4d49760bf43..be51fc0d99ad 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -16,7 +16,7 @@
 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -24,7 +24,7 @@
 #define COOKIEBITS 24  /* Upper bits store count */
 #define COOKIEMASK (((__u32)1 << COOKIEBITS) - 1)
 
-static u32 syncookie6_secret[2][16-4+SHA_DIGEST_WORDS] __read_mostly;
+static siphash_key_t syncookie6_secret[2] __read_mostly;
 
 /* RFC 2460, Section 8.3:
  * [ipv6 tcp] MSS must be computed as the maximum packet size minus 60 [..]
@@ -41,30 +41,25 @@ static __u16 const msstab[] = {
9000 - 60,
 };
 
-static DEFINE_PER_CPU(__u32 [16 + 5 + SHA_WORKSPACE_WORDS], ipv6_cookie_scratch);
-
 static u32 cookie_hash(const struct in6_addr *saddr, const struct in6_addr *daddr,
   __be16 sport, __be16 dport, u32 count, int c)
 {
-   __u32 *tmp;
+   const struct {
+   struct in6_addr saddr;
+   struct in6_addr daddr;
+   u32 count;
+   u16 sport;
+   u16 dport;
+   } __aligned(SIPHASH_ALIGNMENT) combined = {
+   .saddr = *saddr,
+   .daddr = *daddr,
+   .count = count,
+   .sport = sport,
+   .dport = dport
+   };
 
net_get_random_once(syncookie6_secret, sizeof(syncookie6_secret));
-
-   tmp  = this_cpu_ptr(&ipv6_cookie_scratch);
-
-   /*
-* we have 320 bits of information to hash, copy in the remaining
-* 192 bits required for sha_transform, from the syncookie6_secret
-* and overwrite the digest with the secret
-*/
-   memcpy(tmp + 10, syncookie6_secret[c], 44);
-   memcpy(tmp, saddr, 16);
-   memcpy(tmp + 4, daddr, 16);
-   tmp[8] = ((__force u32)sport << 16) + (__force u32)dport;
-   tmp[9] = count;
-   sha_transform(tmp + 16, (__u8 *)tmp, tmp + 16 + 5);
-
-   return tmp[17];
+   return siphash(&combined, offsetofend(typeof(combined), dport), syncookie6_secret[c]);
 }
 
 static __u32 secure_tcp_syn_cookie(const struct in6_addr *saddr,
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v7 2/6] secure_seq: use SipHash in place of MD5

2016-12-21 Thread Jason A. Donenfeld
This gives a clear speed and security improvement. SipHash is both
faster and more solid crypto than the aging MD5.

Rather than manually filling MD5 buffers, for IPv6, we simply create
a layout with a simple anonymous struct, for which gcc generates
rather efficient code. For IPv4, we pass the values directly to the
short input convenience functions.

64-bit x86_64:
[1.683628] secure_tcpv6_sequence_number_md5# cycles: 99563527
[1.717350] secure_tcp_sequence_number_md5# cycles: 92890502
[1.741968] secure_tcpv6_sequence_number_siphash# cycles: 67825362
[1.762048] secure_tcp_sequence_number_siphash# cycles: 67485526

32-bit x86:
[1.600012] secure_tcpv6_sequence_number_md5# cycles: 103227892
[1.634219] secure_tcp_sequence_number_md5# cycles: 94732544
[1.669102] secure_tcpv6_sequence_number_siphash# cycles: 96299384
[1.700165] secure_tcp_sequence_number_siphash# cycles: 86015473

Signed-off-by: Jason A. Donenfeld 
Cc: Andi Kleen 
Cc: David Miller 
Cc: David Laight 
Cc: Tom Herbert 
Cc: Hannes Frederic Sowa 
Cc: Eric Dumazet 
---
 net/core/secure_seq.c | 135 --
 1 file changed, 54 insertions(+), 81 deletions(-)

diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c
index 88a8e429fc3e..3dc2689bcc64 100644
--- a/net/core/secure_seq.c
+++ b/net/core/secure_seq.c
@@ -1,3 +1,5 @@
+/* Copyright (C) 2016 Jason A. Donenfeld . All Rights 
Reserved. */
+
 #include 
 #include 
 #include 
@@ -8,14 +10,14 @@
 #include 
 #include 
 #include 
-
+#include 
 #include 
 
 #if IS_ENABLED(CONFIG_IPV6) || IS_ENABLED(CONFIG_INET)
+#include 
 #include 
-#define NET_SECRET_SIZE (MD5_MESSAGE_BYTES / 4)
 
-static u32 net_secret[NET_SECRET_SIZE] cacheline_aligned;
+static siphash_key_t net_secret;
 
 static __always_inline void net_secret_init(void)
 {
@@ -44,80 +46,65 @@ static u32 seq_scale(u32 seq)
 u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr,
 __be16 sport, __be16 dport, u32 *tsoff)
 {
-   u32 secret[MD5_MESSAGE_BYTES / 4];
-   u32 hash[MD5_DIGEST_WORDS];
-   u32 i;
-
+   const struct {
+   struct in6_addr saddr;
+   struct in6_addr daddr;
+   __be16 sport;
+   __be16 dport;
+   } __aligned(SIPHASH_ALIGNMENT) combined = {
+   .saddr = *(struct in6_addr *)saddr,
+   .daddr = *(struct in6_addr *)daddr,
+   .sport = sport,
+   .dport = dport
+   };
+   u64 hash;
net_secret_init();
-   memcpy(hash, saddr, 16);
-   for (i = 0; i < 4; i++)
-   secret[i] = net_secret[i] + (__force u32)daddr[i];
-   secret[4] = net_secret[4] +
-   (((__force u16)sport << 16) + (__force u16)dport);
-   for (i = 5; i < MD5_MESSAGE_BYTES / 4; i++)
-   secret[i] = net_secret[i];
-
-   md5_transform(hash, secret);
-
-   *tsoff = sysctl_tcp_timestamps == 1 ? hash[1] : 0;
-   return seq_scale(hash[0]);
+   hash = siphash(&combined, offsetofend(typeof(combined), dport), net_secret);
+   *tsoff = sysctl_tcp_timestamps == 1 ? (hash >> 32) : 0;
+   return seq_scale(hash);
 }
 EXPORT_SYMBOL(secure_tcpv6_sequence_number);
 
 u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr,
   __be16 dport)
 {
-   u32 secret[MD5_MESSAGE_BYTES / 4];
-   u32 hash[MD5_DIGEST_WORDS];
-   u32 i;
-
+   const struct {
+   struct in6_addr saddr;
+   struct in6_addr daddr;
+   __be16 dport;
+   } __aligned(SIPHASH_ALIGNMENT) combined = {
+   .saddr = *(struct in6_addr *)saddr,
+   .daddr = *(struct in6_addr *)daddr,
+   .dport = dport
+   };
net_secret_init();
-   memcpy(hash, saddr, 16);
-   for (i = 0; i < 4; i++)
-   secret[i] = net_secret[i] + (__force u32) daddr[i];
-   secret[4] = net_secret[4] + (__force u32)dport;
-   for (i = 5; i < MD5_MESSAGE_BYTES / 4; i++)
-   secret[i] = net_secret[i];
-
-   md5_transform(hash, secret);
-
-   return hash[0];
+   return siphash(&combined, offsetofend(typeof(combined), dport), net_secret);
 }
 EXPORT_SYMBOL(secure_ipv6_port_ephemeral);
 #endif
 
 #ifdef CONFIG_INET
 
+/* secure_tcp_sequence_number(a, b, 0, d) == secure_ipv4_port_ephemeral(a, b, d),
+ * but fortunately, `sport' cannot be 0 in any circumstances. If this changes,
+ * it would be easy enough to have the former function use siphash_4u32, passing
+ * the arguments as separate u32.
+ */
+
 u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
   __be16 sport, __be16 dport, u32 *tsoff)
 {
-   u32 hash[MD5_DIGEST_WORDS];
-
+   u64 hash;
 

[PATCH v7 3/6] random: use SipHash in place of MD5

2016-12-21 Thread Jason A. Donenfeld
This duplicates the current algorithm for get_random_int/long, but uses
siphash instead. This comes with several benefits. It's certainly
faster and more cryptographically secure than MD5. This patch also
separates hashed fields into three values instead of one, in order to
increase diffusion.

The previous MD5 algorithm used a per-cpu MD5 state, which caused
successive calls to the function to chain upon each other. While it's
not entirely clear that this kind of chaining is absolutely necessary
when using a secure PRF like siphash, it can't hurt, and the timing of
the call chain does add a degree of natural entropy. So, in keeping with
this design, instead of the massive per-cpu 64-byte MD5 state, there is
instead a per-cpu previously returned value for chaining.

The speed benefits are substantial:

                | siphash | md5    | speedup |
------------------------------------------------
get_random_long | 137130  | 415983 | 3.03x   |
get_random_int  | 86384   | 343323 | 3.97x   |

Signed-off-by: Jason A. Donenfeld 
Cc: Jean-Philippe Aumasson 
Cc: Ted Tso 
---
 drivers/char/random.c  | 84 +-
 include/linux/random.h |  1 -
 init/main.c|  1 -
 3 files changed, 49 insertions(+), 37 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d6876d506220..ea9858d9d8b4 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -262,6 +262,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -2042,17 +2043,31 @@ struct ctl_table random_table[] = {
 };
 #endif /* CONFIG_SYSCTL */
 
-static u32 random_int_secret[MD5_MESSAGE_BYTES / 4] cacheline_aligned;
 
-int random_int_secret_init(void)
+struct random_int_secret {
+   siphash_key_t secret;
+   u64 chaining;
+   unsigned long birthdate;
+   bool initialized;
+};
+static DEFINE_PER_CPU(struct random_int_secret, random_int_secret);
+
+enum {
+   SECRET_ROTATION_TIME = HZ * 60 * 5
+};
+
+static struct random_int_secret *get_random_int_secret(void)
 {
-   get_random_bytes(random_int_secret, sizeof(random_int_secret));
-   return 0;
+   struct random_int_secret *secret = &get_cpu_var(random_int_secret);
+   if (unlikely(!secret->initialized ||
+!time_is_after_jiffies(secret->birthdate + 
SECRET_ROTATION_TIME))) {
+   secret->initialized = true;
+   secret->birthdate = jiffies;
+   get_random_bytes(secret->secret, sizeof(secret->secret));
+   }
+   return secret;
 }
 
-static DEFINE_PER_CPU(__u32 [MD5_DIGEST_WORDS], get_random_int_hash)
-   __aligned(sizeof(unsigned long));
-
 /*
  * Get a random word for internal kernel use only. Similar to urandom but
  * with the goal of minimal entropy pool depletion. As a result, the random
@@ -2061,20 +2076,20 @@ static DEFINE_PER_CPU(__u32 [MD5_DIGEST_WORDS], 
get_random_int_hash)
  */
 unsigned int get_random_int(void)
 {
-   __u32 *hash;
-   unsigned int ret;
-
-   if (arch_get_random_int())
-   return ret;
-
-   hash = get_cpu_var(get_random_int_hash);
-
-   hash[0] += current->pid + jiffies + random_get_entropy();
-   md5_transform(hash, random_int_secret);
-   ret = hash[0];
-   put_cpu_var(get_random_int_hash);
-
-   return ret;
+   unsigned int arch_result;
+   u64 result;
+   struct random_int_secret *secret;
+
+   if (arch_get_random_int(&arch_result))
+   return arch_result;
+
+   secret = get_random_int_secret();
+   result = siphash_3u64(secret->chaining, jiffies,
+ (u64)random_get_entropy() + current->pid,
+ secret->secret);
+   secret->chaining += result;
+   put_cpu_var(secret);
+   return result;
 }
 EXPORT_SYMBOL(get_random_int);
 
@@ -2083,20 +2098,19 @@ EXPORT_SYMBOL(get_random_int);
  */
 unsigned long get_random_long(void)
 {
-   __u32 *hash;
-   unsigned long ret;
-
-   if (arch_get_random_long())
-   return ret;
-
-   hash = get_cpu_var(get_random_int_hash);
-
-   hash[0] += current->pid + jiffies + random_get_entropy();
-   md5_transform(hash, random_int_secret);
-   ret = *(unsigned long *)hash;
-   put_cpu_var(get_random_int_hash);
-
-   return ret;
+   unsigned long arch_result;
+   u64 result;
+   struct random_int_secret *secret;
+
+   if (arch_get_random_long(&arch_result))
+   return arch_result;
+
+   secret = get_random_int_secret();
+   result = siphash_3u64(secret->chaining, jiffies, random_get_entropy() +
+ current->pid, secret->secret);
+   secret->chaining += result;
+   put_cpu_var(secret);
+   return result;
 }
 EXPORT_SYMBOL(get_random_long);
 
diff --git a/include/linux/random.h b/include/linux/random.h
index 

[PATCH v7 4/6] md5: remove from lib and only live in crypto

2016-12-21 Thread Jason A. Donenfeld
The md5_transform function is no longer used anywhere in the tree,
except for the crypto api's actual implementation of md5, so we can drop
the function from lib and put it as a static function of the crypto
file, where it belongs. There should be no new users of md5_transform,
anyway, since there are more modern ways of doing what it once achieved.

Signed-off-by: Jason A. Donenfeld 
---
 crypto/md5.c | 95 +++-
 lib/Makefile |  2 +-
 lib/md5.c    | 95 
 3 files changed, 95 insertions(+), 97 deletions(-)
 delete mode 100644 lib/md5.c

diff --git a/crypto/md5.c b/crypto/md5.c
index 2355a7c25c45..f7ae1a48225b 100644
--- a/crypto/md5.c
+++ b/crypto/md5.c
@@ -21,9 +21,11 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
+#define MD5_DIGEST_WORDS 4
+#define MD5_MESSAGE_BYTES 64
+
 const u8 md5_zero_message_hash[MD5_DIGEST_SIZE] = {
0xd4, 0x1d, 0x8c, 0xd9, 0x8f, 0x00, 0xb2, 0x04,
0xe9, 0x80, 0x09, 0x98, 0xec, 0xf8, 0x42, 0x7e,
@@ -47,6 +49,97 @@ static inline void cpu_to_le32_array(u32 *buf, unsigned int 
words)
}
 }
 
+#define F1(x, y, z)(z ^ (x & (y ^ z)))
+#define F2(x, y, z)F1(z, x, y)
+#define F3(x, y, z)(x ^ y ^ z)
+#define F4(x, y, z)(y ^ (x | ~z))
+
+#define MD5STEP(f, w, x, y, z, in, s) \
+   (w += f(x, y, z) + in, w = (w<<s | w>>(32-s)) + x)
+
+static void md5_transform(__u32 *hash, __u32 const *in)
+{
+   u32 a, b, c, d;
+
+   a = hash[0];
+   b = hash[1];
+   c = hash[2];
+   d = hash[3];
+
+   MD5STEP(F1, a, b, c, d, in[0] + 0xd76aa478, 7);
+   MD5STEP(F1, d, a, b, c, in[1] + 0xe8c7b756, 12);
+   MD5STEP(F1, c, d, a, b, in[2] + 0x242070db, 17);
+   MD5STEP(F1, b, c, d, a, in[3] + 0xc1bdceee, 22);
+   MD5STEP(F1, a, b, c, d, in[4] + 0xf57c0faf, 7);
+   MD5STEP(F1, d, a, b, c, in[5] + 0x4787c62a, 12);
+   MD5STEP(F1, c, d, a, b, in[6] + 0xa8304613, 17);
+   MD5STEP(F1, b, c, d, a, in[7] + 0xfd469501, 22);
+   MD5STEP(F1, a, b, c, d, in[8] + 0x698098d8, 7);
+   MD5STEP(F1, d, a, b, c, in[9] + 0x8b44f7af, 12);
+   MD5STEP(F1, c, d, a, b, in[10] + 0xffff5bb1, 17);
+   MD5STEP(F1, b, c, d, a, in[11] + 0x895cd7be, 22);
+   MD5STEP(F1, a, b, c, d, in[12] + 0x6b901122, 7);
+   MD5STEP(F1, d, a, b, c, in[13] + 0xfd987193, 12);
+   MD5STEP(F1, c, d, a, b, in[14] + 0xa679438e, 17);
+   MD5STEP(F1, b, c, d, a, in[15] + 0x49b40821, 22);
+
+   MD5STEP(F2, a, b, c, d, in[1] + 0xf61e2562, 5);
+   MD5STEP(F2, d, a, b, c, in[6] + 0xc040b340, 9);
+   MD5STEP(F2, c, d, a, b, in[11] + 0x265e5a51, 14);
+   MD5STEP(F2, b, c, d, a, in[0] + 0xe9b6c7aa, 20);
+   MD5STEP(F2, a, b, c, d, in[5] + 0xd62f105d, 5);
+   MD5STEP(F2, d, a, b, c, in[10] + 0x02441453, 9);
+   MD5STEP(F2, c, d, a, b, in[15] + 0xd8a1e681, 14);
+   MD5STEP(F2, b, c, d, a, in[4] + 0xe7d3fbc8, 20);
+   MD5STEP(F2, a, b, c, d, in[9] + 0x21e1cde6, 5);
+   MD5STEP(F2, d, a, b, c, in[14] + 0xc33707d6, 9);
+   MD5STEP(F2, c, d, a, b, in[3] + 0xf4d50d87, 14);
+   MD5STEP(F2, b, c, d, a, in[8] + 0x455a14ed, 20);
+   MD5STEP(F2, a, b, c, d, in[13] + 0xa9e3e905, 5);
+   MD5STEP(F2, d, a, b, c, in[2] + 0xfcefa3f8, 9);
+   MD5STEP(F2, c, d, a, b, in[7] + 0x676f02d9, 14);
+   MD5STEP(F2, b, c, d, a, in[12] + 0x8d2a4c8a, 20);
+
+   MD5STEP(F3, a, b, c, d, in[5] + 0xfffa3942, 4);
+   MD5STEP(F3, d, a, b, c, in[8] + 0x8771f681, 11);
+   MD5STEP(F3, c, d, a, b, in[11] + 0x6d9d6122, 16);
+   MD5STEP(F3, b, c, d, a, in[14] + 0xfde5380c, 23);
+   MD5STEP(F3, a, b, c, d, in[1] + 0xa4beea44, 4);
+   MD5STEP(F3, d, a, b, c, in[4] + 0x4bdecfa9, 11);
+   MD5STEP(F3, c, d, a, b, in[7] + 0xf6bb4b60, 16);
+   MD5STEP(F3, b, c, d, a, in[10] + 0xbebfbc70, 23);
+   MD5STEP(F3, a, b, c, d, in[13] + 0x289b7ec6, 4);
+   MD5STEP(F3, d, a, b, c, in[0] + 0xeaa127fa, 11);
+   MD5STEP(F3, c, d, a, b, in[3] + 0xd4ef3085, 16);
+   MD5STEP(F3, b, c, d, a, in[6] + 0x04881d05, 23);
+   MD5STEP(F3, a, b, c, d, in[9] + 0xd9d4d039, 4);
+   MD5STEP(F3, d, a, b, c, in[12] + 0xe6db99e5, 11);
+   MD5STEP(F3, c, d, a, b, in[15] + 0x1fa27cf8, 16);
+   MD5STEP(F3, b, c, d, a, in[2] + 0xc4ac5665, 23);
+
+   MD5STEP(F4, a, b, c, d, in[0] + 0xf4292244, 6);
+   MD5STEP(F4, d, a, b, c, in[7] + 0x432aff97, 10);
+   MD5STEP(F4, c, d, a, b, in[14] + 0xab9423a7, 15);
+   MD5STEP(F4, b, c, d, a, in[5] + 0xfc93a039, 21);
+   MD5STEP(F4, a, b, c, d, in[12] + 0x655b59c3, 6);
+   MD5STEP(F4, d, a, b, c, in[3] + 0x8f0ccc92, 10);
+   MD5STEP(F4, c, d, a, b, in[10] + 0xffeff47d, 15);
+   MD5STEP(F4, b, c, d, a, in[1] + 0x85845dd1, 21);
+   MD5STEP(F4, a, b, c, d, in[8] + 0x6fa87e4f, 6);
+   MD5STEP(F4, d, a, b, c, in[15] + 0xfe2ce6e0, 10);
+   MD5STEP(F4, c, d, a, b, in[6] + 0xa3014314, 15);
+ 

[PATCH v7 1/6] siphash: add cryptographically secure PRF

2016-12-21 Thread Jason A. Donenfeld
SipHash is a 64-bit keyed hash function that is actually a
cryptographically secure PRF, like HMAC. Except SipHash is super fast,
and is meant to be used as a hashtable keyed lookup function, or as a
general PRF for short input use cases, such as sequence numbers or RNG
chaining.

For the first usage:

There are a variety of attacks known as "hashtable poisoning" in which an
attacker forms some data such that the hash of that data will be the
same, and then proceeds to fill up all entries of a hashbucket. This is
a realistic and well-known denial-of-service vector. Currently
hashtables use jhash, which is fast but not secure, and some kind of
rotating key scheme (or none at all, which isn't good). SipHash is meant
as a replacement for jhash in these cases.

There are a modicum of places in the kernel that are vulnerable to
hashtable poisoning attacks, either via userspace vectors or network
vectors, and there's not a reliable mechanism inside the kernel at the
moment to fix it. The first step toward fixing these issues is actually
getting a secure primitive into the kernel for developers to use. Then
we can, bit by bit, port things over to it as deemed appropriate.

While SipHash is extremely fast for a cryptographically secure function,
it is likely a bit slower than the insecure jhash, and so replacements
will be evaluated on a case-by-case basis based on whether or not the
difference in speed is negligible and whether or not the current jhash usage
poses a real security risk.

For the second usage:

A few places in the kernel are using MD5 or SHA1 for creating secure
sequence numbers, syn cookies, port numbers, or fast random numbers.
SipHash is a faster, more fitting, and more secure replacement for MD5
in those situations. Replacing MD5 and SHA1 with SipHash for these uses is
obvious and straightforward, and so is submitted along with this patch
series. There shouldn't be much of a debate over its efficacy.

Dozens of languages are already using this internally for their hash
tables and PRFs. Some of the BSDs already use this in their kernels.
SipHash is a widely known high-speed solution to a widely known set of
problems, and it's time we catch up.

Signed-off-by: Jason A. Donenfeld 
Cc: Jean-Philippe Aumasson 
Cc: Linus Torvalds 
Cc: Eric Biggers 
Cc: David Laight 
Cc: Eric Dumazet 
---
 Documentation/siphash.txt |  79 
 MAINTAINERS               |   7 ++
 include/linux/siphash.h   |  79 
 lib/Kconfig.debug         |   6 +-
 lib/Makefile              |   5 +-
 lib/siphash.c             | 232 ++
 lib/test_siphash.c        | 119 
 7 files changed, 522 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/siphash.txt
 create mode 100644 include/linux/siphash.h
 create mode 100644 lib/siphash.c
 create mode 100644 lib/test_siphash.c

diff --git a/Documentation/siphash.txt b/Documentation/siphash.txt
new file mode 100644
index ..39ff7f0438e7
--- /dev/null
+++ b/Documentation/siphash.txt
@@ -0,0 +1,79 @@
+ SipHash - a short input PRF
+----------------------------
+Written by Jason A. Donenfeld 
+
+SipHash is a cryptographically secure PRF -- a keyed hash function -- that
+performs very well for short inputs, hence the name. It was designed by
+cryptographers Daniel J. Bernstein and Jean-Philippe Aumasson. It is intended
+as a replacement for some uses of: `jhash`, `md5_transform`, `sha_transform`,
+and so forth.
+
+SipHash takes a secret key filled with randomly generated numbers and either
+an input buffer or several input integers. It spits out an integer that is
+indistinguishable from random. You may then use that integer as part of secure
+sequence numbers, secure cookies, or mask it off for use in a hash table.
+
+1. Generating a key
+
+Keys should always be generated from a cryptographically secure source of
+random numbers, either using get_random_bytes or get_random_once:
+
+siphash_key_t key;
+get_random_bytes(key, sizeof(key));
+
+If you're not deriving your key from here, you're doing it wrong.
+
+2. Using the functions
+
+There are two variants of the function, one that takes a list of integers, and
+one that takes a buffer:
+
+u64 siphash(const void *data, size_t len, siphash_key_t key);
+
+And:
+
+u64 siphash_1u64(u64, siphash_key_t key);
+u64 siphash_2u64(u64, u64, siphash_key_t key);
+u64 siphash_3u64(u64, u64, u64, siphash_key_t key);
+u64 siphash_4u64(u64, u64, u64, u64, siphash_key_t key);
+u64 siphash_1u32(u32, siphash_key_t key);
+u64 siphash_2u32(u32, u32, siphash_key_t key);
+u64 siphash_3u32(u32, u32, u32, siphash_key_t key);
+u64 siphash_4u32(u32, u32, u32, u32, siphash_key_t key);
+
+If you pass the generic siphash function something of a constant length, it
+will constant fold at compile-time and automatically choose one of the
+optimized functions.

[PATCH v7 0/6] The SipHash Patchset

2016-12-21 Thread Jason A. Donenfeld
Hey folks,

Again we've made huge progress, with this latest version now shipping
Jean-Philippe Aumasson's HalfSipHash, which should be much more competitive
with jhash (in addition to being more secure, of course).

There are dozens of little cleanups and improvements right and left throughout
this series, so I urge you to take a look at the whole thing. I've tried to
take into consideration lots of concerns and suggestions from many of you
over the last week.

There is also now documentation! And the test suite now has full coverage of
all functions.

Finally, there's been some significant benchmarking, and now a few commit
messages have some performance numbers.

Please send your Reviewed-by lines as you see fit.

Regards,
Jason

Jason A. Donenfeld (6):
  siphash: add cryptographically secure PRF
  secure_seq: use SipHash in place of MD5
  random: use SipHash in place of MD5
  md5: remove from lib and only live in crypto
  syncookies: use SipHash in place of SHA1
  siphash: implement HalfSipHash1-3 for hash tables

 Documentation/siphash.txt | 154 +
 MAINTAINERS               |   7 +
 crypto/md5.c              |  95 +++-
 drivers/char/random.c     |  84 ---
 include/linux/random.h    |   1 -
 include/linux/siphash.h   | 133 +++
 init/main.c               |   1 -
 lib/Kconfig.debug         |   6 +-
 lib/Makefile              |   7 +-
 lib/md5.c                 |  95 
 lib/siphash.c             | 548 ++
 lib/test_siphash.c        | 208 ++
 net/core/secure_seq.c     | 135 +---
 net/ipv4/syncookies.c     |  20 +-
 net/ipv6/syncookies.c     |  37 ++--
 15 files changed, 1274 insertions(+), 257 deletions(-)
 create mode 100644 Documentation/siphash.txt
 create mode 100644 include/linux/siphash.h
 delete mode 100644 lib/md5.c
 create mode 100644 lib/siphash.c
 create mode 100644 lib/test_siphash.c

-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage

2016-12-21 Thread Jason A. Donenfeld
On Wed, Dec 21, 2016 at 11:27 PM, Theodore Ts'o  wrote:
> And "with enough registers" includes ARM and MIPS, right?  So the only
> real problem is 32-bit x86, and you're right, at that point, only
> people who might care are people who are using a space-radiation
> hardened 386 --- and they're not likely to be doing high throughput
> TCP connections.  :-)

Plus the benchmark was bogus anyway, and when I built a more specific
harness -- actually comparing the TCP sequence number functions --
SipHash was faster than MD5, even on register starved x86. So I think
we're fine and this chapter of the discussion can come to a close, in
order to move on to more interesting things.
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: HalfSipHash Acceptable Usage

2016-12-21 Thread Theodore Ts'o
On Wed, Dec 21, 2016 at 01:37:51PM -0500, George Spelvin wrote:
> SipHash annihilates the competition on 64-bit superscalar hardware.
> SipHash dominates the field on 64-bit in-order hardware.
> SipHash wins easily on 32-bit hardware *with enough registers*.
> On register-starved 32-bit machines, it really struggles.

And "with enough registers" includes ARM and MIPS, right?  So the only
real problem is 32-bit x86, and you're right, at that point, only
people who might care are people who are using a space-radiation
hardened 386 --- and they're not likely to be doing high throughput
TCP connections.  :-)

- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv2] crypto: testmgr: Use heap buffer for acomp test input

2016-12-21 Thread Laura Abbott

Christopher Covington reported a crash on aarch64 on recent Fedora
kernels:

kernel BUG at ./include/linux/scatterlist.h:140!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
Modules linked in:
CPU: 2 PID: 752 Comm: cryptomgr_test Not tainted 4.9.0-11815-ge93b1cc #162
Hardware name: linux,dummy-virt (DT)
task: 80007c650080 task.stack: 8891
PC is at sg_init_one+0xa0/0xb8
LR is at sg_init_one+0x24/0xb8
...
[] sg_init_one+0xa0/0xb8
[] test_acomp+0x10c/0x438
[] alg_test_comp+0xb0/0x118
[] alg_test+0x17c/0x2f0
[] cryptomgr_test+0x44/0x50
[] kthread+0xf8/0x128
[] ret_from_fork+0x10/0x50

The test vectors used for input are part of the kernel image. These
inputs are passed as a buffer to sg_init_one which eventually blows up
with BUG_ON(!virt_addr_valid(buf)). On arm64, virt_addr_valid returns
false for the kernel image since virt_to_page will not return the
correct page. Fix this by copying the input vectors to heap buffer
before setting up the scatterlist.

Reported-by: Christopher Covington 
Fixes: d7db7a882deb ("crypto: acomp - update testmgr with support for acomp")
Signed-off-by: Laura Abbott 
---
 crypto/testmgr.c | 30 --
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index f616ad7..44e888b 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -1461,16 +1461,25 @@ static int test_acomp(struct crypto_acomp *tfm, struct 
comp_testvec *ctemplate,
for (i = 0; i < ctcount; i++) {
unsigned int dlen = COMP_BUF_SIZE;
int ilen = ctemplate[i].inlen;
+   void *input_vec;
 
+   input_vec = kmalloc(ilen, GFP_KERNEL);
+   if (!input_vec) {
+   ret = -ENOMEM;
+   goto out;
+   }
+
+   memcpy(input_vec, ctemplate[i].input, ilen);
memset(output, 0, dlen);
 init_completion(&result.completion);
-   sg_init_one(&src, ctemplate[i].input, ilen);
+   sg_init_one(&src, input_vec, ilen);
 sg_init_one(&dst, output, dlen);
 
req = acomp_request_alloc(tfm);
if (!req) {
pr_err("alg: acomp: request alloc failed for %s\n",
   algo);
+   kfree(input_vec);
ret = -ENOMEM;
goto out;
}
@@ -1483,6 +1492,7 @@ static int test_acomp(struct crypto_acomp *tfm, struct 
comp_testvec *ctemplate,
if (ret) {
pr_err("alg: acomp: compression failed on test %d for 
%s: ret=%d\n",
   i + 1, algo, -ret);
+   kfree(input_vec);
acomp_request_free(req);
goto out;
}
@@ -1491,6 +1501,7 @@ static int test_acomp(struct crypto_acomp *tfm, struct 
comp_testvec *ctemplate,
pr_err("alg: acomp: Compression test %d failed for %s: 
output len = %d\n",
   i + 1, algo, req->dlen);
ret = -EINVAL;
+   kfree(input_vec);
acomp_request_free(req);
goto out;
}
@@ -1500,26 +1511,37 @@ static int test_acomp(struct crypto_acomp *tfm, struct 
comp_testvec *ctemplate,
   i + 1, algo);
hexdump(output, req->dlen);
ret = -EINVAL;
+   kfree(input_vec);
acomp_request_free(req);
goto out;
}
 
+   kfree(input_vec);
acomp_request_free(req);
}
 
for (i = 0; i < dtcount; i++) {
unsigned int dlen = COMP_BUF_SIZE;
int ilen = dtemplate[i].inlen;
+   void *input_vec;
+
+   input_vec = kmalloc(ilen, GFP_KERNEL);
+   if (!input_vec) {
+   ret = -ENOMEM;
+   goto out;
+   }
 
+   memcpy(input_vec, dtemplate[i].input, ilen);
memset(output, 0, dlen);
 init_completion(&result.completion);
-   sg_init_one(&src, dtemplate[i].input, ilen);
+   sg_init_one(&src, input_vec, ilen);
 sg_init_one(&dst, output, dlen);
 
req = acomp_request_alloc(tfm);
if (!req) {
pr_err("alg: acomp: request alloc failed for %s\n",
   algo);
+   kfree(input_vec);
ret = -ENOMEM;
goto out;
}
@@ -1532,6 +1554,7 @@ static int test_acomp(struct crypto_acomp *tfm, struct 
comp_testvec *ctemplate,
if (ret) {
pr_err("alg: acomp: decompression failed on test %d for 
%s: ret=%d\n",
  

Re: algif for compression?

2016-12-21 Thread Herbert Xu
On Fri, Dec 16, 2016 at 07:41:23PM +0530, abed mohammad kamaluddin wrote:
>
> Thanks Herbert. Are there timelines or  ongoing efforts for moving
> IPcomp/Ipsec to use acomp? Or any proposals that have been or need to
> be taken up in this regard.

Someone needs to write the patches :)
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: HalfSipHash Acceptable Usage

2016-12-21 Thread Jason A. Donenfeld
On Wed, Dec 21, 2016 at 7:37 PM, George Spelvin
 wrote:
> SipHash annihilates the competition on 64-bit superscalar hardware.
> SipHash dominates the field on 64-bit in-order hardware.
> SipHash wins easily on 32-bit hardware *with enough registers*.
> On register-starved 32-bit machines, it really struggles.
>
> As I explained, in that last case, SipHash barely wins at all.
> (On a P4, it actually *loses* to MD5, not that anyone cares.  Running
> on a P4 and caring about performance are mutually exclusive.)

From the off-list discussion that examined your benchmark code, it
looks like we're going to move ahead with SipHash.
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: HalfSipHash Acceptable Usage

2016-12-21 Thread George Spelvin
Eric Dumazet wrote:
> Now I am quite confused.
>
> George said :
>> Cycles per byte on 1024 bytes of data:
>>                   Pentium  Core 2  Ivy
>>                   4        Duo     Bridge
>> SipHash-2-4       38.9     8.3     5.8
>> HalfSipHash-2-4   12.7     4.5     3.2
>> MD5               8.3      5.7     4.7
>
> That really was for 1024-byte blocks, so pretty much useless for our
> discussion?

No, they're actually quite relevant, but you have to interpret them
correctly.  I thought I explained in the text following that table,
but let me make it clearer:

To find the time to compute the SipHash of N bytes, round (N+17) up to
the next multiple of 8 bytes and multiply by the numbers above.

To find the time to compute the HalfSipHash of N bytes, round (N+9) up to
the next multiple of 4 bytes and multiply by the numbers above.

To find the time to compute the MD5 of N bytes, round (N+9) up to the
next multiple of 64 bytes and multiply by the numbers above.

It's the different rounding rules that make all the difference.  For small
input blocks, SipHash can be slower per byte yet still faster because
it hashes fewer bytes.
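
For example, take a 12-byte input on Ivy Bridge, using the per-byte
figures from the table:

	SipHash-2-4:     12+17 -> 32 bytes hashed; 32 * 5.8 = ~186 cycles
	HalfSipHash-2-4: 12+9  -> 24 bytes hashed; 24 * 3.2 =  ~77 cycles
	MD5:             12+9  -> 64 bytes hashed; 64 * 4.7 = ~301 cycles

MD5 has a better cycles-per-byte figure than SipHash on that machine,
yet it comes out slowest on the actual input because it hashes the
most bytes.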

> Reading your numbers last week, I thought SipHash was faster, but George
> numbers are giving the opposite impression.

SipHash annihilates the competition on 64-bit superscalar hardware.
SipHash dominates the field on 64-bit in-order hardware.
SipHash wins easily on 32-bit hardware *with enough registers*.
On register-starved 32-bit machines, it really struggles.

As I explained, in that last case, SipHash barely wins at all.
(On a P4, it actually *loses* to MD5, not that anyone cares.  Running
on a P4 and caring about performance are mutually exclusive.)
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: HalfSipHash Acceptable Usage

2016-12-21 Thread George Spelvin
Linus wrote:
>> How much does kernel_fpu_begin()/kernel_fpu_end() cost?
>
> It's now better than it used to be, but it's absolutely disastrous
> still. We're talking easily many hundreds of cycles. Under some loads,
> thousands.

I think I've been thoroughly dissuaded, but just to clarify one thing
that resembles a misunderstanding:

> In contrast, in reality, especially with things like "do it once or
> twice per incoming packet", you'll easily hit the absolute worst
> cases, where not only does it take a few hundred cycles to save the FP
> state, you'll then return to user space in between packets, which
> triggers the slow-path return code and reloads the FP state, which is
> another few hundred cycles plus.

Everything being discussed is per-TCP-connection overhead, *not* per
packet.  (Twice for outgoing connections, because one is to generate
the ephemeral port number.)

I know you know this, but I don't want anyone spectating to be confused
about it.
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: HalfSipHash Acceptable Usage

2016-12-21 Thread Linus Torvalds
On Wed, Dec 21, 2016 at 7:55 AM, George Spelvin
 wrote:
>
> How much does kernel_fpu_begin()/kernel_fpu_end() cost?

It's now better than it used to be, but it's absolutely disastrous
still. We're talking easily many hundreds of cycles. Under some loads,
thousands.

And I warn you already: it will _benchmark_ a hell of a lot better
than it will work in reality. In benchmarks, you'll hit all the
optimizations ("oh, I've already saved away all the FP registers, no
need to do it again").

In contrast, in reality, especially with things like "do it once or
twice per incoming packet", you'll easily hit the absolute worst
cases, where not only does it take a few hundred cycles to save the FP
state, you'll then return to user space in between packets, which
triggers the slow-path return code and reloads the FP state, which is
another few hundred cycles plus.

Similarly, in benchmarks you'll hit the "modern CPU's power on the AVX
unit and keep it powered up for a while afterwards", while in real
life you would quite easily hit the "oh, AVX is powered down because
we were idle, now it powers up at half speed which is another latency
hit _and_ the AVX unit won't run full out anyway".

Don't do it. There are basically no real situations where the AVX
state optimizations help for the kernel. We just don't have the loop
counts to make up for the problems it causes.

The one exception is likely if you're doing things like
high-throughput disk IO encryption, and then you'd be much better off
using SHA256 instead (which often has hw encryption on modern CPU's -
both x86 and ARM).

(I'm sure that you could see it on some high-throughput network
benchmark too when the benchmark entirely saturates the CPU. And then
in real life it would suck horribly for all the reasons above).

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage

2016-12-21 Thread Eric Dumazet
On Wed, 2016-12-21 at 11:39 -0500, Rik van Riel wrote:

> Does anybody still have a P4?
> 
> If they do, they're probably better off replacing
> it with an Atom. The reduced power bills will pay
> for replacing that P4 within a year or two.

Well, maybe they have millions of units to replace.

> 
> In short, I am not sure how important the P4
> performance numbers are, especially if we can
> improve security for everybody else...

Worth adding that the ISN or syncookie generation is less than 10% of
the actual cost of handling a problematic (having to generate ISN or
syncookie) TCP packet anyway.

So we are talking about minor potential impact for '2000-era' CPUs.

Definitely I vote for using SipHash in TCP ASAP.


--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] crypto: testmgr: Use linear alias for test input

2016-12-21 Thread Herbert Xu
On Mon, Dec 19, 2016 at 03:37:26PM -0800, Laura Abbott wrote:
> Christopher Covington reported a crash on aarch64 on recent Fedora
> kernels:
> 
> kernel BUG at ./include/linux/scatterlist.h:140!
> Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> Modules linked in:
> CPU: 2 PID: 752 Comm: cryptomgr_test Not tainted 4.9.0-11815-ge93b1cc #162
> Hardware name: linux,dummy-virt (DT)
> task: 80007c650080 task.stack: 8891
> PC is at sg_init_one+0xa0/0xb8
> LR is at sg_init_one+0x24/0xb8
> ...
> [] sg_init_one+0xa0/0xb8
> [] test_acomp+0x10c/0x438
> [] alg_test_comp+0xb0/0x118
> [] alg_test+0x17c/0x2f0
> [] cryptomgr_test+0x44/0x50
> [] kthread+0xf8/0x128
> [] ret_from_fork+0x10/0x50
> 
> The test vectors used for input are part of the kernel image. These
> inputs are passed as a buffer to sg_init_one which eventually blows up
> with BUG_ON(!virt_addr_valid(buf)). On arm64, virt_addr_valid returns
> false for the kernel image since virt_to_page will not return the
> correct page. The kernel image is also aliased to the linear map so get
> the linear alias and pass that to the scatterlist instead.
> 
> Reported-by: Christopher Covington 
> Fixes: d7db7a882deb ("crypto: acomp - update testmgr with support for acomp")
> Signed-off-by: Laura Abbott 
> ---
> x86 supports virt_addr_valid working on kernel image addresses but arm64 is
> more strict. This is the direction things have been moving with my
> CONFIG_DEBUG_VIRTUAL series for arm64 which is tightening the definition of
> __pa/__pa_symbol.

Please fix this by copying the templates to kmalloced memory like
the other test_* functions.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Test AEAD/authenc algorithms from userspace

2016-12-21 Thread Herbert Xu
On Mon, Dec 19, 2016 at 04:08:11PM +0530, Harsh Jain wrote:
> Hi Herbert,
> 
> TLS default mode of operation is MAC-then-Encrypt for Authenc algos.
> Currently the framework only supports EtM as used in IPsec. User space
> programs like openssl cannot use the af-alg interface to encrypt/decrypt
> in TLS mode.
> Are we going to support Mac-then-Encrypt mode in future kernel releases?

If someone finally adds TLS to the kernel then we'll likely do
something about it.  Otherwise you can just separate it out into
two operations via af-alg.
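
Roughly like this, as a sketch (error handling omitted, the skcipher
step only outlined; the key and buffer names are illustrative):

	#include <sys/socket.h>
	#include <linux/if_alg.h>
	#include <string.h>
	#include <unistd.h>

	static int alg_op_fd(const char *type, const char *name,
			     const void *key, int keylen)
	{
		struct sockaddr_alg sa = { .salg_family = AF_ALG };
		int tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);

		strncpy((char *)sa.salg_type, type, sizeof(sa.salg_type) - 1);
		strncpy((char *)sa.salg_name, name, sizeof(sa.salg_name) - 1);
		bind(tfm, (struct sockaddr *)&sa, sizeof(sa));
		setsockopt(tfm, SOL_ALG, ALG_SET_KEY, key, keylen);
		return accept(tfm, NULL, 0);	/* operation fd */
	}

	static void mac_then_encrypt(unsigned char *msg, int msg_len,
				     const unsigned char *mac_key)
	{
		/* step 1: append tag = HMAC-SHA1(mac_key, msg) */
		int mac = alg_op_fd("hash", "hmac(sha1)", mac_key, 20);

		write(mac, msg, msg_len);
		read(mac, msg + msg_len, 20);
		close(mac);

		/* step 2: encrypt msg||tag||padding with a second
		 * "skcipher" socket, e.g. alg_op_fd("skcipher",
		 * "cbc(aes)", enc_key, 16), passing ALG_OP_ENCRYPT and
		 * the IV via sendmsg() ancillary data. */
	}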

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage

2016-12-21 Thread Rik van Riel
On Wed, 2016-12-21 at 10:55 -0500, George Spelvin wrote:
> Actually, DJB just made a very relevant suggestion.
> 
> As I've mentioned, the 32-bit performance problems are an x86-
> specific
> problem.  ARM does very well, and other processors aren't bad at all.
> 
> SipHash fits very nicely (and runs very fast) in the MMX registers.
> 
> They're 64 bits, and there are 8 of them, so the integer registers
> can
> be reserved for pointers and loop counters and all that.  And there's
> reference code available.
> 
> How much does kernel_fpu_begin()/kernel_fpu_end() cost?

Those can be very expensive. Almost certainly not
worth it for small amounts of data.

-- 
All Rights Reversed.



Re: HalfSipHash Acceptable Usage

2016-12-21 Thread Jason A. Donenfeld
Hi Eric,

On Wed, Dec 21, 2016 at 4:56 PM, Eric Dumazet  wrote:
> That really was for 1024-byte blocks, so pretty much useless for our
> discussion?
>
> Reading your numbers last week, I thought SipHash was faster, but George
> numbers are giving the opposite impression.
>
> I do not have a P4 to make tests, so I only can trust you or George.

I'm not sure how George came up with those numbers, but the ones I
sent are output from that benchmark function in the last email. I'd be
interested in learning this too.

As mentioned in the last email, it looks like potential 32-bit issues
are really just specific to old Intel chips. Other 32-bit
architectures do fine. So, for new kernels, even if somehow there is a
tiny performance regression (though I couldn't see one) on old
architectures, I really doubt it will affect anybody in practice.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: HalfSipHash Acceptable Usage

2016-12-21 Thread Jason A. Donenfeld
Hi George,

On Wed, Dec 21, 2016 at 4:55 PM, George Spelvin
 wrote:
> Actually, DJB just made a very relevant suggestion.
>
> As I've mentioned, the 32-bit performance problems are an x86-specific
> problem.  ARM does very well, and other processors aren't bad at all.
>
> SipHash fits very nicely (and runs very fast) in the MMX registers.
>
> They're 64 bits, and there are 8 of them, so the integer registers can
> be reserved for pointers and loop counters and all that.  And there's
> reference code available.
>
> How much does kernel_fpu_begin()/kernel_fpu_end() cost?

In my experience, these functions are only worth calling when
processing more significant amounts of data. I don't have any
benchmarks, but when I _remove_ all of these calls in a kernel,
accelerated crypto gets noticeably faster (until the system crashes).
We can measure it, though.
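
If we did experiment with the MMX idea, the usual guarded shape would
be something like this (just a sketch; siphash_mmx is hypothetical):

	u64 ip_hash(const void *data, size_t len, siphash_key_t key)
	{
		u64 hash;

		if (irq_fpu_usable()) {
			kernel_fpu_begin();
			hash = siphash_mmx(data, len, key); /* hypothetical */
			kernel_fpu_end();
		} else {
			hash = siphash(data, len, key); /* portable fallback */
		}
		return hash;
	}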

By the way, if somehow SipHash becomes acceptably fast on x86, would
you consider HalfSipHash for hash tables to be no longer needed? Or do
you suspect that HalfSipHash will always be faster even on, say,
32-bit ARM.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: HalfSipHash Acceptable Usage

2016-12-21 Thread Eric Dumazet
On Wed, 2016-12-21 at 15:42 +0100, Jason A. Donenfeld wrote:
> Hi Eric,
> 
> I computed performance numbers for both 32-bit and 64-bit using the
> actual functions in which talking about replacing MD5 with SipHash.
> The basic harness is here [1] if you're curious. SipHash was a pretty
> clear winner for both cases.
> 
> x86_64:
> [1.714302] secure_tcpv6_sequence_number_md5# cycles: 102373398
> [1.747685] secure_tcp_sequence_number_md5# cycles: 92042258
> [1.773522] secure_tcpv6_sequence_number_siphash# cycles: 70786533
> [1.798701] secure_tcp_sequence_number_siphash# cycles: 68941043
> 
> x86:
> [1.635749] secure_tcpv6_sequence_number_md5# cycles: 106016335
> [1.670259] secure_tcp_sequence_number_md5# cycles: 95670512
> [1.708387] secure_tcpv6_sequence_number_siphash# cycles: 105988635
> [1.740264] secure_tcp_sequence_number_siphash# cycles: 88225395
> 
> >>> 102373398 > 70786533
> True
> >>> 92042258 > 68941043
> True
> >>> 106016335 > 105988635
> True
> >>> 95670512 > 88225395
> True
> 
> While MD5 is probably faster for some kind of large-data
> cycles-per-byte, due to its 64-byte internal state, SipHash -- the
> "Sip" part standing "Short Input PRF" -- is fast for shorter inputs.
> In practice with the functions we're talking about replacing, there's
> no need to hash 64 bytes. So, SipHash comes out faster and more
> secure.
> 
> I also haven't begun to look closely at the assembly my SipHash
> implementation is generating, which means there's still room for even
> more performance improvements.
> 
> Jason
> 
> 
> [1] 
> https://git.zx2c4.com/linux-dev/tree/net/core/secure_seq.c?h=siphash-bench#n194

Now I am quite confused.

George said :

> Cycles per byte on 1024 bytes of data:
>                   Pentium  Core 2  Ivy
>                   4        Duo     Bridge
> SipHash-2-4       38.9     8.3     5.8
> HalfSipHash-2-4   12.7     4.5     3.2
> MD5               8.3      5.7     4.7


That really was for 1024-byte blocks, so pretty much useless for our
discussion?

Reading your numbers last week, I thought SipHash was faster, but George
numbers are giving the opposite impression.

I do not have a P4 to make tests, so I only can trust you or George.

Thanks.


--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: HalfSipHash Acceptable Usage

2016-12-21 Thread George Spelvin
Actually, DJB just made a very relevant suggestion.

As I've mentioned, the 32-bit performance problems are an x86-specific
problem.  ARM does very well, and other processors aren't bad at all.

SipHash fits very nicely (and runs very fast) in the MMX registers.

They're 64 bits, and there are 8 of them, so the integer registers can
be reserved for pointers and loop counters and all that.  And there's
reference code available.

How much does kernel_fpu_begin()/kernel_fpu_end() cost?

Although there are a lot of pre-MMX x86es in embedded control applications,
I don't think anyone is worried about their networking performance.
(Specifically, all of this affects only connection setup, not throughput 
on established connections.)
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: HalfSipHash Acceptable Usage

2016-12-21 Thread Jason A. Donenfeld
Hi Eric,

I computed performance numbers for both 32-bit and 64-bit using the
actual functions in which we're talking about replacing MD5 with SipHash.
The basic harness is here [1] if you're curious. SipHash was a pretty
clear winner for both cases.

x86_64:
[1.714302] secure_tcpv6_sequence_number_md5# cycles: 102373398
[1.747685] secure_tcp_sequence_number_md5# cycles: 92042258
[1.773522] secure_tcpv6_sequence_number_siphash# cycles: 70786533
[1.798701] secure_tcp_sequence_number_siphash# cycles: 68941043

x86:
[1.635749] secure_tcpv6_sequence_number_md5# cycles: 106016335
[1.670259] secure_tcp_sequence_number_md5# cycles: 95670512
[1.708387] secure_tcpv6_sequence_number_siphash# cycles: 105988635
[1.740264] secure_tcp_sequence_number_siphash# cycles: 88225395

>>> 102373398 > 70786533
True
>>> 92042258 > 68941043
True
>>> 106016335 > 105988635
True
>>> 95670512 > 88225395
True

While MD5 is probably faster for some kind of large-data
cycles-per-byte, due to its 64-byte internal state, SipHash -- the
"Sip" part standing "Short Input PRF" -- is fast for shorter inputs.
In practice with the functions we're talking about replacing, there's
no need to hash 64 bytes. So, SipHash comes out faster and more
secure.

I also haven't begun to look closely at the assembly my SipHash
implementation is generating, which means there's still room for even
more performance improvements.

Jason


[1] 
https://git.zx2c4.com/linux-dev/tree/net/core/secure_seq.c?h=siphash-bench#n194
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: HalfSipHash Acceptable Usage

2016-12-21 Thread Jason A. Donenfeld
Hi George,

On Wed, Dec 21, 2016 at 7:34 AM, George Spelvin
 wrote:
> In fact, I have an idea.  Allow me to make the following concrete
> suggestion for using HalfSipHash with 128 bits of key material:
>
> - 64 bits are used as the key.
> - The other 64 bits are used as an IV which is prepended to
>   the message to be hashed.
>
> As a matter of practical implementation, we precompute the effect
> of hashing the IV and store the 128-bit HalfSipHash state, which
> is used just like a 128-bit key.
>
> Because of the way it is constructed, it is obviously no weaker than
> standard HalfSipHash's 64-bit security claim.
>
> I don't know the security of this, and it's almost certainly weaker than
> 128 bits, but I *hope* it's at least a few bits stronger than 64 bits.
> 80 would be enough to dissuade any attacker without a six-figure budget
> (that's per attack, not a one-time capital investment).  96 would be
> ample for our purposes.
>
> What I do know is that it makes a brute-force attack without
> significant cryptanalytic effort impossible.

Depends who's doing the cryptanalytic effort I guess.

Please don't roll your own crypto. It's a dangerous road. Putting
homebrew crypto into the kernel would be an error. Let's stick with
the constructions and security margins that the cryptographers give
us. JP made that fairly clear, I thought.

There are already people working on this problem who undergo peer
review and have devoted their careers to solving it. One result for
small systems that need 128-bit security is Chaskey, which you can go
read about if you're curious.

Jason


Re: [PATCH v3 2/3] drivers: crypto: Add the Virtual Function driver for CPT

2016-12-21 Thread Corentin Labbe
Hello

I have some comments inline.

On Wed, Dec 21, 2016 at 11:56:12AM +, george.cher...@cavium.com wrote:
> From: George Cherian 
> 
> Enable the CPT VF driver. CPT is the cryptographic Accelaration Unit

typo acceleration

[...]
> +static inline void update_input_data(struct cpt_request_info *req_info,
> +  struct scatterlist *inp_sg,
> +  u32 nbytes, u32 *argcnt)
> +{
> + req_info->req.dlen += nbytes;
> +
> + while (nbytes) {
> + u32 len = min(nbytes, inp_sg->length);
> + u8 *ptr = page_address(sg_page(inp_sg)) + inp_sg->offset;

You could use sg_virt() instead.

But have you tested your accelerator with user space data (via
cryptodev/AF_ALG)?
If I remember correctly, you should use kmap() instead of this direct
memory access, since such pages may come from highmem.
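
For illustration, the two points above could look like this (a sketch
only; cpt_sg_map()/cpt_sg_unmap() are hypothetical helpers, not part of
the patch):

#include <linux/scatterlist.h>
#include <linux/highmem.h>

/* sg_virt(sg) is exactly page_address(sg_page(sg)) + sg->offset, but it
 * is only valid for lowmem pages; kmap() also covers highmem pages,
 * e.g. user buffers arriving via AF_ALG. */
static void *cpt_sg_map(struct scatterlist *sg)
{
	return kmap(sg_page(sg)) + sg->offset;
}

static void cpt_sg_unmap(struct scatterlist *sg)
{
	kunmap(sg_page(sg));
}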

[...]
> +static inline u32 cvm_enc_dec(struct ablkcipher_request *req, u32 enc,
> +   u32 cipher_type)
> +{
> + struct crypto_ablkcipher *tfm = crypto_ablkcipher_reqtfm(req);
> + struct cvm_enc_ctx *ctx = crypto_ablkcipher_ctx(tfm);
> + u32 key_type = AES_128_BIT;
> + struct cvm_req_ctx *rctx = ablkcipher_request_ctx(req);
> + u32 enc_iv_len = crypto_ablkcipher_ivsize(tfm);
> + struct fc_context *fctx = &rctx->fctx;
> + struct cpt_request_info *req_info = &rctx->cpt_req;
> + void *cdev = NULL;
> + u32 status = -1;

Doable, but dangerous.
Furthermore, cptvf_do_request() returns int, so why use u32?

[...]
> +void cvm_enc_dec_exit(struct crypto_tfm *tfm)
> +{
> + return;
> +}

So you could remove all references to this function.

[...]
> +static inline int cav_register_algs(void)
> +{
> + int err = 0;
> +
> + err = crypto_register_algs(algs, ARRAY_SIZE(algs));
> + if (err) {
> + pr_err("Error in aes module init %d\n", err);
> + return -1;

This is not a standard error code
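
A sketch of the fix, propagating the real error code instead of -1:

	err = crypto_register_algs(algs, ARRAY_SIZE(algs));
	if (err) {
		pr_err("Error in aes module init %d\n", err);
		return err;
	}
	return 0;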

[...]
> diff --git a/drivers/crypto/cavium/cpt/cptvf_algs.h 
> b/drivers/crypto/cavium/cpt/cptvf_algs.h
> new file mode 100644
> index 000..fcb287b
> --- /dev/null
> +++ b/drivers/crypto/cavium/cpt/cptvf_algs.h
[...]
> +
> +u32 cptvf_do_request(void *cptvf, struct cpt_request_info *req);

Later this function is declared as returning int.
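
i.e. the declaration could be kept consistent with the definition
(sketch):

int cptvf_do_request(void *cptvf, struct cpt_request_info *req);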

[...]
> +static int cptvf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> +{
> + struct device *dev = &pdev->dev;
> + struct cpt_vf *cptvf;
> + int err;
> +
> + cptvf = devm_kzalloc(dev, sizeof(struct cpt_vf), GFP_KERNEL);

Use sizeof(*cptvf), and run checkpatch.
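
That is (sketch):

	cptvf = devm_kzalloc(dev, sizeof(*cptvf), GFP_KERNEL);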

[...]
> +static int setup_sgio_components(struct cpt_vf *cptvf, struct buf_ptr *list,
> +  int buf_count, u8 *buffer)
> +{
> + int ret = 0, i, j;
> + int components;
> + struct sglist_component *sg_ptr = NULL;
> + struct pci_dev *pdev = cptvf->pdev;
> +
> + if (unlikely(!list)) {
> + pr_err("Input List pointer is NULL\n");
> + ret = -EFAULT;
> + return ret;

You could directly return -EFAULT and use dev_err()
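
A sketch of what that would look like (pdev is already in scope here):

	if (unlikely(!list)) {
		dev_err(&pdev->dev, "Input list pointer is NULL\n");
		return -EFAULT;
	}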

> + }
> +
> + for (i = 0; i < buf_count; i++) {
> + if (likely(list[i].vptr)) {
> + list[i].dma_addr = dma_map_single(&pdev->dev,
> +   list[i].vptr,
> +   list[i].size,
> +   DMA_BIDIRECTIONAL);
> + if (unlikely(dma_mapping_error(&pdev->dev,
> +list[i].dma_addr))) {
> + pr_err("DMA map kernel buffer failed for component: %d\n",
> +i);

Use dev_err

[...]
> + u16 g_sz_bytes = 0, s_sz_bytes = 0;
> + int ret = 0;
> + struct pci_dev *pdev = cptvf->pdev;
> +
> + if (req->incnt > MAX_SG_IN_CNT || req->outcnt > MAX_SG_OUT_CNT) {
> + pr_err("Requestes SG components are higher than supported\n");

Typo: "Requested". And use dev_err().

In all the files there are pr_* calls that would be better as dev_* calls.

> + ret = -EINVAL;
> + goto  scatter_gather_clean;
> + }
> +
> + /* Setup gather (input) components */
> + g_sz_bytes = ((req->incnt + 3) / 4) * sizeof(struct sglist_component);
> + info->gather_components = kzalloc((g_sz_bytes), GFP_KERNEL);

Unnecessary parentheses.

> + if (!info->gather_components) {
> + ret = -ENOMEM;
> + goto  scatter_gather_clean;
> + }
> +
> + ret = setup_sgio_components(cptvf, req->in,
> + req->incnt,
> + info->gather_components);
> + if (ret) {
> + pr_err("Failed to setup gather list\n");
> + ret = -EFAULT;
> + goto  scatter_gather_clean;
> + }
> +
> + /* Setup scatter (output) components */
> + s_sz_bytes = ((req->outcnt + 3) / 4) * sizeof(struct sglist_component);
> + 

Re: dm-crypt optimization

2016-12-21 Thread Milan Broz
On 12/20/2016 10:41 AM, Binoy Jayan wrote:
> At a high level the goal is to maximize the size of data blocks that get
> passed to hardware accelerators, minimizing the overhead from setting up and
> tearing down operations in the hardware. Currently dm-crypt itself is a big
> blocker as it manually implements ESSIV and similar algorithms which allow
> per-block encryption of the data, so the low level operations from the crypto
> API can only operate on a single block. This is done because currently the
> crypto API doesn't have software implementations of these algorithms itself,
> so dm-crypt can't rely on it being able to provide the functionality. The
> plan to address this was to provide some software implementations in the
> crypto API, then update dm-crypt to rely on those. Even for a pure software
> implementation with no hardware acceleration that should hopefully provide a
> small optimization, as we need to call into the crypto API less often, but
> it's likely to be marginal given the overhead of crypto; the real win would
> be on a system that has an accelerator that can replace the software
> implementation.
> 
> Currently dm-crypt handles data only in single blocks. This means that it
> can't make good use of hardware cryptography engines, since there is an
> overhead to each transaction with the engine but transfers must be split
> into block sized chunks. Allowing the transfer of larger blocks, e.g. a
> 'struct bio', could mitigate these costs and could improve performance in
> operating systems with encrypted filesystems. Although Qualcomm chipsets
> support another variant of the device-mapper, dm-req-crypt, it is neither
> generic nor in a mainline-able state. Also, it only supports the 'XTS-AES'
> mode of encryption and is not compatible with other modes supported by
> dm-crypt.

So the core problem is that your crypto accelerator can operate efficiently
only with bigger batch sizes.

How big do blocks need to be for your crypto hw to operate more efficiently?
What about 4k blocks (no batches) -- could that be a usable trade-off?

With some (backward incompatible) changes in the LUKS format I would like to
see support for encryption blocks equal to the sector size, which for a 4k
drive basically means a 4k encryption block.
(This should decrease overhead; currently everything is processed in 512-byte
blocks only, so a 4k page costs eight separate crypto operations.)

Support for bigger block sizes would be unsafe without an additional mechanism
that provides atomic writes of multiple sectors. (Maybe that applies to 4k as
well on some devices, though...)

The above is not arguing against your proposal; I am just curious whether this
is enough to provide better performance on your hw accelerator or not.

Milan

> However, there are some challenges and a few possibilities to address this. I
> request you to provide your suggestions on whether the points mentioned below
> make sense and whether this could be done differently.
> 
> 1. Move the 'real' IV generation algorithms to the crypto layer (e.g. essiv).
> 2. Increase the 'length' of the scatterlist nodes used in the crypto api. It
>    can be made equal to the size of a main memory segment (as defined in
>    'struct bio') as they are physically contiguous.
> 3. Multiple segments in a 'struct bio' can be represented as a scatterlist of
>    all the segments in that 'struct bio'.
> 
> 4. Move the algorithms 'lmk' and 'tcw' (which are IVs combined with hacks to
>    the cbc mode) into customized cbc algorithms, implemented in separate
>    files (e.g. cbc_lmk.c/cbc_tcw.c). As Milan suggested, these can't be
>    treated as real IVs, as they include hacks to the cbc mode (and directly
>    manipulate the encrypted data).
> 
> 5. Move the key selection logic to user space, or always assume a keycount
>    of '1' (as mentioned in the dm-crypt param format below), so that key
>    selection does not have to depend on the sector number. This is necessary
>    because the key is otherwise selected based on the sector number (see the
>    sketch at the end of this mail):
> 
>    key_index = sector & (key_count - 1)
> 
>    If the block size for scatterlist nodes is increased beyond the sector
>    boundary (which is what we plan to achieve, for performance), the key used
>    for a single cipher operation cannot change at the sector level.
> 
>dm-crypt param format : cipher[:keycount]-mode-iv:ivopts
>Example   : aes:2-cbc-essiv:sha256
> 
>Also as Milan suggested, it is not wise to move the key selection logic to
>the crypto layer as it will prevent any changes to the key structure later.
> 
> The following is a reference to an earlier patchset. It had the cipher mode
> 'cbc' mixed up with the IV algorithms, which is usually not the preferred way.
> 
> Reference:
> https://lkml.org/lkml/2016/12/13/65
> https://lkml.org/lkml/2016/12/13/66
> 
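
For concreteness, the key selection in point 5 above can be shown
standalone (a sketch, not dm-crypt code; key_count = 2 mirrors the
"aes:2-cbc-essiv:sha256" example):

#include <stdio.h>

/* With key_count a power of two, consecutive sectors round-robin
 * across keys, so a request spanning more than one sector cannot be
 * issued under a single key. */
int main(void)
{
	unsigned long long sector;
	const unsigned key_count = 2; /* must be a power of two */

	for (sector = 0; sector < 4; sector++)
		printf("sector %llu -> key %llu\n",
		       sector, sector & (key_count - 1));
	return 0;
}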



[PATCH v3 3/3] drivers: crypto: Enable CPT options crypto for build

2016-12-21 Thread george.cherian
From: George Cherian 

Add the CPT options to the crypto Kconfig and update the
crypto Makefile.

Signed-off-by: George Cherian 
Reviewed-by: David Daney 
---
 drivers/crypto/Kconfig  | 1 +
 drivers/crypto/Makefile | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/crypto/Kconfig b/drivers/crypto/Kconfig
index 4d2b81f..15f9040 100644
--- a/drivers/crypto/Kconfig
+++ b/drivers/crypto/Kconfig
@@ -484,6 +484,7 @@ config CRYPTO_DEV_MXS_DCP
  will be called mxs-dcp.
 
 source "drivers/crypto/qat/Kconfig"
+source "drivers/crypto/cavium/cpt/Kconfig"
 
 config CRYPTO_DEV_QCE
tristate "Qualcomm crypto engine accelerator"
diff --git a/drivers/crypto/Makefile b/drivers/crypto/Makefile
index ad7250f..dd33290 100644
--- a/drivers/crypto/Makefile
+++ b/drivers/crypto/Makefile
@@ -32,3 +32,4 @@ obj-$(CONFIG_CRYPTO_DEV_VMX) += vmx/
 obj-$(CONFIG_CRYPTO_DEV_SUN4I_SS) += sunxi-ss/
 obj-$(CONFIG_CRYPTO_DEV_ROCKCHIP) += rockchip/
 obj-$(CONFIG_CRYPTO_DEV_CHELSIO) += chelsio/
+obj-$(CONFIG_CRYPTO_DEV_CPT) += cavium/cpt/
-- 
2.1.4



[PATCH v3 2/3] drivers: crypto: Add the Virtual Function driver for CPT

2016-12-21 Thread george.cherian
From: George Cherian 

Enable the CPT VF driver. CPT is the Cryptographic Acceleration Unit
in Octeon-tx series of processors.

Signed-off-by: George Cherian 
Reviewed-by: David Daney 
---
 drivers/crypto/cavium/cpt/Makefile   |   3 +-
 drivers/crypto/cavium/cpt/cptvf.h| 135 
 drivers/crypto/cavium/cpt/cptvf_algs.c   | 424 
 drivers/crypto/cavium/cpt/cptvf_algs.h   | 110 
 drivers/crypto/cavium/cpt/cptvf_main.c   | 945 +++
 drivers/crypto/cavium/cpt/cptvf_mbox.c   | 205 ++
 drivers/crypto/cavium/cpt/cptvf_reqmanager.c | 581 
 drivers/crypto/cavium/cpt/request_manager.h  | 147 +
 8 files changed, 2549 insertions(+), 1 deletion(-)
 create mode 100644 drivers/crypto/cavium/cpt/cptvf.h
 create mode 100644 drivers/crypto/cavium/cpt/cptvf_algs.c
 create mode 100644 drivers/crypto/cavium/cpt/cptvf_algs.h
 create mode 100644 drivers/crypto/cavium/cpt/cptvf_main.c
 create mode 100644 drivers/crypto/cavium/cpt/cptvf_mbox.c
 create mode 100644 drivers/crypto/cavium/cpt/cptvf_reqmanager.c
 create mode 100644 drivers/crypto/cavium/cpt/request_manager.h

diff --git a/drivers/crypto/cavium/cpt/Makefile 
b/drivers/crypto/cavium/cpt/Makefile
index fe3d454..dbf055e 100644
--- a/drivers/crypto/cavium/cpt/Makefile
+++ b/drivers/crypto/cavium/cpt/Makefile
@@ -1,2 +1,3 @@
-obj-$(CONFIG_CAVIUM_CPT) += cptpf.o
+obj-$(CONFIG_CAVIUM_CPT) += cptpf.o cptvf.o
 cptpf-objs := cptpf_main.o cptpf_mbox.o
+cptvf-objs := cptvf_main.o cptvf_reqmanager.o cptvf_mbox.o cptvf_algs.o
diff --git a/drivers/crypto/cavium/cpt/cptvf.h 
b/drivers/crypto/cavium/cpt/cptvf.h
new file mode 100644
index 000..1cc04aa
--- /dev/null
+++ b/drivers/crypto/cavium/cpt/cptvf.h
@@ -0,0 +1,135 @@
+/*
+ * Copyright (C) 2016 Cavium, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ */
+
+#ifndef __CPTVF_H
+#define __CPTVF_H
+
+#include 
+#include "cpt_common.h"
+
+/* Default command queue length */
+#define CPT_CMD_QLEN 2046
+#define CPT_CMD_QCHUNK_SIZE 1023
+
+/* Default command timeout in seconds */
+#define CPT_COMMAND_TIMEOUT 4
#define CPT_TIMER_THOLD 0x
+#define CPT_NUM_QS_PER_VF 1
+#define CPT_INST_SIZE 64
+#define CPT_NEXT_CHUNK_PTR_SIZE 8
+
#define CPT_VF_MSIX_VECTORS 2
+#define CPT_VF_INTR_MBOX_MASK BIT(0)
+#define CPT_VF_INTR_DOVF_MASK BIT(1)
+#define CPT_VF_INTR_IRDE_MASK BIT(2)
+#define CPT_VF_INTR_NWRP_MASK BIT(3)
+#define CPT_VF_INTR_SERR_MASK BIT(4)
+#define DMA_DIRECT_DIRECT 0 /* Input DIRECT, Output DIRECT */
+#define DMA_GATHER_SCATTER 1
+#define FROM_DPTR 1
+
+/**
+ * Enumeration cpt_vf_int_vec_e
+ *
+ * CPT VF MSI-X Vector Enumeration
+ * Enumerates the MSI-X interrupt vectors.
+ */
+enum cpt_vf_int_vec_e {
+   CPT_VF_INT_VEC_E_MISC = 0x00,
+   CPT_VF_INT_VEC_E_DONE = 0x01
+};
+
+struct command_chunk {
+   u8 *head;
+   dma_addr_t dma_addr;
+   u32 size; /* Chunk size, max CPT_INST_CHUNK_MAX_SIZE */
+   struct hlist_node nextchunk;
+};
+
+struct command_queue {
+   spinlock_t lock; /* command queue lock */
+   u32 idx; /* Command queue host write idx */
+   u32 nchunks; /* Number of command chunks */
+   struct command_chunk *qhead;/* Command queue head, instructions
+* are inserted here
+*/
+   struct hlist_head chead;
+};
+
+struct command_qinfo {
+   u32 cmd_size;
+   u32 qchunksize; /* Command queue chunk size */
+   struct command_queue queue[CPT_NUM_QS_PER_VF];
+};
+
+struct pending_entry {
+   u8 busy; /* Entry status (free/busy) */
+
+   volatile u64 *completion_addr; /* Completion address */
+   void *post_arg;
+   void (*callback)(int, void *); /* Kernel ASYNC request callback */
+   void *callback_arg; /* Kernel ASYNC request callback arg */
+};
+
+struct pending_queue {
+   struct pending_entry *head; /* head of the queue */
+   u32 front; /* Process work from here */
+   u32 rear; /* Append new work here */
+   atomic64_t pending_count;
+   spinlock_t lock; /* Queue lock */
+};
+
+struct pending_qinfo {
+   u32 nr_queues;  /* Number of queues supported */
+   u32 qlen; /* Queue length */
+   struct pending_queue queue[CPT_NUM_QS_PER_VF];
+};
+
+#define for_each_pending_queue(qinfo, q, i)\
+   for (i = 0, q = &qinfo->queue[i]; i < qinfo->nr_queues; i++, \
+q = &qinfo->queue[i])
+
+struct cpt_vf {
+   u16 flags; /* Flags to hold device status bits */
+   u8 vfid; /* Device Index 0...CPT_MAX_VF_NUM */
+   u8 vftype; /* VF type of SE_TYPE(1) or AE_TYPE(1) */
+   u8 vfgrp; /* VF group (0 - 8) */
+   u8 node; /* Operating node: Bits (46:44) in BAR0 address */
+   u8 

[PATCH v3 1/3] drivers: crypto: Add Support for Octeon-tx CPT Engine

2016-12-21 Thread george.cherian
From: George Cherian 

Enable the Physical Function driver for the Cavium Crypto Engine (CPT)
found in the Octeon-tx series of SoCs. CPT is the Cryptographic Acceleration
Unit. CPT includes microcoded GigaCypher symmetric engines (SEs) and
asymmetric engines (AEs).

Signed-off-by: George Cherian 
Reviewed-by: David Daney 
---
 drivers/crypto/cavium/cpt/Kconfig|  16 +
 drivers/crypto/cavium/cpt/Makefile   |   2 +
 drivers/crypto/cavium/cpt/cpt_common.h   | 158 +++
 drivers/crypto/cavium/cpt/cpt_hw_types.h | 658 +
 drivers/crypto/cavium/cpt/cptpf.h|  69 +++
 drivers/crypto/cavium/cpt/cptpf_main.c   | 703 +++
 drivers/crypto/cavium/cpt/cptpf_mbox.c   | 163 +++
 7 files changed, 1769 insertions(+)
 create mode 100644 drivers/crypto/cavium/cpt/Kconfig
 create mode 100644 drivers/crypto/cavium/cpt/Makefile
 create mode 100644 drivers/crypto/cavium/cpt/cpt_common.h
 create mode 100644 drivers/crypto/cavium/cpt/cpt_hw_types.h
 create mode 100644 drivers/crypto/cavium/cpt/cptpf.h
 create mode 100644 drivers/crypto/cavium/cpt/cptpf_main.c
 create mode 100644 drivers/crypto/cavium/cpt/cptpf_mbox.c

diff --git a/drivers/crypto/cavium/cpt/Kconfig 
b/drivers/crypto/cavium/cpt/Kconfig
new file mode 100644
index 000..247f1cb
--- /dev/null
+++ b/drivers/crypto/cavium/cpt/Kconfig
@@ -0,0 +1,16 @@
+#
+# Cavium crypto device configuration
+#
+
+config CRYPTO_DEV_CPT
+   tristate
+
+config CAVIUM_CPT
+   tristate "Cavium Cryptographic Accelerator driver"
+   depends on ARCH_THUNDER
+   select CRYPTO_DEV_CPT
+   help
+ Support for Cavium CPT block found in octeon-tx series of
+ processors.
+
+ To compile this as a module, choose M here.
diff --git a/drivers/crypto/cavium/cpt/Makefile 
b/drivers/crypto/cavium/cpt/Makefile
new file mode 100644
index 000..fe3d454
--- /dev/null
+++ b/drivers/crypto/cavium/cpt/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_CAVIUM_CPT) += cptpf.o
+cptpf-objs := cptpf_main.o cptpf_mbox.o
diff --git a/drivers/crypto/cavium/cpt/cpt_common.h 
b/drivers/crypto/cavium/cpt/cpt_common.h
new file mode 100644
index 000..ede612f
--- /dev/null
+++ b/drivers/crypto/cavium/cpt/cpt_common.h
@@ -0,0 +1,158 @@
+/*
+ * Copyright (C) 2016 Cavium, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ */
+
+#ifndef __CPT_COMMON_H
+#define __CPT_COMMON_H
+
+#include 
+#include 
+#include 
+
+#include "cpt_hw_types.h"
+
+/* Device ID */
+#define CPT_81XX_PCI_PF_DEVICE_ID 0xa040
+#define CPT_81XX_PCI_VF_DEVICE_ID 0xa041
+
+/* flags to indicate the features supported */
+#define CPT_FLAG_MSIX_ENABLED BIT(0)
+#define CPT_FLAG_SRIOV_ENABLED BIT(1)
+#define CPT_FLAG_VF_DRIVER BIT(2)
+#define CPT_FLAG_DEVICE_READY BIT(3)
+
+#define cpt_msix_enabled(cpt) ((cpt)->flags & CPT_FLAG_MSIX_ENABLED)
+#define cpt_sriov_enabled(cpt) ((cpt)->flags & CPT_FLAG_SRIOV_ENABLED)
+#define cpt_vf_driver(cpt) ((cpt)->flags & CPT_FLAG_VF_DRIVER)
+#define cpt_device_ready(cpt) ((cpt)->flags & CPT_FLAG_DEVICE_READY)
+
+#define CPT_MBOX_MSG_TYPE_ACK 1
+#define CPT_MBOX_MSG_TYPE_NACK 2
+#define CPT_MBOX_MSG_TIMEOUT 2000
+#define VF_STATE_DOWN 0
+#define VF_STATE_UP 1
+
+/*
+ * CPT Registers map for 81xx
+ */
+
+/* PF registers */
+#define CPTX_PF_CONSTANTS(a) (0x0ll + ((u64)(a) << 36))
+#define CPTX_PF_RESET(a) (0x100ll + ((u64)(a) << 36))
+#define CPTX_PF_DIAG(a) (0x120ll + ((u64)(a) << 36))
+#define CPTX_PF_BIST_STATUS(a) (0x160ll + ((u64)(a) << 36))
+#define CPTX_PF_ECC0_CTL(a) (0x200ll + ((u64)(a) << 36))
+#define CPTX_PF_ECC0_FLIP(a) (0x210ll + ((u64)(a) << 36))
+#define CPTX_PF_ECC0_INT(a) (0x220ll + ((u64)(a) << 36))
+#define CPTX_PF_ECC0_INT_W1S(a) (0x230ll + ((u64)(a) << 36))
+#define CPTX_PF_ECC0_ENA_W1S(a)(0x240ll + ((u64)(a) << 36))
+#define CPTX_PF_ECC0_ENA_W1C(a)(0x250ll + ((u64)(a) << 36))
+#define CPTX_PF_MBOX_INTX(a, b)\
+   (0x400ll + ((u64)(a) << 36) + ((b) << 3))
+#define CPTX_PF_MBOX_INT_W1SX(a, b) \
+   (0x420ll + ((u64)(a) << 36) + ((b) << 3))
+#define CPTX_PF_MBOX_ENA_W1CX(a, b) \
+   (0x440ll + ((u64)(a) << 36) + ((b) << 3))
+#define CPTX_PF_MBOX_ENA_W1SX(a, b) \
+   (0x460ll + ((u64)(a) << 36) + ((b) << 3))
+#define CPTX_PF_EXEC_INT(a) (0x500ll + 0x10ll * ((a) & 0x1))
+#define CPTX_PF_EXEC_INT_W1S(a)(0x520ll + ((u64)(a) << 36))
+#define CPTX_PF_EXEC_ENA_W1C(a)(0x540ll + ((u64)(a) << 36))
+#define CPTX_PF_EXEC_ENA_W1S(a)(0x560ll + ((u64)(a) << 36))
+#define CPTX_PF_GX_EN(a, b) \
+   (0x600ll + ((u64)(a) << 36) + ((b) << 3))
+#define CPTX_PF_EXEC_INFO(a) (0x700ll + ((u64)(a) << 36))
+#define CPTX_PF_EXEC_BUSY(a) (0x800ll + ((u64)(a) << 36))
+#define CPTX_PF_EXEC_INFO0(a) (0x900ll + ((u64)(a) << 36))

[PATCH v3 0/3] Add Support for Cavium Cryptographic Acceleration Unit

2016-12-21 Thread george.cherian
From: George Cherian 

This series adds support for the Cavium Cryptographic Acceleration Unit (CPT).
CPT is available in Cavium's Octeon-Tx SoC series.
  
The series was tested with ecryptfs and dm-crypt for in-kernel cryptographic
offload operations. This driver needs firmware to work; I will send the
firmware to linux-firmware once the driver is accepted.

Changes v2 -> v3
-- Addressed David Daney's comments
 - There is not much difference in performance between readq/writeq and
   readq_relaxed/writeq_relaxed, so switched to the readq/writeq variants.
 - Removed the useless bitfield definitions.
 - Used GENMASK and dev_to_node() instead of custom functions.
 - Used module_pci_driver instead of module_init/exit.
Changes v1 -> v2
-- Addressed a crash issue when more gather components are passed.
-- Redid the cptvf request manager.
 - Got rid of the unnecessary buffer copies.
-- s/uint*_t/u*
-- Removed unwanted macro definitions
-- Removed the redundant ROUNDUP* macros and used kernel functions instead
-- Selected the proper config option in the Kconfig file.
-- Removed some unwanted header file inclusions
-- Miscellaneous cleanup

George Cherian (3):
  drivers: crypto: Add Support for Octeon-tx CPT Engine
  drivers: crypto: Add the Virtual Function driver for CPT
  drivers: crypto: Enable CPT options crypto for build

 drivers/crypto/Kconfig   |   1 +
 drivers/crypto/Makefile  |   1 +
 drivers/crypto/cavium/cpt/Kconfig|  16 +
 drivers/crypto/cavium/cpt/Makefile   |   3 +
 drivers/crypto/cavium/cpt/cpt_common.h   | 158 +
 drivers/crypto/cavium/cpt/cpt_hw_types.h | 658 +++
 drivers/crypto/cavium/cpt/cptpf.h|  69 ++
 drivers/crypto/cavium/cpt/cptpf_main.c   | 703 
 drivers/crypto/cavium/cpt/cptpf_mbox.c   | 163 +
 drivers/crypto/cavium/cpt/cptvf.h| 135 
 drivers/crypto/cavium/cpt/cptvf_algs.c   | 424 
 drivers/crypto/cavium/cpt/cptvf_algs.h   | 110 
 drivers/crypto/cavium/cpt/cptvf_main.c   | 945 +++
 drivers/crypto/cavium/cpt/cptvf_mbox.c   | 205 ++
 drivers/crypto/cavium/cpt/cptvf_reqmanager.c | 581 
 drivers/crypto/cavium/cpt/request_manager.h  | 147 +
 16 files changed, 4319 insertions(+)
 create mode 100644 drivers/crypto/cavium/cpt/Kconfig
 create mode 100644 drivers/crypto/cavium/cpt/Makefile
 create mode 100644 drivers/crypto/cavium/cpt/cpt_common.h
 create mode 100644 drivers/crypto/cavium/cpt/cpt_hw_types.h
 create mode 100644 drivers/crypto/cavium/cpt/cptpf.h
 create mode 100644 drivers/crypto/cavium/cpt/cptpf_main.c
 create mode 100644 drivers/crypto/cavium/cpt/cptpf_mbox.c
 create mode 100644 drivers/crypto/cavium/cpt/cptvf.h
 create mode 100644 drivers/crypto/cavium/cpt/cptvf_algs.c
 create mode 100644 drivers/crypto/cavium/cpt/cptvf_algs.h
 create mode 100644 drivers/crypto/cavium/cpt/cptvf_main.c
 create mode 100644 drivers/crypto/cavium/cpt/cptvf_mbox.c
 create mode 100644 drivers/crypto/cavium/cpt/cptvf_reqmanager.c
 create mode 100644 drivers/crypto/cavium/cpt/request_manager.h

-- 
2.1.4
