[PATCH v3] ubsan: Avoid unnecessary 128-bit shifts

2019-04-04 Thread George Spelvin
If CONFIG_ARCH_SUPPORTS_INT128, s_max is 128 bits, and variable
sign-extending shifts of such a double-word data type are a non-trivial
amount of code and complexity.  Do a single-word sign-extension *before*
the cast to (s_max), greatly simplifying the object code.

Rasmus Villemoes suggested using sign_extend* from <linux/bitops.h>.

On s390 (and perhaps some other arches), gcc implements variable
128-bit shifts using an __ashrti3 helper function which the kernel
doesn't provide, causing a link error.  In that case, this patch is
a prerequisite for enabling INT128 support.  Andrey Ryabinin has given
permission for any arch that needs it to cherry-pick it so they don't
have to wait for ubsan to be merged into Linus' tree.

We *could*, alternatively, implement __ashrti3, but that becomes dead as
soon as this patch is merged, so it seems like a waste of time and its
absence discourages people from adding inefficient code.  Note that the
shifts in  (unsigned, and by a compile-time constant amount)
are simpler and generated inline.

Signed-off-by: George Spelvin 
Acked-By: Andrey Ryabinin 
Feedback-from: Rasmus Villemoes 
Cc: linux-s...@vger.kernel.org
Cc: Heiko Carstens 
---
 include/linux/bitops.h |  7 +++++++
 lib/ubsan.c            | 13 +++++--------
 2 files changed, 12 insertions(+), 8 deletions(-)

v3: Added sign_extend_long() to sign_extend{32,64} in <linux/bitops.h>.
Used sign_extend_long rather than hand-rolling sign extension.
Changed to more uniform if ... else if ... else ... structure.
v2: Eliminated redundant cast to (s_max).
Rewrote commit message without "is this the right thing to do?"
verbiage.
Incorporated ack from Andrey Ryabinin.

diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index 705f7c442691..8d33c2bfe6c5 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -157,6 +157,13 @@ static inline __s64 sign_extend64(__u64 value, int index)
return (__s64)(value << shift) >> shift;
 }
 
+static inline long sign_extend_long(unsigned long value, int index)
+{
+   if (sizeof(value) == 4)
+           return sign_extend32(value, index);
+   return sign_extend64(value, index);
+}
+
 static inline unsigned fls_long(unsigned long l)
 {
if (sizeof(l) == 4)
diff --git a/lib/ubsan.c b/lib/ubsan.c
index e4162f59a81c..24d4920317e4 100644
--- a/lib/ubsan.c
+++ b/lib/ubsan.c
@@ -88,15 +88,12 @@ static bool is_inline_int(struct type_descriptor *type)
 
 static s_max get_signed_val(struct type_descriptor *type, unsigned long val)
 {
-   if (is_inline_int(type)) {
-           unsigned extra_bits = sizeof(s_max)*8 - type_bit_width(type);
-           return ((s_max)val) << extra_bits >> extra_bits;
-   }
-
-   if (type_bit_width(type) == 64)
+   if (is_inline_int(type))
+           return sign_extend_long(val, type_bit_width(type) - 1);
+   else if (type_bit_width(type) == 64)
            return *(s64 *)val;
-
-   return *(s_max *)val;
+   else
+           return *(s_max *)val;
 }
 
 static bool val_is_negative(struct type_descriptor *type, unsigned long val)
-- 
2.20.1



[PATCH v2] ubsan: Avoid unnecessary 128-bit shifts

2019-04-02 Thread George Spelvin
If CONFIG_ARCH_SUPPORTS_INT128, s_max is 128 bits, and variable
sign-extending shifts of such a double-word data type are a non-trivial
amount of code and complexity.  Do a single-word shift *before* the cast
to (s_max), greatly simplifying the object code.

(Yes, I know "signed long" is redundant.  It's there for emphasis.)

On s390 (and perhaps some other arches), gcc implements variable
128-bit shifts using an __ashrti3 helper function which the kernel
doesn't provide, causing a link error.  In that case, this patch is
a prerequisite for enabling INT128 support.  Andrey Ryabinin has given
permission for any arch that needs it to cherry-pick it so they don't
have to wait for ubsan to be merged into Linus' tree.

We *could*, alternatively, implement __ashrti3, but that becomes dead as
soon as this patch is merged, so it seems like a waste of time and its
absence discourages people from adding inefficient code.  Note that the
shifts in  (unsigned, and by a compile-time constant amount)
are simpler and generated inline.

Signed-off-by: George Spelvin 
Acked-By: Andrey Ryabinin 
Cc: linux-s...@vger.kernel.org
Cc: Heiko Carstens 
---
 lib/ubsan.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

v1->v2: Eliminated redundant cast to (s_max).
Rewrote commit message without "is this the right thing to do?"
verbiage.
Incorporated ack from Andrey Ryabinin.

diff --git a/lib/ubsan.c b/lib/ubsan.c
index e4162f59a81c..a7eb55fbeede 100644
--- a/lib/ubsan.c
+++ b/lib/ubsan.c
@@ -89,8 +89,8 @@ static bool is_inline_int(struct type_descriptor *type)
 static s_max get_signed_val(struct type_descriptor *type, unsigned long val)
 {
if (is_inline_int(type)) {
-   unsigned extra_bits = sizeof(s_max)*8 - type_bit_width(type);
-   return ((s_max)val) << extra_bits >> extra_bits;
+   unsigned extra_bits = sizeof(val)*8 - type_bit_width(type);
+   return (signed long)val << extra_bits >> extra_bits;
}
 
if (type_bit_width(type) == 64)
-- 
2.20.1



[PATCH] ubsan: Avoid unnecessary 128-bit shifts

2019-04-01 Thread George Spelvin
Double-word sign-extending shifts by a variable amount are a
non-trivial amount of code and complexity.  Doing signed long shifts
before the cast to (s_max) greatly simplifies the object code.

(Yes, I know "signed" is redundant.  It's there for emphasis.)

The complex issue raised by this patch is that it allows s390 (at
least) to enable CONFIG_ARCH_SUPPORTS_INT128.

If you enable that option, s_max becomes 128 bits, and gcc compiles
the pre-patch code with a call to __ashrti3.  (And, on some gcc
versions, __ashlti3.)  Which isn't implemented, ergo link error.

Enabling that option allows 64-bit widening multiplies which
greatly simplify a lot of timestamp scaling code in the kernel,
so it's desirable.

But how to get there?

One option is to implement __ashrti3 on the platforms that need it.
But I'm inclined to *not* do so, because it's inefficient, rare,
and avoidable.  This patch fixes the sole instance in the entire
kernel, which will make that implementation dead code, and I think
its absence will encourage Don't Do That, Then going forward.

But if we don't implement it, we've created an awkward dependency
between patches in different subsystems, and that needs handling.

Option 1: Submit this for 5.2 and turn on INT128 for s390 in 5.3.
Option 2: Let the arches cherry-pick this patch pre-5.2.

My preference is for option 2, but that requires permission from
ubsan's owner.  Andrey?

Signed-off-by: George Spelvin 
Cc: Andrey Ryabinin 
Cc: linux-s...@vger.kernel.org
Cc: Heiko Carstens 
---
 lib/ubsan.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/ubsan.c b/lib/ubsan.c
index e4162f59a81c..43ce177a5ca7 100644
--- a/lib/ubsan.c
+++ b/lib/ubsan.c
@@ -89,8 +89,8 @@ static bool is_inline_int(struct type_descriptor *type)
 static s_max get_signed_val(struct type_descriptor *type, unsigned long val)
 {
if (is_inline_int(type)) {
-   unsigned extra_bits = sizeof(s_max)*8 - type_bit_width(type);
-   return ((s_max)val) << extra_bits >> extra_bits;
+   unsigned extra_bits = sizeof(val)*8 - type_bit_width(type);
+   return (s_max)((signed long)val << extra_bits >> extra_bits);
}
 
if (type_bit_width(type) == 64)
-- 
2.20.1



[PATCH 6/5] lib/list_sort: Fix GCC warning

2019-03-28 Thread George Spelvin
It turns out that GCC 4.9, 7.3, and 8.1 ignore the __pure
attribute on function pointers and (with the standard kernel
compile flags) emit a warning about it.

Even though it accurately describes a comparison function
(the compiler need not reload cached pointers across the call),
it doesn't actually help GCC 8.3's code generation, so just
omit it.

Signed-off-by: George Spelvin 
Fixes: 820c81be5237 ("lib/list_sort: simplify and remove MAX_LIST_LENGTH_BITS")
Cc: Andrew Morton 
Cc: Stephen Rothwell 
---
 lib/list_sort.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/lib/list_sort.c b/lib/list_sort.c
index 623a9158ac8a..b1b492e20f1d 100644
--- a/lib/list_sort.c
+++ b/lib/list_sort.c
@@ -8,12 +8,16 @@
 #include 
 
 /*
- * By declaring the compare function with the __pure attribute, we give
- * the compiler more opportunity to optimize.  Ideally, we'd use this in
- * the prototype of list_sort(), but that would involve a lot of churn
- * at all call sites, so just cast the function pointer passed in.
+ * A more accurate type for comparison functions.  Ideally, we'd use
+ * this in the prototype of list_sort(), but that would involve a lot of
+ * churn at all call sites, so just cast the function pointer passed in.
+ *
+ * This could also include __pure to give the compiler more opportunity
+ * to optimize, but that elicits an "attribute ignored" warning on
+ * GCC <= 8.1, and doesn't change GCC 8.3's code generation at all,
+ * so it's omitted.
  */
-typedef int __pure __attribute__((nonnull(2,3))) (*cmp_func)(void *,
+typedef int __attribute__((nonnull(2,3))) (*cmp_func)(void *,
struct list_head const *, struct list_head const *);
 
 /*
-- 
2.20.1



Re: [RESEND PATCH v2 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-28 Thread George Spelvin
Thank you all for the build warning report.

The warning, produced by gcc versions 4.9, 7.3, 8.1, and whatever version
Stephen Rothwell is running, is:
lib/list_sort.c:17:36: warning: __pure__ attribute ignored [-Wattributes]

The relevant code is:
10: /*
11:  * By declaring the compare function with the __pure attribute, we give
12:  * the compiler more opportunity to optimize.  Ideally, we'd use this in
13:  * the prototype of list_sort(), but that would involve a lot of churn
14:  * at all call sites, so just cast the function pointer passed in.
15:  */
16: typedef int __pure __attribute__((nonnull(2,3))) (*cmp_func)(void *,
17: struct list_head const *, struct list_head const *);

As the comment says, the purpose of the __pure attribute is to tell
the compiler that, after a call via a function pointer of this
type, memory is not clobbered and it is not necessary to reload
any cached list pointers.

This is, of course, purely optional and may be deleted harmlessly.
I just checked, and that makes no difference at all to gcc-8 code
generation, so there's no point messing with #ifdef.

There are only two questions: how to update the comment, and how
to submit the fix. I'm thinking of
/*
 * A more accurate type for comparison functions.  Ideally, we'd use
 * this in the prototype of list_sort(), but that would involve a lot of
 * churn at all call sites, so just cast the function pointer passed in.
 *
 * This could also include __pure to give the compiler more opportunity
 * to optimize, but that elicits an "attribute ignored" warning on
 * GCC <= 8.1, and doesn't change GCC 8.3's code generation at all,
 * so it's omitted.
 */

How to submit the fix: Andrew, do you prefer a replacement patch
or a small fix patch?  I'll assume the latter and send it in a few
minutes.


Re: [RFC PATCH v2] random: add get_random_max() function

2019-03-28 Thread George Spelvin
By the way, I just noticed that my fallback get_random_max64()
algorithm (if there's no __int128 type) is completely broken and
will need rewriting.

It would work if I rejected and regenerated the high half
if the low half were out of range, but that's not what it does.

The worst case is a range of 0x1001, where it would return
0x1000 half the time.

Needs rethinking to find something as simple as possible.
I'm sure I can come up with something, but I'm not averse to
suggestions if anyone has any.

(If I had a reliably fast clz/fls, that would open some
possibilities, but sigh...)
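
For what it's worth, one simple direction, sketched purely as a
suggestion (not what the patch does, and the function name is mine),
leans on exactly the fls64() wished for above: mask the raw value down
to the next power of two and reject out-of-range results.

static u64 get_random_max64_fallback(u64 range)
{
        u64 mask, r;

        if (range < 2)
                return 0;
        /* Smallest all-ones mask that covers range - 1 */
        mask = ~0ULL >> (64 - fls64(range - 1));
        do {
                r = get_random_u64() & mask;
        } while (r >= range);   /* mask + 1 < 2*range, so this accepts > 1/2 the time */
        return r;
}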


Re: [RFC PATCH] random: add get_random_max() function

2019-03-24 Thread George Spelvin
P.S. The cited paper calls your algorithm the "OpenBSD algorithm"
and has a bunch of benchmarks comparing it to others in Fisher-Yates
shuffles of sizes 1e3..1e9.

Including all overhead (base PRNG, shuffle), it's 3x slower for
32-bit operations and 8x slower for 64-bit up to arrays of size
1e6, after which cache misses slow all algorithms, reducing the
ratio.

If you want a faster division-based algorithm, the "Java algorithm"
does 1+retries divides:

unsigned long java(unsigned long s)
{
unsigned long x, r;

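/*
 * Reject x when it lands in the final, incomplete bucket: x - r is the
 * start of x's bucket of s values, and any bucket starting above -s
 * (i.e. 2^64 - s) is cut short by the end of the range, which would
 * bias the result.
 */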
do {
x = random_integer();
r = x % s;
} while (x - r > -s);
return r;
}


Re: [RFC PATCH] random: add get_random_max() function

2019-03-24 Thread George Spelvin
On Sun, 24 Mar 2019 at 21:47:50 +0100, Jason A. Donenfeld wrote:
> I generally use a slightly simpler algorithm in various different projects:
> 
> //[0, bound)
> static unsigned long random_bounded(unsigned long bound)
> {
>unsigned long ret;
>const unsigned long max_mod_bound = (1 + ~bound) % bound;
> 
>if (bound < 2)
>return 0;
>do
>ret = random_integer();
>while (ret < max_mod_bound);
>return ret % bound;
> }
>
> Is the motivation behind using Lemire that you avoid the division (via
> the modulo) in favor of a multiplication?

Yes.  If we define eps = max_mod_bound * ldexp(1.0, -BITS_PER_LONG) as
the probability of one retry, and retries = eps / (1 - eps) as the
expected number of retries, then both algorithms take 1+retries
random_integer()s.

The above algorithm takes 2 divisions, always.  Divides are slow, and
usually not pipelined, so two in short succession incur a latency penalty.

Lemire's multiplicative algorithm takes 1 multiplication on the fast
path (probability 1 - 2*eps on average), 1 additional division on the slow
path (probability 2*eps), and 1 multiplication per retry.

In the common case when bound is much less than ULONG_MAX, eps is
tiny and the fast path is taken almost all the time, and it's
a huge win.
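
To make that concrete, here is a minimal user-space sketch of the
multiplicative method (my illustration, not the proposed
get_random_max() code; random_integer() is the same hypothetical 64-bit
generator as above, and a 64-bit long plus gcc's unsigned __int128 are
assumed):

unsigned long lemire_bounded(unsigned long s)   /* s must be nonzero */
{
        unsigned __int128 m = (unsigned __int128)random_integer() * s;
        unsigned long low = (unsigned long)m;

        if (low < s) {                  /* Slow path: need exact threshold */
                unsigned long threshold = -s % s;       /* 2^64 mod s */

                while (low < threshold) {       /* Retry biased draws */
                        m = (unsigned __int128)random_integer() * s;
                        low = (unsigned long)m;
                }
        }
        return m >> 64;                 /* High half is uniform in [0, s) */
}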

Even in the absolute worst case of bound = ULONG_MAX/2 + 2 when
eps ~ 0.5 (2 multiplies, 0.5 divide; there's no 2*eps penalty in
this case), it's faster as long as 2 multiplies cost less than 1.5
divides.

If you want simpler code, we could omit the fast path and still get
a speedup.  But a predictable branch for a divide seemed like
a worthwhile trade.


(FYI, this all came about as a side project of a kernel-janitor project
to replace "prandom_u32() % range" by "prandom_u32() * range >> 32".
I'm also annoyed that get_random_u32() and get_random_u64() have
separate buffers, even if EFFICIENT_UNALIGNED_ACCESS, but that's
a separate complaint.)


[RESEND PATCH v2 3/5] lib/sort: Avoid indirect calls to built-in swap

2019-03-19 Thread George Spelvin
Similar to what's being done in the net code, this takes advantage of
the fact that most invocations use only a few common swap functions, and
replaces indirect calls to them with (highly predictable) conditional
branches.  (The downside, of course, is that if you *do* use a custom
swap function, there are a few extra predicted branches on the code path.)

This actually *shrinks* the x86-64 code, because it inlines the various
swap functions inside do_swap, eliding function prologues & epilogues.

x86-64 code size 767 -> 703 bytes (-64)

Signed-off-by: George Spelvin 
Acked-by: Andrey Abramov 
---
 lib/sort.c | 51 ---
 1 file changed, 36 insertions(+), 15 deletions(-)

diff --git a/lib/sort.c b/lib/sort.c
index 0d24d0c5c0fc..50855ea8c262 100644
--- a/lib/sort.c
+++ b/lib/sort.c
@@ -54,10 +54,8 @@ static bool is_aligned(const void *base, size_t size, unsigned char align)
  * subtract (since the intervening mov instructions don't alter the flags).
  * Gcc 8.1.0 doesn't have that problem.
  */
-static void swap_words_32(void *a, void *b, int size)
+static void swap_words_32(void *a, void *b, size_t n)
 {
-   size_t n = (unsigned int)size;
-
do {
u32 t = *(u32 *)(a + (n -= 4));
*(u32 *)(a + n) = *(u32 *)(b + n);
@@ -80,10 +78,8 @@ static void swap_words_32(void *a, void *b, int size)
  * but it's possible to have 64-bit loads without 64-bit pointers (e.g.
  * x32 ABI).  Are there any cases the kernel needs to worry about?
  */
-static void swap_words_64(void *a, void *b, int size)
+static void swap_words_64(void *a, void *b, size_t n)
 {
-   size_t n = (unsigned int)size;
-
do {
 #ifdef CONFIG_64BIT
u64 t = *(u64 *)(a + (n -= 8));
@@ -109,10 +105,8 @@ static void swap_words_64(void *a, void *b, int size)
  *
  * This is the fallback if alignment doesn't allow using larger chunks.
  */
-static void swap_bytes(void *a, void *b, int size)
+static void swap_bytes(void *a, void *b, size_t n)
 {
-   size_t n = (unsigned int)size;
-
do {
char t = ((char *)a)[--n];
((char *)a)[n] = ((char *)b)[n];
@@ -120,6 +114,33 @@ static void swap_bytes(void *a, void *b, int size)
} while (n);
 }
 
+typedef void (*swap_func_t)(void *a, void *b, int size);
+
+/*
+ * The values are arbitrary as long as they can't be confused with
+ * a pointer, but small integers make for the smallest compare
+ * instructions.
+ */
+#define SWAP_WORDS_64 (swap_func_t)0
+#define SWAP_WORDS_32 (swap_func_t)1
+#define SWAP_BYTES    (swap_func_t)2
+
+/*
+ * The function pointer is last to make tail calls most efficient if the
+ * compiler decides not to inline this function.
+ */
+static void do_swap(void *a, void *b, size_t size, swap_func_t swap_func)
+{
+   if (swap_func == SWAP_WORDS_64)
+   swap_words_64(a, b, size);
+   else if (swap_func == SWAP_WORDS_32)
+   swap_words_32(a, b, size);
+   else if (swap_func == SWAP_BYTES)
+   swap_bytes(a, b, size);
+   else
+   swap_func(a, b, (int)size);
+}
+
 /**
  * parent - given the offset of the child, find the offset of the parent.
  * @i: the offset of the heap element whose parent is sought.  Non-zero.
@@ -157,7 +178,7 @@ static size_t parent(size_t i, unsigned int lsbit, size_t size)
  * This function does a heapsort on the given array.  You may provide
  * a swap_func function if you need to do something more than a memory
  * copy (e.g. fix up pointers or auxiliary data), but the built-in swap
- * isn't usually a bottleneck.
+ * avoids a slow retpoline and so is significantly faster.
  *
  * Sorting time is O(n log n) both on average and worst-case. While
  * quicksort is slightly faster on average, it suffers from exploitable
@@ -177,11 +198,11 @@ void sort(void *base, size_t num, size_t size,
 
if (!swap_func) {
if (is_aligned(base, size, 8))
-   swap_func = swap_words_64;
+   swap_func = SWAP_WORDS_64;
else if (is_aligned(base, size, 4))
-   swap_func = swap_words_32;
+   swap_func = SWAP_WORDS_32;
else
-   swap_func = swap_bytes;
+   swap_func = SWAP_BYTES;
}
 
/*
@@ -197,7 +218,7 @@ void sort(void *base, size_t num, size_t size,
if (a)  /* Building heap: sift down --a */
a -= size;
else if (n -= size) /* Sorting: Extract root to --n */
-   swap_func(base, base + n, size);
+   do_swap(base, base + n, size, swap_func);
else/* Sort complete */
break;
 
@@ -224,7 +245,7 @@ void sort(void *base, size_t num, size_t size,
c = b;  /* Where "a" belongs */

[RESEND PATCH v2 0/5] lib/sort & lib/list_sort: faster and smaller

2019-03-19 Thread George Spelvin
(Resend because earlier send had GIT_AUTHOR_DATE in the e-mail
headers and got filed with last month's archives.  And probably
tripped a few spam filters.)

v1->v2: Various spelling, naming and code style cleanups.
Generally positive and no negative responses to the
goals and algorithms used.

I'm running these patches, with CONFIG_TEST_SORT and
CONFIG_TEST_LIST_SORT, on the machine I'm sending this from.
I have tweaked the comments further, but I have verified
the compiled object code is identical to a snapshot I took
when I rebooted.

As far as I'm concerned, this is ready to be merged.
As there is no owner in MAINTAINERS, I was thinking of
sending it via AKPM, like the recent lib/lzo changes.
Andrew, is that okay with you?

Because CONFIG_RETPOLINE has made indirect calls much more expensive,
I thought I'd try to reduce the number made by the library sort
functions.

The first three patches apply to lib/sort.c.

Patch #1 is a simple optimization.  The built-in swap has special cases
for aligned 4- and 8-byte objects.  But those are almost never used;
most calls to sort() work on larger structures, which fall back to the
byte-at-a-time loop.  This generalizes them to aligned *multiples* of 4
and 8 bytes.  (If nothing else, it saves an awful lot of energy by not
thrashing the store buffers as much.)

Patch #2 grabs a juicy piece of low-hanging fruit.  I agree that
nice simple solid heapsort is preferable to more complex algorithms
(sorry, Andrey), but it's possible to implement heapsort with far fewer
comparisons (50% asymptotically, 25-40% reduction for realistic sizes)
than the way it's been done up to now.  And with some care, the code
ends up smaller, as well.  This is the "big win" patch.

Patch #3 adds the same sort of indirect call bypass that has been added
to the net code of late.  The great majority of the callers use the
builtin swap functions, so replace the indirect call to sort_func with a
(highly predictable) series of if() statements.  Rather surprisingly,
this decreased code size, as the swap functions were inlined and their
prologue & epilogue code eliminated.

lib/list_sort.c is a bit trickier, as merge sort is already close to
optimal, and we don't want to introduce triumphs of theory over
practicality like the Ford-Johnson merge-insertion sort.

Patch #4, without changing the algorithm, chops 32% off the code size and
removes the part[MAX_LIST_LENGTH+1] pointer array (and the corresponding
upper limit on efficiently sortable input size).

Patch #5 improves the algorithm.  The previous code is already optimal
for power-of-two (or slightly smaller) size inputs, but when the input
size is just over a power of 2, there's a very unbalanced final merge.

There are, in the literature, several algorithms which solve this, but
they all depend on the "breadth-first" merge order which was replaced
by commit 835cc0c8477f with a more cache-friendly "depth-first" order.
Some hard thinking came up with a depth-first algorithm which defers
merges as little as possible while avoiding bad merges.  This saves
0.2*n compares, averaged over all sizes.

The code size increase is minimal (64 bytes on x86-64, reducing the net
savings to 26%), but the comments expanded significantly to document
the clever algorithm.


TESTING NOTES: I have some ugly user-space benchmarking code
which I used for testing before moving this code into the kernel.
Shout if you want a copy.

I'm running this code right now, with CONFIG_TEST_SORT and
CONFIG_TEST_LIST_SORT, but I confess I haven't rebooted since
the last round of minor edits to quell checkpatch.  I figure there
will be at least one round of comments and final testing.

George Spelvin (5):
  lib/sort: Make swap functions more generic
  lib/sort: Use more efficient bottom-up heapsort variant
  lib/sort: Avoid indirect calls to built-in swap
  lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS
  lib/list_sort: Optimize number of calls to comparison function

 include/linux/list_sort.h |   1 +
 lib/list_sort.c   | 244 +-
 lib/sort.c| 266 +-
 3 files changed, 387 insertions(+), 124 deletions(-)

-- 
2.20.1



[RESEND PATCH v2 1/5] lib/sort: Make swap functions more generic

2019-03-19 Thread George Spelvin
Rather than having special-case swap functions for 4- and 8-byte objects,
special-case aligned multiples of 4 or 8 bytes.  This speeds up most
users of sort() by avoiding fallback to the byte copy loop.

Despite what commit ca96ab859ab4 ("lib/sort: Add 64 bit swap function")
claims, very few users of sort() sort pointers (or pointer-sized
objects); most sort structures containing at least two words.
(E.g. drivers/acpi/fan.c:acpi_fan_get_fps() sorts an array of 40-byte
struct acpi_fan_fps.)

The functions also got renamed to reflect the fact that they support
multiple words.  In the great tradition of bikeshedding, the names were
by far the most contentious issue during review of this patch series.

x86-64 code size 872 -> 886 bytes (+14)

Signed-off-by: George Spelvin 
Acked-by: Andrey Abramov 
Feedback-from: Andy Shevchenko 
Feedback-from: Rasmus Villemoes 
Feedback-from: Geert Uytterhoeven 
---
 lib/sort.c | 135 +
 1 file changed, 105 insertions(+), 30 deletions(-)

diff --git a/lib/sort.c b/lib/sort.c
index d6b7a202b0b6..ec79eac85e21 100644
--- a/lib/sort.c
+++ b/lib/sort.c
@@ -11,35 +11,108 @@
 #include 
 #include 
 
-static int alignment_ok(const void *base, int align)
+/**
+ * is_aligned - is this pointer & size okay for word-wide copying?
+ * @base: pointer to data
+ * @size: size of each element
+ * @align: required aignment (typically 4 or 8)
+ *
+ * Returns true if elements can be copied using word loads and stores.
+ * The size must be a multiple of the alignment, and the base address must
+ * be if we do not have CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.
+ *
+ * For some reason, gcc doesn't know to optimize "if (a & mask || b & mask)"
+ * to "if ((a | b) & mask)", so we do that by hand.
+ */
+__attribute_const__ __always_inline
+static bool is_aligned(const void *base, size_t size, unsigned char align)
 {
-   return IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
-   ((unsigned long)base & (align - 1)) == 0;
+   unsigned char lsbits = (unsigned char)size;
+
+   (void)base;
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+   lsbits |= (unsigned char)(uintptr_t)base;
+#endif
+   return (lsbits & (align - 1)) == 0;
 }
 
-static void u32_swap(void *a, void *b, int size)
+/**
+ * swap_words_32 - swap two elements in 32-bit chunks
+ * @a, @b: pointers to the elements
+ * @size: element size (must be a multiple of 4)
+ *
+ * Exchange the two objects in memory.  This exploits base+index addressing,
+ * which basically all CPUs have, to minimize loop overhead computations.
+ *
+ * For some reason, on x86 gcc 7.3.0 adds a redundant test of n at the
+ * bottom of the loop, even though the zero flag is still valid from the
+ * subtract (since the intervening mov instructions don't alter the flags).
+ * Gcc 8.1.0 doesn't have that problem.
+ */
+static void swap_words_32(void *a, void *b, int size)
 {
-   u32 t = *(u32 *)a;
-   *(u32 *)a = *(u32 *)b;
-   *(u32 *)b = t;
-}
-
-static void u64_swap(void *a, void *b, int size)
-{
-   u64 t = *(u64 *)a;
-   *(u64 *)a = *(u64 *)b;
-   *(u64 *)b = t;
-}
-
-static void generic_swap(void *a, void *b, int size)
-{
-   char t;
+   size_t n = (unsigned int)size;
 
do {
-   t = *(char *)a;
-   *(char *)a++ = *(char *)b;
-   *(char *)b++ = t;
-   } while (--size > 0);
+   u32 t = *(u32 *)(a + (n -= 4));
+   *(u32 *)(a + n) = *(u32 *)(b + n);
+   *(u32 *)(b + n) = t;
+   } while (n);
+}
+
+/**
+ * swap_words_64 - swap two elements in 64-bit chunks
+ * @a, @b: pointers to the elements
+ * @size: element size (must be a multiple of 8)
+ *
+ * Exchange the two objects in memory.  This exploits base+index
+ * addressing, which basically all CPUs have, to minimize loop overhead
+ * computations.
+ *
+ * We'd like to use 64-bit loads if possible.  If they're not, emulating
+ * one requires base+index+4 addressing which x86 has but most other
+ * processors do not.  If CONFIG_64BIT, we definitely have 64-bit loads,
+ * but it's possible to have 64-bit loads without 64-bit pointers (e.g.
+ * x32 ABI).  Are there any cases the kernel needs to worry about?
+ */
+static void swap_words_64(void *a, void *b, int size)
+{
+   size_t n = (unsigned int)size;
+
+   do {
+#ifdef CONFIG_64BIT
+   u64 t = *(u64 *)(a + (n -= 8));
+   *(u64 *)(a + n) = *(u64 *)(b + n);
+   *(u64 *)(b + n) = t;
+#else
+   /* Use two 32-bit transfers to avoid base+index+4 addressing */
+   u32 t = *(u32 *)(a + (n -= 4));
+   *(u32 *)(a + n) = *(u32 *)(b + n);
+   *(u32 *)(b + n) = t;
+
+   t = *(u32 *)(a + (n -= 4));
+   *(u32 *)(a + n) = *(u32 *)(b + n);
+   *(u32 *)(b + n) = t;
+#endif
+   } while (n);
+}
+
+/**
+ 

[RESEND PATCH v2 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-19 Thread George Spelvin
Rather than a fixed-size array of pending sorted runs, use the ->prev
links to keep track of things.  This reduces stack usage, eliminates
some ugly overflow handling, and reduces the code size.

Also:
* merge() no longer needs to handle NULL inputs, so simplify.
* The same applies to merge_and_restore_back_links(), which is renamed
  to the less ponderous merge_final().  (It's a static helper function,
  so we don't need a super-descriptive name; comments will do.)
* Document the actual return value requirements on the (*cmp)()
  function; some callers are already using this feature.

x86-64 code size 1086 -> 739 bytes (-347)

(Yes, I see checkpatch complaining about no space after comma in
"__attribute__((nonnull(2,3,4,5)))".  Checkpatch is wrong.)

Signed-off-by: George Spelvin 
Acked-by: Andrey Abramov 
Feedback-from: Rasmus Villemoes 
Feedback-from: Andy Shevchenko 
Feedback-from: Geert Uytterhoeven 
---
 include/linux/list_sort.h |   1 +
 lib/list_sort.c   | 169 --
 2 files changed, 109 insertions(+), 61 deletions(-)

diff --git a/include/linux/list_sort.h b/include/linux/list_sort.h
index ba79956e848d..20f178c24e9d 100644
--- a/include/linux/list_sort.h
+++ b/include/linux/list_sort.h
@@ -6,6 +6,7 @@
 
 struct list_head;
 
+__attribute__((nonnull(2,3)))
 void list_sort(void *priv, struct list_head *head,
   int (*cmp)(void *priv, struct list_head *a,
  struct list_head *b));
diff --git a/lib/list_sort.c b/lib/list_sort.c
index 85759928215b..fc807dd60a51 100644
--- a/lib/list_sort.c
+++ b/lib/list_sort.c
@@ -7,33 +7,47 @@
 #include 
 #include 
 
-#define MAX_LIST_LENGTH_BITS 20
+/*
+ * By declaring the compare function with the __pure attribute, we give
+ * the compiler more opportunity to optimize.  Ideally, we'd use this in
+ * the prototype of list_sort(), but that would involve a lot of churn
+ * at all call sites, so just cast the function pointer passed in.
+ */
+typedef int __pure __attribute__((nonnull(2,3))) (*cmp_func)(void *,
+   struct list_head const *, struct list_head const *);
 
 /*
  * Returns a list organized in an intermediate format suited
  * to chaining of merge() calls: null-terminated, no reserved or
  * sentinel head node, "prev" links not maintained.
  */
-static struct list_head *merge(void *priv,
-   int (*cmp)(void *priv, struct list_head *a,
-   struct list_head *b),
+__attribute__((nonnull(2,3,4)))
+static struct list_head *merge(void *priv, cmp_func cmp,
struct list_head *a, struct list_head *b)
 {
-   struct list_head head, *tail = &head;
+   struct list_head *head, **tail = &head;
 
-   while (a && b) {
+   for (;;) {
/* if equal, take 'a' -- important for sort stability */
-   if ((*cmp)(priv, a, b) <= 0) {
-   tail->next = a;
+   if (cmp(priv, a, b) <= 0) {
+   *tail = a;
+   tail = &a->next;
a = a->next;
+   if (!a) {
+   *tail = b;
+   break;
+   }
} else {
-   tail->next = b;
+   *tail = b;
+   tail = &b->next;
b = b->next;
+   if (!b) {
+   *tail = a;
+   break;
+   }
}
-   tail = tail->next;
}
-   tail->next = a?:b;
-   return head.next;
+   return head;
 }
 
 /*
@@ -43,44 +57,52 @@ static struct list_head *merge(void *priv,
  * prev-link restoration pass, or maintaining the prev links
  * throughout.
  */
-static void merge_and_restore_back_links(void *priv,
-   int (*cmp)(void *priv, struct list_head *a,
-   struct list_head *b),
-   struct list_head *head,
-   struct list_head *a, struct list_head *b)
+__attribute__((nonnull(2,3,4,5)))
+static void merge_final(void *priv, cmp_func cmp, struct list_head *head,
+   struct list_head *a, struct list_head *b)
 {
struct list_head *tail = head;
u8 count = 0;
 
-   while (a && b) {
+   for (;;) {
/* if equal, take 'a' -- important for sort stability */
-   if ((*cmp)(priv, a, b) <= 0) {
+   if (cmp(priv, a, b) <= 0) {
tail->next = a;
a->prev = tail;
+   tail = a;
a = a->next;
+   if (!a)
+   break;
} else {
t

[RESEND PATCH v2 5/5] lib/list_sort: Optimize number of calls to comparison function

2019-03-19 Thread George Spelvin
CONFIG_RETPOLINE has severely degraded indirect function call
performance, so it's worth putting some effort into reducing
the number of times cmp() is called.

This patch avoids badly unbalanced merges on unlucky input sizes.
It slightly increases the code size, but saves an average of 0.2*n
calls to cmp().

x86-64 code size 739 -> 803 bytes (+64)

Unfortunately, there's not a lot of low-hanging fruit in a merge
sort; it already performs only n*log2(n) - K*n + O(1) compares.
The leading coefficient is already at the theoretical limit (log2(n!)
corresponds to K=1.4427), so we're fighting over the linear term, and
the best mergesort can do is K=1.2645, achieved when n is a power of 2.

The differences between mergesort variants appear when n is *not*
a power of 2; K is a function of the fractional part of log2(n).
Top-down mergesort does best of all, achieving a minimum K=1.2408, and
an average (over all sizes) K=1.248.  However, that requires knowing
the number of entries to be sorted ahead of time, and making a full
pass over the input to count it conflicts with a second performance
goal, which is cache blocking.

Obviously, we have to read the entire list into L1 cache at some point,
and performance is best if it fits.  But if it doesn't fit, each full
pass over the input causes a cache miss per element, which is undesirable.

While textbooks explain bottom-up mergesort as a succession of merging
passes, practical implementations do merging in depth-first order:
as soon as two lists of the same size are available, they are merged.
This allows as many merge passes as possible to fit into L1; only the
final few merges force cache misses.

This cache-friendly depth-first merge order depends on us merging the
beginning of the input as much as possible before we've even seen the
end of the input (and thus know its size).

The simple eager merge pattern causes bad performance when n is just
over a power of 2.  If n=1028, the final merge is between 1024- and
4-element lists, which is wasteful of comparisons.  (This is actually
worse on average than n=1025, because a 1024:1 merge will, on average,
end after 512 compares, while 1024:4 will walk 4/5 of the list.)

Because of this, bottom-up mergesort achieves K < 0.5 for such sizes,
and has an average (over all sizes) K of around 1.  (My experiments
show K=1.01, while theory predicts K=0.965.)

There are "worst-case optimal" variants of bottom-up mergesort which
avoid this bad performance, but the algorithms given in the literature,
such as queue-mergesort and boustrophedonic mergesort, depend on the
breadth-first multi-pass structure that we are trying to avoid.

This implementation is as eager as possible while ensuring that all merge
passes are at worst 1:2 unbalanced.  This achieves the same average
K=1.207 as queue-mergesort, which is 0.2*n better than bottom-up, and
only 0.04*n behind top-down mergesort.

Specifically, it defers merging two lists of size 2^k until it is known
that there are 2^k additional inputs following.  This ensures that the
final uneven merges triggered by reaching the end of the input will be
at worst 2:1.  This will avoid cache misses as long as 3*2^k elements
fit into the cache.

(I confess to being more than a little bit proud of how clean this
code turned out.  It took a lot of thinking, but the resultant inner
loop is very simple and efficient.)

Refs:
  Bottom-up Mergesort: A Detailed Analysis
  Wolfgang Panny, Helmut Prodinger
  Algorithmica 14(4):340--354, October 1995
  https://doi.org/10.1007/BF01294131
  https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.5260

  The cost distribution of queue-mergesort, optimal mergesorts, and
  power-of-two rules
  Wei-Mei Chen, Hsien-Kuei Hwang, Gen-Huey Chen
  Journal of Algorithms 30(2); Pages 423--448, February 1999
  https://doi.org/10.1006/jagm.1998.0986
  https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.5380

  Queue-Mergesort
  Mordecai J. Golin, Robert Sedgewick
  Information Processing Letters, 48(5):253--259, 10 December 1993
  https://doi.org/10.1016/0020-0190(93)90088-q
  https://sci-hub.tw/10.1016/0020-0190(93)90088-Q

Signed-off-by: George Spelvin 
Acked-by: Andrey Abramov 
Feedback-from: Rasmus Villemoes 
---
 lib/list_sort.c | 115 ++--
 1 file changed, 92 insertions(+), 23 deletions(-)

diff --git a/lib/list_sort.c b/lib/list_sort.c
index fc807dd60a51..623a9158ac8a 100644
--- a/lib/list_sort.c
+++ b/lib/list_sort.c
@@ -113,11 +113,6 @@ static void merge_final(void *priv, cmp_func cmp, struct list_head *head,
  * @head: the list to sort
  * @cmp: the elements comparison function
  *
- * This function implements a bottom-up merge sort, which has O(nlog(n))
- * complexity.  We use depth-first order to take advantage of cacheing.
- * (E.g. when we get to the fourth element, we immediately merge the
- * first two 2-element lists.)
- *
  * The comparison function @cmp must return > 0 if @a shoul

[RESEND PATCH v2 2/5] lib/sort: Use more efficient bottom-up heapsort variant

2019-03-19 Thread George Spelvin
This uses fewer comparisons than the previous code (approaching half
as many for large random inputs), but produces identical results;
it actually performs the exact same series of swap operations.

Specifically, it reduces the average number of compares from
2*n*log2(n) - 3*n + o(n)  to  n*log2(n) + 0.37*n + o(n).

This is still 1.63*n worse than glibc qsort() which manages
n*log2(n) - 1.26*n, but at least the leading coefficient is correct.

Standard heapsort, when sifting down, performs two comparisons
per level: one to find the greater child, and a second to see
if the current node should be exchanged with that child.

Bottom-up heapsort observes that it's better to postpone the second
comparison and search for the leaf where -infinity would be sent
to, then search back *up* for the current node's destination.

Since sifting down usually proceeds to the leaf level (that's where
half the nodes are), this does O(1) second comparisons rather
than log2(n).  That saves a lot of (expensive since Spectre)
indirect function calls.
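
For illustration only (a plain max-heap of ints, ignoring the byte
offsets and the indirect cmp/swap calls the real code has to go
through), the two-phase sift-down looks roughly like this:

static void sift_down_bottom_up(int *heap, size_t n, size_t root)
{
        size_t i = root;
        int v = heap[root];

        /* Phase 1: one compare per level, descend toward the greater child. */
        for (;;) {
                size_t l = 2 * i + 1, r = l + 1;

                if (r < n)
                        i = heap[r] > heap[l] ? r : l;
                else if (l < n)
                        i = l;
                else
                        break;
        }
        /* Phase 2: climb back up to where v actually belongs (usually O(1)). */
        while (heap[i] < v)
                i = (i - 1) / 2;
        /* Rotate v into place; the path above it shifts up one level. */
        while (i > root) {
                int t = heap[i];

                heap[i] = v;
                v = t;
                i = (i - 1) / 2;
        }
        heap[root] = v;
}

Phase 2 replaces the second compare per level that standard sift-down
pays, which is where the savings come from.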

The one time it's worse than the previous code is if there are
large numbers of duplicate keys, when the top-down algorithm is
O(n) and bottom-up is O(n log n).  For distinct keys, it's provably
always better, doing 1.5*n*log2(n) + O(n) in the worst case.

(The code is not significantly more complex.  This patch also
merges the heap-building and -extracting sift-down loops,
resulting in a net code size savings.)

x86-64 code size 885 -> 767 bytes (-118)

(I see the checkpatch complaint about "else if (n -= size)".
The alternative is significantly uglier.)

Signed-off-by: George Spelvin 
Acked-by: Andrey Abramov 
---
 lib/sort.c | 110 ++---
 1 file changed, 80 insertions(+), 30 deletions(-)

diff --git a/lib/sort.c b/lib/sort.c
index ec79eac85e21..0d24d0c5c0fc 100644
--- a/lib/sort.c
+++ b/lib/sort.c
@@ -1,8 +1,13 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
- * A fast, small, non-recursive O(nlog n) sort for the Linux kernel
+ * A fast, small, non-recursive O(n log n) sort for the Linux kernel
  *
- * Jan 23 2005  Matt Mackall 
+ * This performs n*log2(n) + 0.37*n + o(n) comparisons on average,
+ * and 1.5*n*log2(n) + O(n) in the (very contrived) worst case.
+ *
+ * Glibc qsort() manages n*log2(n) - 1.26*n for random inputs (1.63*n
+ * better) at the expense of stack usage and much larger code to avoid
+ * quicksort's O(n^2) worst case.
  */
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -15,7 +20,7 @@
  * is_aligned - is this pointer & size okay for word-wide copying?
  * @base: pointer to data
  * @size: size of each element
- * @align: required aignment (typically 4 or 8)
+ * @align: required alignment (typically 4 or 8)
  *
  * Returns true if elements can be copied using word loads and stores.
  * The size must be a multiple of the alignment, and the base address must
@@ -115,6 +120,32 @@ static void swap_bytes(void *a, void *b, int size)
} while (n);
 }
 
+/**
+ * parent - given the offset of the child, find the offset of the parent.
+ * @i: the offset of the heap element whose parent is sought.  Non-zero.
+ * @lsbit: a precomputed 1-bit mask, equal to "size & -size"
+ * @size: size of each element
+ *
+ * In terms of array indexes, the parent of element j = @i/@size is simply
+ * (j-1)/2.  But when working in byte offsets, we can't use implicit
+ * truncation of integer divides.
+ *
+ * Fortunately, we only need one bit of the quotient, not the full divide.
+ * @size has a least significant bit.  That bit will be clear if @i is
+ * an even multiple of @size, and set if it's an odd multiple.
+ *
+ * Logically, we're doing "if (i & lsbit) i -= size;", but since the
+ * branch is unpredictable, it's done with a bit of clever branch-free
+ * code instead.
+ */
+__attribute_const__ __always_inline
+static size_t parent(size_t i, unsigned int lsbit, size_t size)
+{
+   i -= size;
+   i -= size & -(i & lsbit);
+   return i / 2;
+}
+
 /**
  * sort - sort an array of elements
  * @base: pointer to data to sort
@@ -129,17 +160,20 @@ static void swap_bytes(void *a, void *b, int size)
  * isn't usually a bottleneck.
  *
  * Sorting time is O(n log n) both on average and worst-case. While
- * qsort is about 20% faster on average, it suffers from exploitable
+ * quicksort is slightly faster on average, it suffers from exploitable
  * O(n*n) worst-case behavior and extra memory requirements that make
  * it less suitable for kernel use.
  */
-
 void sort(void *base, size_t num, size_t size,
  int (*cmp_func)(const void *, const void *),
  void (*swap_func)(void *, void *, int size))
 {
/* pre-scale counters for performance */
-   int i = (num/2 - 1) * size, n = num * size, c, r;
+   size_t n = num * size, a = (num/2) * size;
+   const unsigned int lsbit = size & -size;  /* Used to find parent */
+
+ 

[PATCH v2 0/5] lib/sort & lib/list_sort: faster and smaller

2019-03-15 Thread George Spelvin
v1->v2: Various spelling, naming and code style cleanups.
Generally positive and no negative responses to the
goals and algorithms used.

I'm running these patches, with CONFIG_TEST_SORT and
CONFIG_TEST_LIST_SORT, on the machine I'm sending this from.
I have tweaked the comments further, but I have verified
the compiled object code is identical to a snapshot I took
when I rebooted.

As far as I'm concerned, this is ready to be merged.
As there is no owner in MAINTAINERS, I was thinking of
sending it via AKPM, like the recent lib/lzo changes.
Andrew, is that okay with you?

Because CONFIG_RETPOLINE has made indirect calls much more expensive,
I thought I'd try to reduce the number made by the library sort
functions.

The first three patches apply to lib/sort.c.

Patch #1 is a simple optimization.  The built-in swap has special cases
for aligned 4- and 8-byte objects.  But those are almost never used;
most calls to sort() work on larger structures, which fall back to the
byte-at-a-time loop.  This generalizes them to aligned *multiples* of 4
and 8 bytes.  (If nothing else, it saves an awful lot of energy by not
thrashing the store buffers as much.)

Patch #2 grabs a juicy piece of low-hanging fruit.  I agree that
nice simple solid heapsort is preferable to more complex algorithms
(sorry, Andrey), but it's possible to implement heapsort with far fewer
comparisons (50% asymptotically, 25-40% reduction for realistic sizes)
than the way it's been done up to now.  And with some care, the code
ends up smaller, as well.  This is the "big win" patch.

Patch #3 adds the same sort of indirect call bypass that has been added
to the net code of late.  The great majority of the callers use the
builtin swap functions, so replace the indirect call to sort_func with a
(highly predictable) series of if() statements.  Rather surprisingly,
this decreased code size, as the swap functions were inlined and their
prologue & epilogue code eliminated.

lib/list_sort.c is a bit trickier, as merge sort is already close to
optimal, and we don't want to introduce triumphs of theory over
practicality like the Ford-Johnson merge-insertion sort.

Patch #4, without changing the algorithm, chops 32% off the code size and
removes the part[MAX_LIST_LENGTH+1] pointer array (and the corresponding
upper limit on efficiently sortable input size).

Patch #5 improves the algorithm.  The previous code is already optimal
for power-of-two (or slightly smaller) size inputs, but when the input
size is just over a power of 2, there's a very unbalanced final merge.

There are, in the literature, several algorithms which solve this, but
they all depend on the "breadth-first" merge order which was replaced
by commit 835cc0c8477f with a more cache-friendly "depth-first" order.
Some hard thinking came up with a depth-first algorithm which defers
merges as little as possible while avoiding bad merges.  This saves
0.2*n compares, averaged over all sizes.

The code size increase is minimal (64 bytes on x86-64, reducing the net
savings to 26%), but the comments expanded significantly to document
the clever algorithm.


TESTING NOTES: I have some ugly user-space benchmarking code
which I used for testing before moving this code into the kernel.
Shout if you want a copy.

I'm running this code right now, with CONFIG_TEST_SORT and
CONFIG_TEST_LIST_SORT, but I confess I haven't rebooted since
the last round of minor edits to quell checkpatch.  I figure there
will be at least one round of comments and final testing.

George Spelvin (5):
  lib/sort: Make swap functions more generic
  lib/sort: Use more efficient bottom-up heapsort variant
  lib/sort: Avoid indirect calls to built-in swap
  lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS
  lib/list_sort: Optimize number of calls to comparison function

 include/linux/list_sort.h |   1 +
 lib/list_sort.c   | 244 +-
 lib/sort.c| 266 +-
 3 files changed, 387 insertions(+), 124 deletions(-)

-- 
2.20.1



[PATCH v2 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-15 Thread George Spelvin
Rather than a fixed-size array of pending sorted runs, use the ->prev
links to keep track of things.  This reduces stack usage, eliminates
some ugly overflow handling, and reduces the code size.

Also:
* merge() no longer needs to handle NULL inputs, so simplify.
* The same applies to merge_and_restore_back_links(), which is renamed
  to the less ponderous merge_final().  (It's a static helper function,
  so we don't need a super-descriptive name; comments will do.)
* Document the actual return value requirements on the (*cmp)()
  function; some callers are already using this feature.

x86-64 code size 1086 -> 739 bytes (-347)

(Yes, I see checkpatch complaining about no space after comma in
"__attribute__((nonnull(2,3,4,5)))".  Checkpatch is wrong.)

Signed-off-by: George Spelvin 
Acked-by: Andrey Abramov 
Feedback-from: Rasmus Villemoes 
Feedback-from: Andy Shevchenko 
Feedback-from: Geert Uytterhoeven 
---
 include/linux/list_sort.h |   1 +
 lib/list_sort.c   | 169 --
 2 files changed, 109 insertions(+), 61 deletions(-)

diff --git a/include/linux/list_sort.h b/include/linux/list_sort.h
index ba79956e848d..20f178c24e9d 100644
--- a/include/linux/list_sort.h
+++ b/include/linux/list_sort.h
@@ -6,6 +6,7 @@
 
 struct list_head;
 
+__attribute__((nonnull(2,3)))
 void list_sort(void *priv, struct list_head *head,
   int (*cmp)(void *priv, struct list_head *a,
  struct list_head *b));
diff --git a/lib/list_sort.c b/lib/list_sort.c
index 85759928215b..fc807dd60a51 100644
--- a/lib/list_sort.c
+++ b/lib/list_sort.c
@@ -7,33 +7,47 @@
 #include 
 #include 
 
-#define MAX_LIST_LENGTH_BITS 20
+/*
+ * By declaring the compare function with the __pure attribute, we give
+ * the compiler more opportunity to optimize.  Ideally, we'd use this in
+ * the prototype of list_sort(), but that would involve a lot of churn
+ * at all call sites, so just cast the function pointer passed in.
+ */
+typedef int __pure __attribute__((nonnull(2,3))) (*cmp_func)(void *,
+   struct list_head const *, struct list_head const *);
 
 /*
  * Returns a list organized in an intermediate format suited
  * to chaining of merge() calls: null-terminated, no reserved or
  * sentinel head node, "prev" links not maintained.
  */
-static struct list_head *merge(void *priv,
-   int (*cmp)(void *priv, struct list_head *a,
-   struct list_head *b),
+__attribute__((nonnull(2,3,4)))
+static struct list_head *merge(void *priv, cmp_func cmp,
struct list_head *a, struct list_head *b)
 {
-   struct list_head head, *tail = &head;
+   struct list_head *head, **tail = &head;
 
-   while (a && b) {
+   for (;;) {
/* if equal, take 'a' -- important for sort stability */
-   if ((*cmp)(priv, a, b) <= 0) {
-   tail->next = a;
+   if (cmp(priv, a, b) <= 0) {
+   *tail = a;
+   tail = &a->next;
a = a->next;
+   if (!a) {
+   *tail = b;
+   break;
+   }
} else {
-   tail->next = b;
+   *tail = b;
+   tail = &b->next;
b = b->next;
+   if (!b) {
+   *tail = a;
+   break;
+   }
}
-   tail = tail->next;
}
-   tail->next = a?:b;
-   return head.next;
+   return head;
 }
 
 /*
@@ -43,44 +57,52 @@ static struct list_head *merge(void *priv,
  * prev-link restoration pass, or maintaining the prev links
  * throughout.
  */
-static void merge_and_restore_back_links(void *priv,
-   int (*cmp)(void *priv, struct list_head *a,
-   struct list_head *b),
-   struct list_head *head,
-   struct list_head *a, struct list_head *b)
+__attribute__((nonnull(2,3,4,5)))
+static void merge_final(void *priv, cmp_func cmp, struct list_head *head,
+   struct list_head *a, struct list_head *b)
 {
struct list_head *tail = head;
u8 count = 0;
 
-   while (a && b) {
+   for (;;) {
/* if equal, take 'a' -- important for sort stability */
-   if ((*cmp)(priv, a, b) <= 0) {
+   if (cmp(priv, a, b) <= 0) {
tail->next = a;
a->prev = tail;
+   tail = a;
a = a->next;
+   if (!a)
+   break;
} else {
t

[PATCH v2 3/5] lib/sort: Avoid indirect calls to built-in swap

2019-03-15 Thread George Spelvin
Similar to what's being done in the net code, this takes advantage of
the fact that most invocations use only a few common swap functions, and
replaces indirect calls to them with (highly predictable) conditional
branches.  (The downside, of course, is that if you *do* use a custom
swap function, there are a few extra predicted branches on the code path.)

This actually *shrinks* the x86-64 code, because it inlines the various
swap functions inside do_swap, eliding function prologues & epilogues.

x86-64 code size 767 -> 703 bytes (-64)

Signed-off-by: George Spelvin 
Acked-by: Andrey Abramov 
---
 lib/sort.c | 51 ---
 1 file changed, 36 insertions(+), 15 deletions(-)

diff --git a/lib/sort.c b/lib/sort.c
index 0d24d0c5c0fc..50855ea8c262 100644
--- a/lib/sort.c
+++ b/lib/sort.c
@@ -54,10 +54,8 @@ static bool is_aligned(const void *base, size_t size, unsigned char align)
  * subtract (since the intervening mov instructions don't alter the flags).
  * Gcc 8.1.0 doesn't have that problem.
  */
-static void swap_words_32(void *a, void *b, int size)
+static void swap_words_32(void *a, void *b, size_t n)
 {
-   size_t n = (unsigned int)size;
-
do {
u32 t = *(u32 *)(a + (n -= 4));
*(u32 *)(a + n) = *(u32 *)(b + n);
@@ -80,10 +78,8 @@ static void swap_words_32(void *a, void *b, int size)
  * but it's possible to have 64-bit loads without 64-bit pointers (e.g.
  * x32 ABI).  Are there any cases the kernel needs to worry about?
  */
-static void swap_words_64(void *a, void *b, int size)
+static void swap_words_64(void *a, void *b, size_t n)
 {
-   size_t n = (unsigned int)size;
-
do {
 #ifdef CONFIG_64BIT
u64 t = *(u64 *)(a + (n -= 8));
@@ -109,10 +105,8 @@ static void swap_words_64(void *a, void *b, int size)
  *
  * This is the fallback if alignment doesn't allow using larger chunks.
  */
-static void swap_bytes(void *a, void *b, int size)
+static void swap_bytes(void *a, void *b, size_t n)
 {
-   size_t n = (unsigned int)size;
-
do {
char t = ((char *)a)[--n];
((char *)a)[n] = ((char *)b)[n];
@@ -120,6 +114,33 @@ static void swap_bytes(void *a, void *b, int size)
} while (n);
 }
 
+typedef void (*swap_func_t)(void *a, void *b, int size);
+
+/*
+ * The values are arbitrary as long as they can't be confused with
+ * a pointer, but small integers make for the smallest compare
+ * instructions.
+ */
+#define SWAP_WORDS_64 (swap_func_t)0
+#define SWAP_WORDS_32 (swap_func_t)1
+#define SWAP_BYTES    (swap_func_t)2
+
+/*
+ * The function pointer is last to make tail calls most efficient if the
+ * compiler decides not to inline this function.
+ */
+static void do_swap(void *a, void *b, size_t size, swap_func_t swap_func)
+{
+   if (swap_func == SWAP_WORDS_64)
+   swap_words_64(a, b, size);
+   else if (swap_func == SWAP_WORDS_32)
+   swap_words_32(a, b, size);
+   else if (swap_func == SWAP_BYTES)
+   swap_bytes(a, b, size);
+   else
+   swap_func(a, b, (int)size);
+}
+
 /**
  * parent - given the offset of the child, find the offset of the parent.
  * @i: the offset of the heap element whose parent is sought.  Non-zero.
@@ -157,7 +178,7 @@ static size_t parent(size_t i, unsigned int lsbit, size_t size)
  * This function does a heapsort on the given array.  You may provide
  * a swap_func function if you need to do something more than a memory
  * copy (e.g. fix up pointers or auxiliary data), but the built-in swap
- * isn't usually a bottleneck.
+ * avoids a slow retpoline and so is significantly faster.
  *
  * Sorting time is O(n log n) both on average and worst-case. While
  * quicksort is slightly faster on average, it suffers from exploitable
@@ -177,11 +198,11 @@ void sort(void *base, size_t num, size_t size,
 
if (!swap_func) {
if (is_aligned(base, size, 8))
-   swap_func = swap_words_64;
+   swap_func = SWAP_WORDS_64;
else if (is_aligned(base, size, 4))
-   swap_func = swap_words_32;
+   swap_func = SWAP_WORDS_32;
else
-   swap_func = swap_bytes;
+   swap_func = SWAP_BYTES;
}
 
/*
@@ -197,7 +218,7 @@ void sort(void *base, size_t num, size_t size,
if (a)  /* Building heap: sift down --a */
a -= size;
else if (n -= size) /* Sorting: Extract root to --n */
-   swap_func(base, base + n, size);
+   do_swap(base, base + n, size, swap_func);
else/* Sort complete */
break;
 
@@ -224,7 +245,7 @@ void sort(void *base, size_t num, size_t size,
c = b;  /* Where "a" belongs */

[PATCH v2 1/5] lib/sort: Make swap functions more generic

2019-03-15 Thread George Spelvin
Rather than having special-case swap functions for 4- and 8-byte objects,
special-case aligned multiples of 4 or 8 bytes.  This speeds up most
users of sort() by avoiding fallback to the byte copy loop.

Despite what commit ca96ab859ab4 ("lib/sort: Add 64 bit swap function")
claims, very few users of sort() sort pointers (or pointer-sized
objects); most sort structures containing at least two words.
(E.g. drivers/acpi/fan.c:acpi_fan_get_fps() sorts an array of 40-byte
struct acpi_fan_fps.)

The functions also got renamed to reflect the fact that they support
multiple words.  In the great tradition of bikeshedding, the names were
by far the most contentious issue during review of this patch series.

x86-64 code size 872 -> 886 bytes (+14)

Signed-off-by: George Spelvin 
Acked-by: Andrey Abramov 
Feedback-from: Andy Shevchenko 
Feedback-from: Rasmus Villemoes 
Feedback-from: Geert Uytterhoeven 
---
 lib/sort.c | 135 +
 1 file changed, 105 insertions(+), 30 deletions(-)

diff --git a/lib/sort.c b/lib/sort.c
index d6b7a202b0b6..ec79eac85e21 100644
--- a/lib/sort.c
+++ b/lib/sort.c
@@ -11,35 +11,108 @@
 #include 
 #include 
 
-static int alignment_ok(const void *base, int align)
+/**
+ * is_aligned - is this pointer & size okay for word-wide copying?
+ * @base: pointer to data
+ * @size: size of each element
+ * @align: required aignment (typically 4 or 8)
+ *
+ * Returns true if elements can be copied using word loads and stores.
+ * The size must be a multiple of the alignment, and the base address must
+ * be if we do not have CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.
+ *
+ * For some reason, gcc doesn't know to optimize "if (a & mask || b & mask)"
+ * to "if ((a | b) & mask)", so we do that by hand.
+ */
+__attribute_const__ __always_inline
+static bool is_aligned(const void *base, size_t size, unsigned char align)
 {
-   return IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
-   ((unsigned long)base & (align - 1)) == 0;
+   unsigned char lsbits = (unsigned char)size;
+
+   (void)base;
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+   lsbits |= (unsigned char)(uintptr_t)base;
+#endif
+   return (lsbits & (align - 1)) == 0;
 }
 
-static void u32_swap(void *a, void *b, int size)
+/**
+ * swap_words_32 - swap two elements in 32-bit chunks
+ * @a, @b: pointers to the elements
+ * @size: element size (must be a multiple of 4)
+ *
+ * Exchange the two objects in memory.  This exploits base+index addressing,
+ * which basically all CPUs have, to minimize loop overhead computations.
+ *
+ * For some reason, on x86 gcc 7.3.0 adds a redundant test of n at the
+ * bottom of the loop, even though the zero flag is still valid from the
+ * subtract (since the intervening mov instructions don't alter the flags).
+ * Gcc 8.1.0 doesn't have that problem.
+ */
+static void swap_words_32(void *a, void *b, int size)
 {
-   u32 t = *(u32 *)a;
-   *(u32 *)a = *(u32 *)b;
-   *(u32 *)b = t;
-}
-
-static void u64_swap(void *a, void *b, int size)
-{
-   u64 t = *(u64 *)a;
-   *(u64 *)a = *(u64 *)b;
-   *(u64 *)b = t;
-}
-
-static void generic_swap(void *a, void *b, int size)
-{
-   char t;
+   size_t n = (unsigned int)size;
 
do {
-   t = *(char *)a;
-   *(char *)a++ = *(char *)b;
-   *(char *)b++ = t;
-   } while (--size > 0);
+   u32 t = *(u32 *)(a + (n -= 4));
+   *(u32 *)(a + n) = *(u32 *)(b + n);
+   *(u32 *)(b + n) = t;
+   } while (n);
+}
+
+/**
+ * swap_words_64 - swap two elements in 64-bit chunks
+ * @a, @b: pointers to the elements
+ * @size: element size (must be a multiple of 8)
+ *
+ * Exchange the two objects in memory.  This exploits base+index
+ * addressing, which basically all CPUs have, to minimize loop overhead
+ * computations.
+ *
+ * We'd like to use 64-bit loads if possible.  If they're not, emulating
+ * one requires base+index+4 addressing which x86 has but most other
+ * processors do not.  If CONFIG_64BIT, we definitely have 64-bit loads,
+ * but it's possible to have 64-bit loads without 64-bit pointers (e.g.
+ * x32 ABI).  Are there any cases the kernel needs to worry about?
+ */
+static void swap_words_64(void *a, void *b, int size)
+{
+   size_t n = (unsigned int)size;
+
+   do {
+#ifdef CONFIG_64BIT
+   u64 t = *(u64 *)(a + (n -= 8));
+   *(u64 *)(a + n) = *(u64 *)(b + n);
+   *(u64 *)(b + n) = t;
+#else
+   /* Use two 32-bit transfers to avoid base+index+4 addressing */
+   u32 t = *(u32 *)(a + (n -= 4));
+   *(u32 *)(a + n) = *(u32 *)(b + n);
+   *(u32 *)(b + n) = t;
+
+   t = *(u32 *)(a + (n -= 4));
+   *(u32 *)(a + n) = *(u32 *)(b + n);
+   *(u32 *)(b + n) = t;
+#endif
+   } while (n);
+}
+
+/**
+ 

[PATCH v2 2/5] lib/sort: Use more efficient bottom-up heapsort variant

2019-03-15 Thread George Spelvin
This uses fewer comparisons than the previous code (approaching half
as many for large random inputs), but produces identical results;
it actually performs the exact same series of swap operations.

Specifically, it reduces the average number of compares from
2*n*log2(n) - 3*n + o(n)  to  n*log2(n) + 0.37*n + o(n).

This is still 1.63*n worse than glibc qsort() which manages
n*log2(n) - 1.26*n, but at least the leading coefficient is correct.

Standard heapsort, when sifting down, performs two comparisons
per level: one to find the greater child, and a second to see
if the current node should be exchanged with that child.

Bottom-up heapsort observes that it's better to postpone the second
comparison and search for the leaf where -infinity would be sent
to, then search back *up* for the current node's destination.

Since sifting down usually proceeds to the leaf level (that's where
half the nodes are), this does O(1) second comparisons rather
than log2(n).  That saves a lot of (expensive since Spectre)
indirect function calls.
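
For readers who want the shape of the algorithm without wading through
the premultiplied byte offsets, here is a simplified, index-based sketch
of the bottom-up sift-down (illustrative only; the patch shares one such
loop between heap building and extraction, and every comparison goes
through cmp_func(), which is exactly the call being economized):

#include <stddef.h>

/* Sift heap[a] down in the max-heap heap[0..n), bottom-up variant. */
static void sift_down_bottom_up(int *heap, size_t a, size_t n)
{
        size_t b = a, c, dest;

        /* Phase 1: walk to the leaf level along the greater child,
         * one comparison per level, not yet looking at heap[a]. */
        while ((c = 2 * b + 1) + 1 < n)
                b = heap[c] >= heap[c + 1] ? c : c + 1;
        if (c + 1 == n)         /* lone left child on the last level */
                b = c;

        /* Phase 2: back up to the first ancestor greater than heap[a];
         * since heap[a] is usually small, this averages O(1) compares. */
        while (b != a && heap[a] >= heap[b])
                b = (b - 1) / 2;

        /* Phase 3: rotate the path a..b up one step.  Each swap is
         * against the fixed destination slot, which carries the value
         * being sifted along with it. */
        for (dest = b; b != a; ) {
                int t;

                b = (b - 1) / 2;
                t = heap[b];
                heap[b] = heap[dest];
                heap[dest] = t;
        }
}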

The one time it's worse than the previous code is if there are
large numbers of duplicate keys, when the top-down algorithm is
O(n) and bottom-up is O(n log n).  For distinct keys, it's provably
always better, doing 1.5*n*log2(n) + O(n) in the worst case.

(The code is not significantly more complex.  This patch also
merges the heap-building and -extracting sift-down loops,
resulting in a net code size savings.)

x86-64 code size 885 -> 767 bytes (-118)

(I see the checkpatch complaint about "else if (n -= size)".
The alternative is significantly uglier.)

Signed-off-by: George Spelvin 
Acked-by: Andrey Abramov 
---
 lib/sort.c | 110 ++---
 1 file changed, 80 insertions(+), 30 deletions(-)

diff --git a/lib/sort.c b/lib/sort.c
index ec79eac85e21..0d24d0c5c0fc 100644
--- a/lib/sort.c
+++ b/lib/sort.c
@@ -1,8 +1,13 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
- * A fast, small, non-recursive O(nlog n) sort for the Linux kernel
+ * A fast, small, non-recursive O(n log n) sort for the Linux kernel
  *
- * Jan 23 2005  Matt Mackall 
+ * This performs n*log2(n) + 0.37*n + o(n) comparisons on average,
+ * and 1.5*n*log2(n) + O(n) in the (very contrived) worst case.
+ *
+ * Glibc qsort() manages n*log2(n) - 1.26*n for random inputs (1.63*n
+ * better) at the expense of stack usage and much larger code to avoid
+ * quicksort's O(n^2) worst case.
  */
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -15,7 +20,7 @@
  * is_aligned - is this pointer & size okay for word-wide copying?
  * @base: pointer to data
  * @size: size of each element
- * @align: required aignment (typically 4 or 8)
+ * @align: required alignment (typically 4 or 8)
  *
  * Returns true if elements can be copied using word loads and stores.
  * The size must be a multiple of the alignment, and the base address must
@@ -115,6 +120,32 @@ static void swap_bytes(void *a, void *b, int size)
} while (n);
 }
 
+/**
+ * parent - given the offset of the child, find the offset of the parent.
+ * @i: the offset of the heap element whose parent is sought.  Non-zero.
+ * @lsbit: a precomputed 1-bit mask, equal to "size & -size"
+ * @size: size of each element
+ *
+ * In terms of array indexes, the parent of element j = @i/@size is simply
+ * (j-1)/2.  But when working in byte offsets, we can't use implicit
+ * truncation of integer divides.
+ *
+ * Fortunately, we only need one bit of the quotient, not the full divide.
+ * @size has a least significant bit.  That bit will be clear if @i is
+ * an even multiple of @size, and set if it's an odd multiple.
+ *
+ * Logically, we're doing "if (i & lsbit) i -= size;", but since the
+ * branch is unpredictable, it's done with a bit of clever branch-free
+ * code instead.
+ */
+__attribute_const__ __always_inline
+static size_t parent(size_t i, unsigned int lsbit, size_t size)
+{
+   i -= size;
+   i -= size & -(i & lsbit);
+   return i / 2;
+}
+
 /**
  * sort - sort an array of elements
  * @base: pointer to data to sort
@@ -129,17 +160,20 @@ static void swap_bytes(void *a, void *b, int size)
  * isn't usually a bottleneck.
  *
  * Sorting time is O(n log n) both on average and worst-case. While
- * qsort is about 20% faster on average, it suffers from exploitable
+ * quicksort is slightly faster on average, it suffers from exploitable
  * O(n*n) worst-case behavior and extra memory requirements that make
  * it less suitable for kernel use.
  */
-
 void sort(void *base, size_t num, size_t size,
  int (*cmp_func)(const void *, const void *),
  void (*swap_func)(void *, void *, int size))
 {
/* pre-scale counters for performance */
-   int i = (num/2 - 1) * size, n = num * size, c, r;
+   size_t n = num * size, a = (num/2) * size;
+   const unsigned int lsbit = size & -size;  /* Used to find parent */
+
+ 

[PATCH v2 5/5] lib/list_sort: Optimize number of calls to comparison function

2019-03-15 Thread George Spelvin
CONFIG_RETPOLINE has severely degraded indirect function call
performance, so it's worth putting some effort into reducing
the number of times cmp() is called.

This patch avoids badly unbalanced merges on unlucky input sizes.
It slightly increases the code size, but saves an average of 0.2*n
calls to cmp().

x86-64 code size 739 -> 803 bytes (+64)

Unfortunately, there's not a lot of low-hanging fruit in a merge
sort; it already performs only n*log2(n) - K*n + O(1) compares.
The leading coefficient is already at the theoretical limit (log2(n!)
corresponds to K=1.4427), so we're fighting over the linear term, and
the best mergesort can do is K=1.2645, achieved when n is a power of 2.

The differences between mergesort variants appear when n is *not*
a power of 2; K is a function of the fractional part of log2(n).
Top-down mergesort does best of all, achieving a minimum K=1.2408, and
an average (over all sizes) K=1.248.  However, that requires knowing
the number of entries to be sorted ahead of time, and making a full
pass over the input to count it conflicts with a second performance
goal, which is cache blocking.

Obviously, we have to read the entire list into L1 cache at some point,
and performance is best if it fits.  But if it doesn't fit, each full
pass over the input causes a cache miss per element, which is undesirable.

While textbooks explain bottom-up mergesort as a succession of merging
passes, practical implementations do merging in depth-first order:
as soon as two lists of the same size are available, they are merged.
This allows as many merge passes as possible to fit into L1; only the
final few merges force cache misses.

This cache-friendly depth-first merge order depends on us merging the
beginning of the input as much as possible before we've even seen the
end of the input (and thus know its size).

The simple eager merge pattern causes bad performance when n is just
over a power of 2.  If n=1028, the final merge is between 1024- and
4-element lists, which is wasteful of comparisons.  (This is actually
worse on average than n=1025, because a 1024:1 merge will, on average,
end after 512 compares, while 1024:4 will walk 4/5 of the list.)

Because of this, bottom-up mergesort achieves K < 0.5 for such sizes,
and has an average (over all sizes) K of around 1.  (My experiments
show K=1.01, while theory predicts K=0.965.)

There are "worst-case optimal" variants of bottom-up mergesort which
avoid this bad performance, but the algorithms given in the literature,
such as queue-mergesort and boustrophedonic mergesort, depend on the
breadth-first multi-pass structure that we are trying to avoid.

This implementation is as eager as possible while ensuring that all merge
passes are at worst 1:2 unbalanced.  This achieves the same average
K=1.207 as queue-mergesort, which is 0.2*n better than bottom-up, and
only 0.04*n behind top-down mergesort.

Specifically, it defers merging two lists of size 2^k until it is known
that there are 2^k additional inputs following.  This ensures that the
final uneven merges triggered by reaching the end of the input will be
at worst 2:1.  This will avoid cache misses as long as 3*2^k elements
fit into the cache.
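
Expressed as code, the control rule is tiny.  The sketch below is only
an illustration of the rule (the real loop folds this test into its walk
down the stack of pending runs); "count" is the number of elements
already moved to the pending runs, checked once per incoming element
before that element is added:

#include <stdbool.h>
#include <stddef.h>

/* Return true if two pending runs of size 2^(*k) should be merged now;
 * they sit just below the *k smaller runs on the pending stack. */
static bool merge_due(size_t count, unsigned int *k)
{
        unsigned int j = 0;

        while (count & 1) {     /* one trailing 1 bit per smaller run */
                count >>= 1;
                j++;
        }
        *k = j;
        /*
         * A set bit above the trailing ones means two runs of size 2^j
         * exist and, counting the element about to be added, 2^j more
         * elements are known to follow them -- so merging them now can
         * never force a later merge worse than 2:1.
         */
        return count != 0;
}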

(I confess to being more than a little bit proud of how clean this
code turned out.  It took a lot of thinking, but the resultant inner
loop is very simple and efficient.)

Refs:
  Bottom-up Mergesort: A Detailed Analysis
  Wolfgang Panny, Helmut Prodinger
  Algorithmica 14(4):340--354, October 1995
  https://doi.org/10.1007/BF01294131
  https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.5260

  The cost distribution of queue-mergesort, optimal mergesorts, and
  power-of-two rules
  Wei-Mei Chen, Hsien-Kuei Hwang, Gen-Huey Chen
  Journal of Algorithms 30(2); Pages 423--448, February 1999
  https://doi.org/10.1006/jagm.1998.0986
  https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.5380

  Queue-Mergesort
  Mordecai J. Golin, Robert Sedgewick
  Information Processing Letters, 48(5):253--259, 10 December 1993
  https://doi.org/10.1016/0020-0190(93)90088-q
  https://sci-hub.tw/10.1016/0020-0190(93)90088-Q

Signed-off-by: George Spelvin 
Acked-by: Andrey Abramov 
Feedback-from: Rasmus Villemoes 
---
 lib/list_sort.c | 115 ++--
 1 file changed, 92 insertions(+), 23 deletions(-)

diff --git a/lib/list_sort.c b/lib/list_sort.c
index fc807dd60a51..623a9158ac8a 100644
--- a/lib/list_sort.c
+++ b/lib/list_sort.c
@@ -113,11 +113,6 @@ static void merge_final(void *priv, cmp_func cmp, struct list_head *head,
  * @head: the list to sort
  * @cmp: the elements comparison function
  *
- * This function implements a bottom-up merge sort, which has O(nlog(n))
- * complexity.  We use depth-first order to take advantage of cacheing.
- * (E.g. when we get to the fourth element, we immediately merge the
- * first two 2-element lists.)
- *
 * The comparison function @cmp must return > 0 if @a shoul

Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-15 Thread George Spelvin
Indeed, thanks to everyone who commented.  The extra conceptual
complexity and reduced readability is Just Not Worth It.

v2 (and final, as far as I'm concerned) follows.


Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-15 Thread George Spelvin
On Fri, 15 Mar 2019 at 13:57:05 +0100, Geert Uytterhoeven wrote:
> On Fri, Mar 15, 2019 at 11:23 AM George Spelvin  wrote:
>> On Fri, 15 Mar 2019 at 09:20:58 +0100, Geert Uytterhoeven wrote:
>>> On Fri, Mar 15, 2019 at 5:33 AM George Spelvin  wrote:
>>>> One question I should ask everyone: should "count" be 32 or 64 bits
>>>> on 64-bit machines?  That would let x86 save a few REX bytes.  (815
>>>> vs. 813 byte code, if anyone cares.)
>>>>
>>>> Allegedly ARM can save a few pJ by gating the high 32
>>>> bits of the ALU.
>>>>
>>>> Most other 64-bit processors would prefer 64-bit operations as
>>>> it saves masking operations.
> 
> So just make it unsigned int, unconditionally.

As I wrote originally (and quoted above), other 64-bit machines don't
have 32-bit operations and prefer 64-bit operations because they don't
require masking.  x86 (for historical compatibility) and ARM (for power
saving) are the ones that come to mind.

I'm trying to present the case to spur discussion, but it really is
a *question* I'm asking about whether to do that, not a suggestion
phrased as a question.


Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-15 Thread George Spelvin
On Fri, 15 Mar 2019 at 09:20:58 +0100, Geert Uytterhoeven wrote:
> On Fri, Mar 15, 2019 at 5:33 AM George Spelvin  wrote:
>> On Thu, 14 Mar 2019 at 11:10:41 +0200, Andy Shevchenko wrote:
>>> On Tue, Mar 05, 2019 at 03:06:44AM +0000, George Spelvin wrote:
>>>> +for (bit = 1; count & bit; bit <<= 1) {
>>>> +cur = merge(priv, (cmp_func)cmp, pending, cur);
>>>> +pending = pending->prev;  /* Untouched by merge() */
>>>>  }
>>>
>>> Wouldn't be it the same to
>>>
>>>   bit = ffz(count);
>>>   while (bit--) {
>>>   ...
>>>   }
>>> ?
>>>
>>> Though I dunno which one is generating better code.
>>
>> One question I should ask everyone: should "count" be 32 or 64 bits
>> on 64-bit machines?  That would let x86 save a few REX bytes.  (815
>> vs. 813 byte code, if anyone cares.)
>>
>> Allegedly ARM can save a few pJ by gating the high 32
>> bits of the ALU.
>>
>> Most other 64-bit processors would prefer 64-bit operations as
>> it saves masking operations.
>>
>> If we never sort a list with more than 2^32 entries, it
>> makes no difference.
>>
>> If we use a 32-bit count and we *do* sort a list with more than
>> 2^32 entries, then it still sorts, but the performance degrades to
>> O((n/2^32)^2).
>>
>> Just how often do we expect the kernel to face lists that long?
>> (Note that the old code was O((n/2^20)^2).)
>
> Using size_t sounds most logical to me (argument of least surprise).

Yes, it is the obvious solution, which is why that's my default choice.

But a bit of thought shows that a list long enough to break a
32-bit implementation is beyond ridiculous.

The list must be at least 3 * 2^32 elements long to make the sort
merge non-optimally.  That's 1.5 * 2^37 bytes (192 GiB) of list_head
structures alone; at least double that for any practical application.
And 32 * 3 * 2^32 + (2 + 3) * 2^32 = 101 * 2^32 = 1.57 * 2^38
compares.

That seems like a lot, but that's not the bottleneck.  Each compare
reads from a new list element, and pretty soon, they'll miss all
caches and go to main memory.

Since the memory locations are random, for any small subset of the
list, you'll get only one element per cache line.  A 32 MiB L3
cache is 2^19 cache lines (assuming 64B lines).  So merge levels
20 through 33 will go to main memory.

That's (12 * 3 + 5) * 2^32 = 1.28 * 2^37 cache misses.  At 60 ns each (typical
local DRAM access time on i7 & Xeon according to Intel), that's a
hard minimum of 10565 seconds = 2h 56m 05s in one list_sort call.
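
(If you want to replay the arithmetic, here is a trivial userspace
check, using the same assumptions as above -- 64-byte lines, a 32 MiB
L3, 60 ns per local DRAM access, n = 3 * 2^32:)

#include <stdio.h>

int main(void)
{
        unsigned long long two32 = 1ULL << 32;
        unsigned long long compares = (32 * 3 + 2 + 3) * two32; /* 101 * 2^32 */
        unsigned long long misses = (12 * 3 + 5) * two32;       /* 1.28 * 2^37 */
        double secs = (double)misses * 60e-9;

        printf("compares: %llu\n", compares);
        printf("misses:   %llu -> %.0f s (about %.1f h)\n",
               misses, secs, secs / 3600.0);
        return 0;
}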

This is definitely the scale of problem where a multithreaded sort is
called for.

It's *so* impossible that maybe it's worth trading that capability
for a couple of bytes in the inner loop.

>> In the code, I could do something like
>>
>> #ifdef CONFIG_X86_64
>> /* Comment explaining why */
>> typedef uint32_t count_t;
>> #else
>> typedef size_t count_t;
>> #endif
>>
>> ...
>> count_t count = 0;
>
> Using different types makes it more complex, e.g. to print the value
> in debug code.
> And adding more typedefs is frowned upon.

It's a *local* typedef, purely for the purpose of moving #ifdef clutter
out of the function declaration.  I agree that *global* typedefs are
discouraged.

As for debugging, that's a red herring; it's easy to cast to (size_t).

> Just my 0.02€.

Thank you for considering the issue!

>> I prefer the shorter _ints and _longs names, but this is just
>> not a hill I want to die on.
>
> Argument of least surprise: don't call something a duck if it's not
> guaranteed to behave like a duck.
>
> If I read "long", this triggers a warning flag in my head: be careful,
> this is 32-bit on 32-bit platforms, and 64-bit on 64-bit platforms.

Good point.  And "_longlong" or "_llong" is a bit ugly, too.


Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-14 Thread George Spelvin
On Thu, 14 Mar 2019 at 11:10:41 +0200, Andy Shevchenko wrote:
> On Tue, Mar 05, 2019 at 03:06:44AM +0000, George Spelvin wrote:
>> +for (bit = 1; count & bit; bit <<= 1) {
>> +cur = merge(priv, (cmp_func)cmp, pending, cur);
>> +pending = pending->prev;  /* Untouched by merge() */
>>  }
>
> Wouldn't be it the same to
> 
>   bit = ffz(count);
>   while (bit--) {
>   ...
>   }
> ?
>
> Though I dunno which one is generating better code.

One question I should ask everyone: should "count" be 32 or 64 bits
on 64-bit machines?  That would let x86 save a few REX bytes.  (815
vs. 813 byte code, if anyone cares.)

Allegedly ARM can save a few pJ by gating the high 32
bits of the ALU.

Most other 64-bit processors would prefer 64-bit operations as
it saves masking operations.

If we never sort a list with more than 2^32 entries, it
makes no difference.

If we use a 32-bit count and we *do* sort a list with more than
2^32 entries, then it still sorts, but the performance degrades to
O((n/2^32)^2).

Just how often do we expect the kernel to face lists that long?
(Note that the old code was O((n/2^20)^2).)

In the code, I could do something like

#ifdef CONFIG_X86_64
/* Comment explaining why */
typedef uint32_t count_t;
#else
typedef size_t count_t;
#endif

...
count_t count = 0;


Re: [PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-14 Thread George Spelvin
>> swap_bytes / swap_4byte_words / swap_8byte_words
>> swap_bytes / swap_ints / swap_longs
>> swap_1 / swap_4 / swap_8
>> Pistols at dawn?

On Thu, 14 Mar 2019 at 22:59:55 +0300, Andrey Abramov wrote:
> Yes, in my opinion, swap_bytes / swap_ints / swap_longs are the
> most readable because we have both swap_ints and swap_longs functions
> (in one file near each other), so I don't think that there will be
> any confusion about size.

Yes, that's what I thought.  They're three related but different
functions, suffixed _bytes, _ints, and _longs.  What could the
difference possibly be?  And if anyone has any lingering doubts,
the functions are right there, with exquisitely clear comments.

Not to mention where they're used.  Is "is_aligned(base, size, 8)"
remotely obscure?  Especially in context:

if (is_aligned(base, size, 8))
swap_func = swap_longs;
else if (is_aligned(base, size, 4))
swap_func = swap_ints;
else
swap_func = swap_bytes;

What subtle and mysterious code.

> But actually, it doesn't matter which name will you take, because
> the meaning of each, in my opinion, is obvious enough, so I don't
> mind about any of these options.

I'm just amazed that this piece of bikeshedding is the most
contentious thing about the patch series.

I mean, if I'd named them:
llanfairpwllgwyngyll()
shravanabelagola()
zheleznodorozhny()
or
peckish()
esuriant()
hungry()
then yes, those would be bad names.

I prefer the shorter _ints and _longs names, but this is just
not a hill I want to die on.


Re: [PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-14 Thread George Spelvin
On Thu, 14 Mar 2019 at 11:41:26 +0100, Geert Uytterhoeven wrote:
> On Thu, Mar 14, 2019 at 11:10 AM George Spelvin  wrote:
>> On Sat, 09 Mar 2019 at 23:19:49 +0300, Andrey Abramov wrote:
>>>> How about one of:
>>>> swap_bytes / swap_ints / swap_longs
>>>> swap_1 / swap_4 / swap_8
>>>
>>> longs are ambiguous, so I would prefer bit-sized types.
>>
>> I already implemented Andrey's suggestions, which were the exact
>> opposite of yours.
>>
>> Pistols at dawn?

I didn't explain the joke because jokes aren't funny if explained,
but just in case, by suggesting a clearly ridiculous method, I was
saying "I have no idea how to resolve this conflict."

> Prepared to fix all future long vs. int bugs?

In the entire kernel?  Or just the one small source file
where these statically scoped helper functions are visible?

In case of uncertainty, the comments and the code are right there
just a few lines away from the one and only call site, so I
don't expect much confusion.

I care a lot about function names when they are exported, but in
this case we're talking about what colour to paint the *inside* of
the bike shed.

I just want to pick some names and move on.  Since nothing so far seems
to satisfy everyone, I'll go with the ponderous but utterly unambiguous:

swap_bytes
swap_4byte_words
swap_8byte_words


Re: [PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-14 Thread George Spelvin
On Sat, 09 Mar 2019 at 23:19:49 +0300, Andrey Abramov wrote:
>> Although I'm thinking of:
>>
>> static bool __attribute_const__
>> is_aligned(const void *base, size_t size, unsigned char align)
>> {
>>  unsigned char lsbits = (unsigned char)size;
>>
>>  (void)base;
>> #ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>>  lsbits |= (unsigned char)(uintptr_t)base;
>> #endif
>>  return (lsbits & (align - 1)) == 0;
>> }
>>
>> Any preference?
> I think it would be better.

>> I find "u32s" confusing; I keep reading the "s" as "signed" rather
>> than a plural.
>>
>> How about one of:
>> swap_bytes / swap_ints / swap_longs
>> swap_1 / swap_4 / swap_8
>
> In my opinion "swap_bytes / swap_ints / swap_longs" are the most readable.


On Thu, 14 Mar 2019 at 11:29:58 +0200, Andy Shevchenko wrote:
> On Sat, Mar 09, 2019 at 03:53:41PM +, l...@sdf.org wrote:
>> static bool __attribute_const__
>> is_aligned(const void *base, size_t size, unsigned char align)
>> {
>>  unsigned char lsbits = (unsigned char)size;
>> #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>>  (void)base;
>> #else
>>  lsbits |= (unsigned char)(uintptr_t)base;
>> #endif
>>  return (lsbits & (align - 1)) == 0;
>> }
>
>> Any preference?
>
> This one looks better in a sense we don't suppress the warnings when it's
> not needed.

>>> For such primitives that operates on top of an arrays we usually
>>> append 's' to the name. Currently the name is misleading.
>>> 
>>> Perhaps u32s_swap().
>> 
>> I don't worry much about the naming of static helper functions.
>> If they were exported, it would be a whole lot more important!
>> 
>> I find "u32s" confusing; I keep reading the "s" as "signed" rather
>> than a plural.
>
> For signedness we use prefixes; for plural, suffixes. I don't see the point of
> confusion. And this is in use in kernel a lot.
>
>> How about one of:
>> swap_bytes / swap_ints / swap_longs
>> swap_1 / swap_4 / swap_8
>
> longs are ambiguous, so I would prefer bit-sized types.

I already implemented Andrey's suggestions, which were the exact
opposite of yours.

Pistols at dawn?


 +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>>> 
>>> Why #ifdef is better than if (IS_ENABLED()) ?
>> 
>> Because CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is bool and not
>> tristate.  IS_ENABLED tests for 'y' or 'm' but we don't need it
>> for something that's only on or off.
>
> There is IS_BUILTIN(), though it's a common practice to use IS_ENABLED()
> even for boolean options (I think because of naming of the macro).

Well, as I said earlier, #ifdef is the most common form in the kernel.
It's also the shortest to write, and I like the fact that it's slightly
simpler.  (Admittedly, "IS_ENABLED" does not take a lot of brain power
to interpret, but it *is* one more macro that might be hiding magic.)

So I'm not inclined to change it without a substantial reason.


Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-14 Thread George Spelvin
On Thu, 14 Mar 2019 at 11:10:41 +0200, Andy Shevchenko wrote:
> On Tue, Mar 05, 2019 at 03:06:44AM +0000, George Spelvin wrote:
>> +/* Do merges corresponding to set lsbits in count */
>
>> +for (bit = 1; count & bit; bit <<= 1) {
>> +cur = merge(priv, (cmp_func)cmp, pending, cur);
>> +pending = pending->prev;  /* Untouched by merge() */
>>  }
>
> Wouldn't be it the same to
>
>   bit = ffz(count);
>   while (bit--) {
>   ...
>   }
> ?
>
> Though I dunno which one is generating better code.

Yes, it's the same.  But count is an incrementing counter, so the
pattern of return values from ffz() is 01020103010201040102010301020105...,
which has mean value 1/2 + 1/4 + 1/8 + 1/16 +... = 1.

So spending one instruction on ffz() to save one instruction per loop
iteration is an even trade, and if the processor doesn't have an ffz()
instruction, it's a loss.

There's a third possible implementation:

>> +for (bit = count; bit & 1; bit >>= 1) {

...which works fine, too.  (It even saves a few bytes of code, so I
might switch to it.)  I used the form I did because my test code
verified that the length of the lists being merged equalled "bit".

The other forms don't have that property.


Thank you for looking at the code!


Re: [PATCH 5/5] lib/list_sort: Optimize number of calls to comparison function

2019-03-13 Thread George Spelvin
On Thu, 14 Mar 2019 at 00:28:16 +0100, Rasmus Villemoes wrote:
> On 05/03/2019 06.58, George Spelvin wrote:
>> This patch avoids badly unbalanced merges on unlucky input sizes.
>> It slightly increases the code size, but saves an average of 0.2*n
>> calls to cmp().
>> 
> [snip]
>> 
>> (I confess to being more than a little bit proud of how clean this
>> code turned out.  It took a lot of thinking, but the resultant inner
>> loop is very simple and efficient.)
>
> This is beautiful. So no comments on the patch itself. One thing that
> might be nice would be to see the reduction in number of cmp callbacks
> explicitly; it should be trivial to use the priv element for that in the
> list_sort_test module. But to really see it one would of course have to
> extend that test to do a lot more different sizes of lists.

And you'd have to compile and run two different kernels, because you
can't run the algorithms side-by-side.  Not a good way to do it.

I used a user-space test harness for testing.  The output is ugly,
but here are lots of numbers for various list sizes.

The first group is single sort invocations.  The list is sorted by
(random) key, then again by address (the inverse), and the number
of comparisons printed.  The numbers in parens are the linear
coefficient K in
comparisons = n*log2(n) - K*n,
i.e. it's
log2(n) - (comparisons / n).
Higher is better.

The three lines for each size are the original list_sort, a top-down
version I wrote as a "perfect" value for reference, and the posted
code.  (1034 and 1040 are particularly bad cases for the old code.)

1: 0 (0.00) 0 (0.00)
1: 0 (0.00) 0 (0.00)
1: 0 (0.00) 0 (0.00)
2: 1 (0.50) 1 (0.50)
2: 1 (0.50) 1 (0.50)
2: 1 (0.50) 1 (0.50)
3: 3 (0.584963) 3 (0.584963)
3: 2 (0.918296) 2 (0.918296)
3: 3 (0.584963) 3 (0.584963)
4: 5 (0.75) 5 (0.75)
4: 5 (0.75) 5 (0.75)
4: 5 (0.75) 5 (0.75)
5: 8 (0.721928) 8 (0.721928)
5: 8 (0.721928) 8 (0.721928)
5: 8 (0.721928) 8 (0.721928)
6: 10 (0.918296) 10 (0.918296)
6: 10 (0.918296) 10 (0.918296)
6: 10 (0.918296) 10 (0.918296)
7: 11 (1.235926) 12 (1.093069)
7: 13 (0.950212) 12 (1.093069)
7: 11 (1.235926) 12 (1.093069)
8: 17 (0.875000) 17 (0.875000)
8: 17 (0.875000) 17 (0.875000)
8: 17 (0.875000) 17 (0.875000)
9: 18 (1.169925) 22 (0.725481)
9: 21 (0.836592) 20 (0.947703)
9: 20 (0.947703) 20 (0.947703)
10: 20 (1.321928) 25 (0.821928)
10: 25 (0.821928) 24 (0.921928)
10: 22 (1.121928) 24 (0.921928)
13: 35 (1.008132) 35 (1.008132)
13: 33 (1.161978) 33 (1.161978)
13: 36 (0.931209) 35 (1.008132)
21: 71 (1.011365) 74 (0.868508)
21: 68 (1.154222) 71 (1.011365)
21: 69 (1.106603) 72 (0.963746)
31: 117 (1.180003) 111 (1.373551)
31: 114 (1.276777) 114 (1.276777)
31: 117 (1.180003) 111 (1.373551)
32: 123 (1.156250) 121 (1.218750)
32: 123 (1.156250) 121 (1.218750)
32: 123 (1.156250) 121 (1.218750)
33: 158 (0.256515) 149 (0.529243)
33: 130 (1.105000) 131 (1.074697)
33: 131 (1.074697) 131 (1.074697)
34: 142 (0.910992) 135 (1.116875)
34: 133 (1.175698) 135 (1.116875)
34: 131 (1.234522) 134 (1.146286)
55: 244 (1.344996) 249 (1.254087)
55: 246 (1.308632) 241 (1.399542)
55: 249 (1.254087) 250 (1.235905)
89: 484 (1.037531) 490 (0.970115)
89: 464 (1.262250) 477 (1.116183)
89: 479 (1.093711) 482 (1.060003)
127: 729 (1.248527) 727 (1.264275)
127: 734 (1.209157) 724 (1.287897)
127: 729 (1.248527) 727 (1.264275)
128: 744 (1.187500) 733 (1.273438)
128: 744 (1.187500) 733 (1.273438)
128: 744 (1.187500) 733 (1.273438)
129: 752 (1.181770) 853 (0.398824)
129: 746 (1.228282) 740 (1.274793)
129: 747 (1.220530) 741 (1.267041)
144: 926 (0.739369) 928 (0.725481)
144: 851 (1.260203) 866 (1.156036)
144: 865 (1.162981) 872 (1.114369)
233: 1556 (1.186075) 1541 (1.250452)
233: 1545 (1.233285) 1527 (1.310538)
233: 1550 (1.211826) 1534 (1.280495)
377: 2787 (1.165848) 2790 (1.157890)
377: 2752 (1.258686) 2771 (1.208288)
377: 2778 (1.189720) 2782 (1.179110)
610: 5115 (0.867420) 5115 (0.867420)
610: 4891 (1.234633) 4883 (1.247747)
610: 4909 (1.205124) 4930 (1.170698)
642: 5385 (0.938579) 5428 (0.871601)
642: 5166 (1.279701) 5185 (1.250105)
642: 5205 (1.218953) 5201 (1.225183)
987: 8574 (1.259976) 8620 (1.213370)
987: 8564 (1.270108) 8599 (1.234647)
987: 8565 (1.269095) 8614 (1.219449)
1022: 8937 (1.252561) 8913 (1.276044)
1022: 8916 (1.273109) 8928 (1.261367)
1022: 8937 (1.252561) 8913 (1.276044)
1023: 8959 (1.241015) 8909 (1.289891)
1023: 8927 (1.272295) 8918 (1.281093)
1023: 8959 (1.241015) 8909 (1.289891)
1024: 8970 (1.240234) 8916 (1.292969)
1024: 8966 (1.244141) 8912 (1.296875)
1024: 8966 (1.244141) 8912 (1.296875)
1025: 9548 (0.686286) 9724 (0.514579)
1025: 8971 (1.249213) 8943 (1.276530)
1025: 8970 (1.250189) 8944 (1.27)
1026: 9771 (0.479423) 9745 (0.504764)
1026: 8936 (1.293263) 8978 (1.252328)
1026: 8942 (1.287415) 8993 (1.237708)
1028: 9643 (0.625274) 9985 (0.292590)
1028: 8980 (1.270216)

Re: [PATCH 2/5] lib/sort: Use more efficient bottom-up heapsort variant

2019-03-13 Thread George Spelvin
On Wed, 13 Mar 2019 at 23:29:40 +0100, Rasmus Villemoes wrote:
> On 21/02/2019 09.21, George Spelvin wrote:
>> +/**
>> + * parent - given the offset of the child, find the offset of the parent.
>> + * @i: the offset of the heap element whose parent is sought.  Non-zero.
>> + * @lsbit: a precomputed 1-bit mask, equal to "size & -size"
>> + * @size: size of each element
>> + *
>> + * In terms of array indexes, the parent of element j = i/size is simply
>> + * (j-1)/2.  But when working in byte offsets, we can't use implicit
>> + * truncation of integer divides.
>> + *
>> + * Fortunately, we only need one bit of the quotient, not the full divide.
>> + * size has a least significant bit.  That bit will be clear if i is
>> + * an even multiple of size, and set if it's an odd multiple.
>> + *
>> + * Logically, we're doing "if (i & lsbit) i -= size;", but since the
>> + * branch is unpredictable, it's done with a bit of clever branch-free
>> + * code instead.
>> + */
>> +__attribute_const__ __always_inline
>> +static size_t parent(size_t i, unsigned int lsbit, size_t size)
>> +{
>> +i -= size;
>> +i -= size & -(i & lsbit);
>> +return i / 2;
>> +}
>> +
>
> Really nice :) I had to work through this by hand, but it's solid.

Thank you!  Yes, the way the mask doesn't include the low-order bits
that don't matter anyway is a bit subtle.

When the code is subtle, use lots of comments.  The entire reason
for making this a separate helper function is to leave room for
the large comment.
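
For anyone checking the arithmetic by hand, here is a quick userspace
sanity test (element size 12, so lsbit = 4); each assert compares the
helper against the plain (j-1)/2 index computation:

#include <assert.h>
#include <stddef.h>

/* The helper from the patch, repeated so this compiles standalone */
static size_t parent(size_t i, unsigned int lsbit, size_t size)
{
        i -= size;
        i -= size & -(i & lsbit);
        return i / 2;
}

int main(void)
{
        const size_t size = 12;                         /* element size */
        const unsigned int lsbit = size & -size;        /* == 4 */
        size_t j;

        for (j = 1; j < 100; j++)       /* child j, parent (j-1)/2 */
                assert(parent(j * size, lsbit, size) == (j - 1) / 2 * size);
        return 0;
}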

>> +unsigned const lsbit = size & -size;/* Used to find parent */
>
> Nit: qualifier before type, "const unsigned". And this sets ZF, so a
> paranoid check for zero size (cf. the other mail) by doing "if (!lsbit)
> return;" is practically free. Though it's probably a bit obscure doing
> it that way...

Actually, this is a personal style thing which I can ignore for the sake
of the kernel, but I believe that it's better to put the qualifier
*after* the type.  This is due to C's pointer declaration syntax.

The standard example of the issue is:

typedef char *pointer;
const char *a;
char const *b;
char * const c;
const pointer d;
pointer const e;

Now, which variables are the same types?

The answer is that a & b are the same (mutable pointer to const
char), and c, d & e are the same (const pointer to mutable char).

If you make a habit of putting the qualifier *after* the type, then
a simple "textual substitution" mental model for the typedef works,
and it's clear that c and e are the same.

It's also clear that b cannot be represented by the typedef because
the const is between "char" and "*", and you obviously can't do that
with the typedef.

But if you put the qualifier first, it's annoying to remember why
a and d are not the same type.

So I've deliberately cultivated the style of putting the qualifier
after the type.

But if the kernel prefers it before...

>> +if (!n)
>> +return;
>
> I'd make that n <= 1. Shouldn't be much more costly.

(Actually, it's "num <= 1"; n is the pre-multiplied form so
n <= 1 can only happen when sorting one one-byte value.)

I actually thought about this and decided not to bother.  I did it
this way during development to stress the general-case code.  But
should I change it?

=== NEVER MIND ===

I had written a long reply justifying leaving it alone to save one
instruction when the light dawned: I can do *both* tests in one
step with
size_t n = num * size, a = (num/2) * size;
unsigned const lsbit = size & -size;/* Used to find parent */

if (!a) /* num < 2 || size == 0 */
return;

So now everyone's happy.

> Nice!

Thank you.  May I translate that into Acked-by?


Re: [PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-13 Thread George Spelvin
Thank you for your thoughtful comments!

On Wed, 13 Mar 2019 at 23:23:44 +0100, Rasmus Villemoes wrote:
> On 21/02/2019 07.30, George Spelvin wrote:
> + * @align: required aignment (typically 4 or 8)
>
> typo aLignment

Thanks; fixed!

>> + * Returns true if elements can be copied using word loads and stores.
>> + * The size must be a multiple of the alignment, and the base address must
>> + * be if we do not have CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.
>
> I wonder if we shouldn't unconditionally require the base to be aligned.
> It of course depends on what exactly 'efficient' means, but if the base
> is not aligned, we know that every single load and store will be
> unaligned, and we're doing lots of them. IIRC even on x86 it's usually
> two cycles instead of one, and even more when we cross a cache line
> boundary (which we do rather often - e.g. in your 40 byte example,
> almost every swap will hit at least one). One could also have some data
> that is naturally 4-byte aligned with an element size that happens to be
> a multiple of 8 bytes; it would be a bit sad to use 8-byte accesses when
> 4-byte would all have been aligned.

Well, a 2-cycle unaligned access is still lots faster than 4 byte accesses.
I think that's a decent interpretation of "EFFICIENT_UNALIGNED_ACCESS":
faster than doing it a byte at a time.  So the change is a net win;
we're just wondering if it could be optimized more.

The 4/8 issue is similar, but not as extreme.  Is one unaligned 64-bit
access faster than two aligned 32-bit accesses?  As you note, it's usually
only twice the cost, so it's no slower, and there's less loop overhead.
So I don't think it's a bad thing here, either.

Re your comments on cache lines, ISTR that x86 can do an unaligned
load with minimal penalty as long as it doesn't cross a cache line:
https://software.intel.com/en-us/articles/reducing-the-impact-of-misaligned-memory-accesses/
So you win on most of the accesses, hopefully enough to pay for the
one unaligned access.  Apparently in processors more recent than
the P4 example Intel used above, it's even less:
https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/

On 32-bit machines, it's actually a 4-byte swap, unrolled twice;
there are no 64-bit memory accesses.  So the concern is only about
8-byte alignment on 64-bit machines.

The great majority of call sites sort structures with pointer or
long members, so are aligned and the question is moot.  I don't
think it's worth overthinking the issue on behalf of the performance
of some rare corner cases.

I have considered doing away with the two word-copy loops and just
having one machine-word-sized loop plus a byte fallback.
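
For reference, that alternative would look something like this (a rough
sketch, not something proposed in this series): machine-word copies
while at least a word remains, bytes for whatever is left.

static void swap_generic(void *a, void *b, int size)
{
        size_t n = (unsigned int)size;

        /* Word-sized chunks first; this still wants an aligned base or
         * CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS, as discussed above. */
        while (n >= sizeof(unsigned long)) {
                unsigned long t;

                n -= sizeof(unsigned long);
                t = *(unsigned long *)(a + n);
                *(unsigned long *)(a + n) = *(unsigned long *)(b + n);
                *(unsigned long *)(b + n) = t;
        }
        /* Byte-at-a-time fallback for the tail */
        while (n) {
                char t = ((char *)a)[--n];

                ((char *)a)[n] = ((char *)b)[n];
                ((char *)b)[n] = t;
        }
}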

>> +unsigned int lsbits = (unsigned int)size;
>
> Drop that cast.

Actually, in response to an earlier comment, I changed it (and the
type of lsbits) to (unsigned char), to emphasize that I *really* only
care about a handful of low bits.

I know gcc doesn't warn about implicit narrowing casts, but
I prefer to make them explicit for documentation reasons.
"Yes, I really mean to throw away the high bits."

Compilers on machines without native byte operations are very good
at using larger registers and optimizing away mask operations.
(Remember that this is inlined, and "align" is a constant 4 or 8.)
>
>> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>> +(void)base;
>> +#else
>> +lsbits |= (unsigned int)(size_t)base;
>
> The kernel usually casts pointers to long or unsigned long. If some
> oddball arch has size_t something other than unsigned long, this would
> give a "warning: cast from pointer to integer of different size". So
> just cast to unsigned long, and drop the cast to unsigned int.

I changed this to uintptr_t and called it good.

>>  static void u32_swap(void *a, void *b, int size)
>>  {
>> -u32 t = *(u32 *)a;
>> -*(u32 *)a = *(u32 *)b;
>> -*(u32 *)b = t;
>> +size_t n = size;
>> +
>
> Since the callback has int in its prototype (we should fix that globally
> at some point...), might as well use a four-byte local variable, to
> avoid rex prefix on x86-64. So just make that "unsigned" or "u32".

Agreed about the eventual global fix.  If you care, the only places
a non-NULL swap function is passed in are:
- arch/arc/kernel/unwind.c
- arch/powerpc/kernel/module_32.c
- arch/powerpc/kernel/module_64.c
! arch/x86/kernel/unwind_orc.c (2x)
- fs/ocfs2/dir.c
- fs/ocfs2/refcounttree.c (3x)
- fs/ocfs2/xattr.c (3x)
- fs/ubifs/find.c
! kernel/jump_label.c
! lib/extable.c

The ones marked with "-" are simple memory swaps that could (should!)
be replaced with NULL.  The ones marked with "!" actually do
something non-trivial.


Actually, I deliberately used a pointer-sized index, since th

Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-10 Thread George Spelvin
Rasmus Villemoes wrote:
> On 05/03/2019 04.06, George Spelvin wrote:
>> + * (Actually, it is always called with @a being the element which was
>> + * originally first, so it is not necessary to to distinguish the @a < @b
>> + * and @a == @b cases; the return value may be a simple boolean.  But if
>> + * you ever *use* this freedom, be sure to update this comment to document
>> + * that code now depends on preserving this property!)
>
> This was and still is used at least by the block layer, and likely
> others as well. While 3110fc79606fb introduced a bunch of if() return -1
> else if () ... stuff, it still ends with a 0/1 result. Before
> 3110fc79606fb, it was even more obvious that this property was used.

Ah, thank you!  I actually read through every list_sort caller in
the kernel to see if I could find anywhere that used it and couldn't,
but I didn't study this code carefully enough to see that it does
in the last step.

Since someone *does* use this, I'll change the comment significantly.

> Grepping around shows that this could probably be used in more places,
> gaining a cycle or two per cmp callback, e.g. xfs_buf_cmp. But that's of
> course outside the scope of this series.

The one that misled me at first was _xfs_buf_obj_cmp, which returns 0/1,
but that's not used by list_sort().  xfs_buf_cmp returns -1/0/+1.

As you might see from the comment around the cmp_func typedef,
there are other things that could be cleaned up if we did a pass
over all the call sites.

(I'm almost tempted to tell the compiler than cmp_func is const,
since it's supposed to be independent of the pointer frobbing that
list_sort does, but then I remember Henry Spencer's maxim about
lying to the compiler.)


Re: [PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-09 Thread George Spelvin
Andy Shevchenko wrote:
> Shouldn't simple memcpy cover these case for both 32- and 64-bit 
> architectures?

Speaking of replacing swap with copying via temporary buffers, one
idea that did come to mind was avoiding swap for sufficiently small
objects.

Every sift-down is actually a circular shift.  Once the target
position has been found, we rotate the path from the root to the
target up one, with the target filled in by the previous root.

(When breaking down the heap, there's one additional link in the
cycle, where the previous root goes to the end of the heap and the
end of the heap goes to the target.)

If we had a temporary buffer (128 bytes would handle most things),
we could copy the previous root there and *copy*, rather than swap,
the elements on the path to the target up, then finally copy the
previous root to the target.

However, to rotate up, this must be done in top-down order.
The way the code currently works with premultiplied offsets, it's
easy to traverse bottom-up, but very awkward to retrace the
top-down path.

The only solution I can think of is to build a bitmap of the
left/right turnings from the root to the leaf, and then back it up
to the target.  There are a few ways to do this:

1) The obvious big-endian order.  The bitmap is simply the 1-based
   position of the leaf. To add a level, shift left and add the new
   bit at the bottom.  To back up a step, shift right.

   To retrace, create a 1-bit mask equal to the msbit of the index
   ((smear_right(x) >> 1) + 1) and repeatedly shift the mask right.

2) The reverse order.  We use a 1-bit mask while building the
   bitmap, and while retracing, just examine the lsbit while shifting
   the bitmap right.

3) As option 1, but don't build the bitmap as we're walking down;
   rather reconstruct it from the premultiplied offset using
   reciprocal_divide().

Nothing really jumps out to me as The Right Way to do it.
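
(For concreteness, option 1 is trivial when working in plain array
indexes; the awkwardness is entirely due to the premultiplied byte
offsets.  Index-only sketch, children of j at 2j+1 and 2j+2: the bits
of d+1 below its leading 1 bit spell out the root-to-d path, most
significant first, 0 = left and 1 = right.)

#include <stddef.h>
#include <stdio.h>

static void walk_root_to(size_t d)
{
        size_t i = d + 1, mask = 1, j = 0;

        while (mask <= i / 2)           /* find the leading 1 bit of i */
                mask <<= 1;
        for (mask >>= 1; mask; mask >>= 1) {
                j = 2 * j + 1 + !!(i & mask);
                printf(" -> %zu", j);   /* ends with j == d */
        }
        printf("\n");
}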

I don't want to bloat the code to the point that it would be
easier to implement a different algorithm entirely.


[PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-08 Thread George Spelvin
Rather than u32_swap and u64_swap working on 4- and 8-byte objects
directly, let them handle any multiple of 4 or 8 bytes.  This speeds
up most users of sort() by avoiding fallback to the byte copy loop.

Despite what commit ca96ab859ab4 ("lib/sort: Add 64 bit swap function")
claims, very few users of sort() sort pointers (or pointer-sized
objects); most sort structures containing at least two words.
(E.g. drivers/acpi/fan.c:acpi_fan_get_fps() sorts an array of 40-byte
struct acpi_fan_fps.)

x86-64 code size 872 -> 885 bytes (+13)

Signed-off-by: George Spelvin 
---
 lib/sort.c | 117 +++--
 1 file changed, 96 insertions(+), 21 deletions(-)

diff --git a/lib/sort.c b/lib/sort.c
index d6b7a202b0b6..dff2ab2e196e 100644
--- a/lib/sort.c
+++ b/lib/sort.c
@@ -11,35 +11,110 @@
 #include 
 #include 
 
-static int alignment_ok(const void *base, int align)
+/**
+ * alignment_ok - is this pointer & size okay for word-wide copying?
+ * @base: pointer to data
+ * @size: size of each element
+ * @align: required aignment (typically 4 or 8)
+ *
+ * Returns true if elements can be copied using word loads and stores.
+ * The size must be a multiple of the alignment, and the base address must
+ * be if we do not have CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.
+ *
+ * For some reason, gcc doesn't know to optimize "if (a & mask || b & mask)"
+ * to "if ((a | b) & mask)", so we do that by hand.
+ */
+static bool __attribute_const__
+alignment_ok(const void *base, size_t size, unsigned int align)
 {
-   return IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
-   ((unsigned long)base & (align - 1)) == 0;
+   unsigned int lsbits = (unsigned int)size;
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+   (void)base;
+#else
+   lsbits |= (unsigned int)(size_t)base;
+#endif
+   lsbits &= align - 1;
+   return lsbits == 0;
 }
 
+/**
+ * u32_swap - swap two elements in 32-bit chunks
+ * @a, @b: pointers to the elements
+ * @size: element size (must be a multiple of 4)
+ *
+ * Exchange the two objects in memory.  This exploits base+index addressing,
+ * which basically all CPUs have, to minimize loop overhead computations.
+ *
+ * For some reason, on x86 gcc 7.3.0 adds a redundant test of n at the
+ * bottom of the loop, even though the zero flag is stil valid from the
+ * subtract (since the intervening mov instructions don't alter the flags).
+ * Gcc 8.1.0 doesn't have that problem.
+ */
 static void u32_swap(void *a, void *b, int size)
 {
-   u32 t = *(u32 *)a;
-   *(u32 *)a = *(u32 *)b;
-   *(u32 *)b = t;
+   size_t n = size;
+
+   do {
+   u32 t = *(u32 *)(a + (n -= 4));
+   *(u32 *)(a + n) = *(u32 *)(b + n);
+   *(u32 *)(b + n) = t;
+   } while (n);
 }
 
+/**
+ * u64_swap - swap two elements in 64-bit chunks
+ * @a, @b: pointers to the elements
+ * @size: element size (must be a multiple of 8)
+ *
+ * Exchange the two objects in memory.  This exploits base+index
+ * addressing, which basically all CPUs have, to minimize loop overhead
+ * computations.
+ *
+ * We'd like to use 64-bit loads if possible.  If they're not, emulating
+ * one requires base+index+4 addressing which x86 has but most other
+ * processors do not.  If CONFIG_64BIT, we definitely have 64-bit loads,
+ * but it's possible to have 64-bit loads without 64-bit pointers (e.g.
+ * x32 ABI).  Are there any cases the kernel needs to worry about?
+ */
+
 static void u64_swap(void *a, void *b, int size)
 {
-   u64 t = *(u64 *)a;
-   *(u64 *)a = *(u64 *)b;
-   *(u64 *)b = t;
-}
-
-static void generic_swap(void *a, void *b, int size)
-{
-   char t;
+   size_t n = size;
 
do {
-   t = *(char *)a;
-   *(char *)a++ = *(char *)b;
-   *(char *)b++ = t;
-   } while (--size > 0);
+#ifdef CONFIG_64BIT
+   u64 t = *(u64 *)(a + (n -= 8));
+   *(u64 *)(a + n) = *(u64 *)(b + n);
+   *(u64 *)(b + n) = t;
+#else
+   /* Use two 32-bit transfers to avoid base+index+4 addressing */
+   u32 t = *(u32 *)(a + (n -= 4));
+   *(u32 *)(a + n) = *(u32 *)(b + n);
+   *(u32 *)(b + n) = t;
+
+   t = *(u32 *)(a + (n -= 4));
+   *(u32 *)(a + n) = *(u32 *)(b + n);
+   *(u32 *)(b + n) = t;
+#endif
+   } while (n);
+}
+
+/**
+ * generic_swap - swap two elements a byte at a time
+ * @a, @b: pointers to the elements
+ * @size: element size
+ *
+ * This is the fallback if alignment doesn't allow using larger chunks.
+ */
+static void generic_swap(void *a, void *b, int size)
+{
+   size_t n = size;
+
+   do {
+   char t = ((char *)a)[--n];
+   ((char *)a)[n] = ((char *)b)[n];
+   ((char *)b)[n] = t;
+   } while (n);
 }
 
 /**
@@ -67,10 +142,10 @@ void sort(void *base, si

[PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-08 Thread George Spelvin
Rather than a fixed-size array of pending sorted runs, use the ->prev
links to keep track of things.  This reduces stack usage, eliminates
some ugly overflow handling, and reduces the code size.
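
The diff below is truncated before list_sort() itself, so here is a
rough reconstruction of the new main loop's bookkeeping, pieced together
from the fragments quoted elsewhere in this thread (a sketch, not the
literal patch).  It lives inside list_sort(), whose priv, head and cmp
parameters are unchanged; merge() and cmp_func are the helpers shown in
the hunks below.  "pending" is a stack of sorted runs chained through
->prev; each run is a NULL-terminated list chained through ->next.

        struct list_head *list = head->next, *pending = NULL;
        size_t count = 0;       /* elements moved to "pending" so far */

        if (list == head)               /* empty list: nothing to do */
                return;
        head->prev->next = NULL;        /* detach into a NULL-terminated list */
        do {
                struct list_head *cur = list;
                size_t bit;

                list = list->next;
                cur->next = NULL;       /* "cur" is now a run of length 1 */

                /* One merge per trailing 1 bit of count: 1+1, 2+2, ... */
                for (bit = 1; count & bit; bit <<= 1) {
                        cur = merge(priv, (cmp_func)cmp, pending, cur);
                        pending = pending->prev;  /* untouched by merge() */
                }
                cur->prev = pending;    /* push the merged run */
                pending = cur;
                count++;
        } while (list);

        /* ...then the pending runs are merged together, ending with
         * merge_final() to rebuild the ->prev links. */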

Also:
* merge() no longer needs to handle NULL inputs, so simplify.
* The same applies to merge_and_restore_back_links(), which is renamed
  to the less ponderous merge_final().  (It's a static helper function,
  so we don't need a super-descriptive name; comments will do.)

x86-64 code size 1086 -> 740 bytes (-346)

(Yes, I see checkpatch complaining about no space after comma in
"__attribute__((nonnull(2,3,4,5)))".  Checkpatch is wrong.)

Signed-off-by: George Spelvin 
---
 include/linux/list_sort.h |   1 +
 lib/list_sort.c   | 152 --
 2 files changed, 96 insertions(+), 57 deletions(-)

diff --git a/include/linux/list_sort.h b/include/linux/list_sort.h
index ba79956e848d..20f178c24e9d 100644
--- a/include/linux/list_sort.h
+++ b/include/linux/list_sort.h
@@ -6,6 +6,7 @@
 
 struct list_head;
 
+__attribute__((nonnull(2,3)))
 void list_sort(void *priv, struct list_head *head,
   int (*cmp)(void *priv, struct list_head *a,
  struct list_head *b));
diff --git a/lib/list_sort.c b/lib/list_sort.c
index 85759928215b..e4819ef0426b 100644
--- a/lib/list_sort.c
+++ b/lib/list_sort.c
@@ -7,33 +7,47 @@
 #include 
 #include 
 
-#define MAX_LIST_LENGTH_BITS 20
+/*
+ * By declaring the compare function with the __pure attribute, we give
+ * the compiler more opportunity to optimize.  Ideally, we'd use this in
+ * the prototype of list_sort(), but that would involve a lot of churn
+ * at all call sites, so just cast the function pointer passed in.
+ */
+typedef int __pure __attribute__((nonnull(2,3))) (*cmp_func)(void *,
+   struct list_head const *, struct list_head const *);
 
 /*
  * Returns a list organized in an intermediate format suited
  * to chaining of merge() calls: null-terminated, no reserved or
  * sentinel head node, "prev" links not maintained.
  */
-static struct list_head *merge(void *priv,
-   int (*cmp)(void *priv, struct list_head *a,
-   struct list_head *b),
+__attribute__((nonnull(2,3,4)))
+static struct list_head *merge(void *priv, cmp_func cmp,
struct list_head *a, struct list_head *b)
 {
-   struct list_head head, *tail = &head;
+   struct list_head *head, **tail = &head;
 
-   while (a && b) {
+   for (;;) {
/* if equal, take 'a' -- important for sort stability */
-   if ((*cmp)(priv, a, b) <= 0) {
-   tail->next = a;
+   if (cmp(priv, a, b) <= 0) {
+   *tail = a;
+   tail = &a->next;
a = a->next;
+   if (!a) {
+   *tail = b;
+   break;
+   }
} else {
-   tail->next = b;
+   *tail = b;
+   tail = &b->next;
b = b->next;
+   if (!b) {
+   *tail = a;
+   break;
+   }
}
-   tail = tail->next;
}
-   tail->next = a?:b;
-   return head.next;
+   return head;
 }
 
 /*
@@ -43,44 +57,52 @@ static struct list_head *merge(void *priv,
  * prev-link restoration pass, or maintaining the prev links
  * throughout.
  */
-static void merge_and_restore_back_links(void *priv,
-   int (*cmp)(void *priv, struct list_head *a,
-   struct list_head *b),
-   struct list_head *head,
-   struct list_head *a, struct list_head *b)
+__attribute__((nonnull(2,3,4,5)))
+static void merge_final(void *priv, cmp_func cmp, struct list_head *head,
+   struct list_head *a, struct list_head *b)
 {
struct list_head *tail = head;
u8 count = 0;
 
-   while (a && b) {
+   for (;;) {
/* if equal, take 'a' -- important for sort stability */
-   if ((*cmp)(priv, a, b) <= 0) {
+   if (cmp(priv, a, b) <= 0) {
tail->next = a;
a->prev = tail;
+   tail = a;
a = a->next;
+   if (!a)
+   break;
} else {
tail->next = b;
b->prev = tail;
+   tail = b;
b = b->next;
+   if (!b) {
+   b = a;
+   break;
+

[PATCH 5/5] lib/list_sort: Optimize number of calls to comparison function

2019-03-08 Thread George Spelvin
CONFIG_RETPOLINE has severely degraded indirect function call
performance, so it's worth putting some effort into reducing
the number of times cmp() is called.

This patch avoids badly unbalanced merges on unlucky input sizes.
It slightly increases the code size, but saves an average of 0.2*n
calls to cmp().

x86-64 code size 740 -> 820 bytes (+80)

Unfortunately, there's not a lot of low-hanging fruit in a merge sort;
it already performs only n*log2(n) - K*n + O(1) compares.  The leading
coefficient is already the lowest theoretically possible (log2(n!)
corresponds to K=1.4427), so we're fighting over the linear term, and
the best mergesort can do is K=1.2645, achieved when n is a power of 2.

The differences between mergesort variants appear when n is *not*
a power of 2; K is a function of the fractional part of log2(n).
Top-down mergesort does best of all, achieving a minimum K=1.2408, and
an average (over all sizes) K=1.248.  However, that requires knowing the
number of entries to be sorted ahead of time, and making a full pass
over the input to count it conflicts with a second performance goal,
which is cache blocking.

Obviously, we have to read the entire list into L1 cache at some point,
and performance is best if it fits.  But if it doesn't fit, each full
pass over the input causes a cache miss per element, which is undesirable.

While textbooks explain bottom-up mergesort as a succession of merging
passes, practical implementations do merging in depth-first order:
as soon as two lists of the same size are available, they are merged.
This allows as many merge passes as possible to fit into L1; only the
final few merges force cache misses.

This cache-friendly depth-first merge order depends on us merging the
beginning of the input as much as possible before we've even seen the
end of the input (and thus know its size).

The simple eager merge pattern causes bad performance when n is just
over a power of 2.  If n=1028, the final merge is between 1024- and
4-element lists, which is wasteful of comparisons.  (This is actually
worse on average than n=1025, because a 1024:1 merge will, on average,
end after 512 compares, while 1024:4 will walk 4/5 of the list.)

Because of this, bottom-up mergesort achieves K < 0.5 for such sizes,
and has an average (over all sizes) K of around 1.  (My experiments
show K=1.01, while theory predicts K=0.965.)

There are "worst-case optimal" variants of bottom-up mergesort which
avoid this bad performance, but the algorithms given in the literature,
such as queue-mergesort and boustrophedonic mergesort, depend on the
breadth-first multi-pass structure that we are trying to avoid.

This implementation is as eager as possible while ensuring that all merge
passes are at worst 1:2 unbalanced.  This achieves the same average
K=1.207 as queue-mergesort, which is 0.2*n better than bottom-up, and
only 0.04*n behind top-down mergesort.

Specifically, it merges two lists of size 2^k as soon as it is known
that there are 2^k additional inputs following.  This ensures that the
final uneven merges triggered by reaching the end of the input will be
at worst 2:1.  This will avoid cache misses as long as 3*2^k elements
fit into the cache.
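
To make the schedule concrete, here is a toy, sizes-only simulation of
that rule (an illustration, not code from the patch): pending[] holds
the sizes of the sorted sublists accumulated so far, oldest first; one
element is pulled from the input per iteration, and two equal-sized
pending lists are merged only once enough further input is known to
exist.

#include <stdio.h>

int main(void)
{
	unsigned pending[32], depth = 0, count, i, n = 16;

	for (count = 0; count < n; count++) {
		unsigned bits = count, j = depth;

		/* Skip one pending sublist per trailing 1-bit of count */
		while (bits & 1) {
			bits >>= 1;
			j--;
		}
		/* If any lists remain beyond those, merge the next two
		 * (they are always equal-sized); the final cleanup merges
		 * after the input ends are not shown. */
		if (bits) {
			pending[j - 2] += pending[j - 1];
			for (i = j - 1; i < depth - 1; i++)
				pending[i] = pending[i + 1];
			depth--;
		}
		pending[depth++] = 1;	/* pull one element from the input */

		printf("after %2u elements:", count + 1);
		for (i = 0; i < depth; i++)
			printf(" %u", pending[i]);
		printf("\n");
	}
	return 0;
}

Running it shows, e.g., that the two 2-element lists built from the
first four elements are merged only while pulling element 6, and the
two 4-element lists only while pulling element 12, so a merge forced
by hitting the end of the input is never worse than 2:1.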

(I confess to being more than a little bit proud of how clean this
code turned out.  It took a lot of thinking, but the resultant inner
loop is very simple and efficient.)

Refs:
  Bottom-up Mergesort: A Detailed Analysis
  Wolfgang Panny, Helmut Prodinger
  Algorithmica 14(4):340--354, October 1995
  https://doi.org/10.1007/BF01294131
  https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.5260

  The cost distribution of queue-mergesort, optimal mergesorts, and
  power-of-two rules
  Wei-Mei Chen, Hsien-Kuei Hwang, Gen-Huey Chen
  Journal of Algorithms 30(2):423--448, February 1999
  https://doi.org/10.1006/jagm.1998.0986
  https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.5380

  Queue-Mergesort
  Mordecai J. Golin, Robert Sedgewick
  Information Processing Letters, 48(5):253--259, 10 December 1993
  https://doi.org/10.1016/0020-0190(93)90088-q
  https://sci-hub.tw/10.1016/0020-0190(93)90088-Q

Signed-off-by: George Spelvin 
---
 lib/list_sort.c | 111 ++--
 1 file changed, 88 insertions(+), 23 deletions(-)

diff --git a/lib/list_sort.c b/lib/list_sort.c
index e4819ef0426b..06df9b283c40 100644
--- a/lib/list_sort.c
+++ b/lib/list_sort.c
@@ -113,11 +113,6 @@ static void merge_final(void *priv, cmp_func cmp, struct list_head *head,
  * @head: the list to sort
  * @cmp: the elements comparison function
  *
- * This function implements a bottom-up merge sort, which has O(nlog(n))
- * complexity.  We use depth-first order to take advantage of cacheing.
- * (I.e. when we get to the fourth element, we immediately merge the
- * first two 2-element lists.)
- *
  * The comparison function @cmp must return a negative value if @a
  * should sort before @b, and a positive value i

[PATCH 2/5] lib/sort: Use more efficient bottom-up heapsort variant

2019-03-08 Thread George Spelvin
This uses fewer comparisons than the previous code (61% as
many for large random inputs), but produces identical results;
it actually performs the exact same series of swap operations.

Standard heapsort, when sifting down, performs two comparisons
per level: One to find the greater child, and a second to see
if the current node should be exchanged with that child.

Bottom-up heapsort observes that it's better to postpone the second
comparison and search for the leaf where -infinity would be sent to,
then search back *up* for the current node's destination.

Since sifting down usually proceeds to the leaf level (that's where
half the nodes are), this does many fewer second comparisons.  That
saves a lot of (expensive since Spectre) indirect function calls.
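
In array terms, the bottom-up sift-down looks roughly like this
(illustration only, on an int max-heap; the actual patch works in byte
offsets through cmp_func/swap_func):

#include <stddef.h>

/* Sift the root of the max-heap a[0..n-1] down to its place. */
void sift_down(int a[], size_t n)
{
	size_t j = 0;
	int val = a[0];

	/* Phase 1: descend to a leaf, always taking the greater child;
	 * one compare per level. */
	while (2*j + 2 < n)
		j = a[2*j + 1] >= a[2*j + 2] ? 2*j + 1 : 2*j + 2;
	if (2*j + 1 < n)	/* bottom node may have a single child */
		j = 2*j + 1;

	/* Phase 2: climb back up to where the old root belongs;
	 * these are the "second" compares, usually only a few. */
	while (a[j] < val)
		j = (j - 1) / 2;

	/* Phase 3: drop val there; the rest of the path slides up. */
	while (j > 0) {
		int t = a[j];

		a[j] = val;
		val = t;
		j = (j - 1) / 2;
	}
	a[0] = val;
}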

The one time it's worse than the previous code is if there are large
numbers of duplicate keys, when the top-down algorithm is O(n) and
bottom-up is O(n log n).  For distinct keys, it's provably always better.

(The code is not significantly more complex.  This patch also merges
the heap-building and -extracting sift-down loops, resulting in a
net code size savings.)

x86-64 code size 885 -> 770 bytes (-115)

(I see the checkpatch complaint about "else if (n -= size)".
The alternative is significantly uglier.)

Signed-off-by: George Spelvin 
---
 lib/sort.c | 102 +++--
 1 file changed, 75 insertions(+), 27 deletions(-)

diff --git a/lib/sort.c b/lib/sort.c
index dff2ab2e196e..2aef4631e7d3 100644
--- a/lib/sort.c
+++ b/lib/sort.c
@@ -117,6 +117,32 @@ static void generic_swap(void *a, void *b, int size)
} while (n);
 }
 
+/**
+ * parent - given the offset of the child, find the offset of the parent.
+ * @i: the offset of the heap element whose parent is sought.  Non-zero.
+ * @lsbit: a precomputed 1-bit mask, equal to "size & -size"
+ * @size: size of each element
+ *
+ * In terms of array indexes, the parent of element j = i/size is simply
+ * (j-1)/2.  But when working in byte offsets, we can't use implicit
+ * truncation of integer divides.
+ *
+ * Fortunately, we only need one bit of the quotient, not the full divide.
+ * size has a least significant bit.  That bit will be clear if i is
+ * an even multiple of size, and set if it's an odd multiple.
+ *
+ * Logically, we're doing "if (i & lsbit) i -= size;", but since the
+ * branch is unpredictable, it's done with a bit of clever branch-free
+ * code instead.
+ */
+__attribute_const__ __always_inline
+static size_t parent(size_t i, unsigned int lsbit, size_t size)
+{
+   i -= size;
+   i -= size & -(i & lsbit);
+   return i / 2;
+}
+
 /**
  * sort - sort an array of elements
  * @base: pointer to data to sort
@@ -125,21 +151,26 @@ static void generic_swap(void *a, void *b, int size)
  * @cmp_func: pointer to comparison function
  * @swap_func: pointer to swap function or NULL
  *
- * This function does a heapsort on the given array. You may provide a
- * swap_func function optimized to your element type.
+ * This function does a heapsort on the given array.  You may provide a
+ * swap_func function if you need to do something more than a memory copy
+ * (e.g. fix up pointers or auxiliary data), but the built-in swap isn't
+ * usually a bottleneck.
  *
  * Sorting time is O(n log n) both on average and worst-case. While
  * qsort is about 20% faster on average, it suffers from exploitable
  * O(n*n) worst-case behavior and extra memory requirements that make
  * it less suitable for kernel use.
  */
-
 void sort(void *base, size_t num, size_t size,
  int (*cmp_func)(const void *, const void *),
  void (*swap_func)(void *, void *, int size))
 {
/* pre-scale counters for performance */
-   int i = (num/2 - 1) * size, n = num * size, c, r;
+   size_t n = num * size, a = (num/2) * size;
+   unsigned const lsbit = size & -size;/* Used to find parent */
+
+   if (!n)
+   return;
 
if (!swap_func) {
if (alignment_ok(base, size, 8))
@@ -150,30 +181,47 @@ void sort(void *base, size_t num, size_t size,
swap_func = generic_swap;
}
 
-   /* heapify */
-   for ( ; i >= 0; i -= size) {
-   for (r = i; r * 2 + size < n; r  = c) {
-   c = r * 2 + size;
-   if (c < n - size &&
-   cmp_func(base + c, base + c + size) < 0)
-   c += size;
-   if (cmp_func(base + r, base + c) >= 0)
-   break;
-   swap_func(base + r, base + c, size);
-   }
-   }
+   /*
+* Loop invariants:
+* 1. elements [a,n) satisfy the heap property (compare greater than
+*all of their children),
+* 2. elements [n,num*size) 

[PATCH 3/5] lib/sort: Avoid indirect calls to built-in swap

2019-03-08 Thread George Spelvin
Similar to what's being done in the net code, this takes advantage of
the fact that most invocations use only a few common swap functions, and
replaces indirect calls to them with (highly predictable) conditional
branches.  (The downside, of course, is that if you *do* use a custom
swap function, there are a few additional (highly predictable) conditional
branches on the code path.)

This actually *shrinks* the x86-64 code, because it inlines the various
swap functions inside do_swap, eliding function prologues & epilogues.

x86-64 code size 770 -> 709 bytes (-61)

Signed-off-by: George Spelvin 
---
 lib/sort.c | 45 -
 1 file changed, 36 insertions(+), 9 deletions(-)

diff --git a/lib/sort.c b/lib/sort.c
index 2aef4631e7d3..226a8c7e4b9a 100644
--- a/lib/sort.c
+++ b/lib/sort.c
@@ -117,6 +117,33 @@ static void generic_swap(void *a, void *b, int size)
} while (n);
 }
 
+typedef void (*swap_func_t)(void *a, void *b, int size);
+
+/*
+ * The values are arbitrary as long as they can't be confused with
+ * a pointer, but small integers make for the smallest compare
+ * instructions.
+ */
+#define U64_SWAP (swap_func_t)0
+#define U32_SWAP (swap_func_t)1
+#define GENERIC_SWAP (swap_func_t)2
+
+/*
+ * The function pointer is last to make tail calls most efficient if the
+ * compiler decides not to inline this function.
+ */
+static void do_swap(void *a, void *b, int size, swap_func_t swap_func)
+{
+   if (swap_func == U64_SWAP)
+   u64_swap(a, b, size);
+   else if (swap_func == U32_SWAP)
+   u32_swap(a, b, size);
+   else if (swap_func == GENERIC_SWAP)
+   generic_swap(a, b, size);
+   else
+   swap_func(a, b, size);
+}
+
 /**
  * parent - given the offset of the child, find the offset of the parent.
  * @i: the offset of the heap element whose parent is sought.  Non-zero.
@@ -151,10 +178,10 @@ static size_t parent(size_t i, unsigned int lsbit, size_t size)
  * @cmp_func: pointer to comparison function
  * @swap_func: pointer to swap function or NULL
  *
- * This function does a heapsort on the given array.  You may provide a
- * swap_func function if you need to do something more than a memory copy
- * (e.g. fix up pointers or auxiliary data), but the built-in swap isn't
- * usually a bottleneck.
+ * This function does a heapsort on the given array.  You may provide
+ * a swap_func function if you need to do something more than a memory
+ * copy (e.g. fix up pointers or auxiliary data), but the built-in swap
+ * avoids a slow retpoline and so is significantly faster.
  *
  * Sorting time is O(n log n) both on average and worst-case. While
  * qsort is about 20% faster on average, it suffers from exploitable
@@ -174,11 +201,11 @@ void sort(void *base, size_t num, size_t size,
 
if (!swap_func) {
if (alignment_ok(base, size, 8))
-   swap_func = u64_swap;
+   swap_func = U64_SWAP;
else if (alignment_ok(base, size, 4))
-   swap_func = u32_swap;
+   swap_func = U32_SWAP;
else
-   swap_func = generic_swap;
+   swap_func = GENERIC_SWAP;
}
 
/*
@@ -194,7 +221,7 @@ void sort(void *base, size_t num, size_t size,
if (a)  /* Building heap: sift down --a */
a -= size;
else if (n -= size) /* Sorting: Extract root to --n */
-   swap_func(base, base + n, size);
+   do_swap(base, base + n, size, swap_func);
else/* Sort complete */
break;
 
@@ -221,7 +248,7 @@ void sort(void *base, size_t num, size_t size,
c = b;  /* Where "a" belongs */
while (b != a) {/* Shift it into place */
b = parent(b, lsbit, size);
-   swap_func(base + b, base + c, size);
+   do_swap(base + b, base + c, size, swap_func);
}
}
 }
-- 
2.20.1



[PATCH 0/5] lib/sort & lib/list_sort: faster and smaller

2019-03-08 Thread George Spelvin
Because CONFIG_RETPOLINE has made indirect calls much more expensive,
I thought I'd try to reduce the number made by the library sort
functions.

The first three patches apply to lib/sort.c.

Patch #1 is a simple optimization.  The built-in swap has rarely-used
special cases for aligned 4- and 8-byte objects.  But that case almost
never happens; most calls to sort() work on larger structures, which
fall back to the byte-at-a-time loop.  This generalizes them to aligned
*multiples* of 4 and 8 bytes.  (If nothing else, it saves an awful lot
of energy by not thrashing the store buffers as much.)

(Issue for discussion: should the special-case swap loops be reduced to
two, an aligned-word and generic byte version?)
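
For illustration (a sketch of the generalized case, not the actual
patch), once both pointers are 8-byte aligned and the size is a
nonzero multiple of 8, the swap can go one 64-bit word at a time:

#include <stdint.h>
#include <stddef.h>

/* Caller guarantees 8-byte alignment and size % 8 == 0, size > 0. */
void swap_words_64(void *a, void *b, size_t size)
{
	uint64_t *pa = a, *pb = b;

	do {
		uint64_t t = *pa;

		*pa++ = *pb;
		*pb++ = t;
	} while (size -= 8);
}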

Patch #2 grabs a juicy piece of low-hanging fruit.  I agree that
nice simple solid heapsort is preferable to more complex algorithms
(sorry, Andrey), but it's possible to implement heapsort with 40% fewer
comparisons than the way it's been done up to now.  And with some care,
the code ends up smaller, as well.  This is the "big win" patch.

Patch #3 adds the same sort of indirect call bypass that has been added
to the net code of late.  The great majority of the callers use the
builtin swap functions, so replace the indirect call to sort_func with a
(highly predictable) series of if() statements.  Rather surprisingly,
this decreased code size, as the swap functions were inlined and their
prologue & epilogue code eliminated.

lib/list_sort.c is a bit trickier, as merge sort is already close to
optimal, and we don't want to introduce triumphs of theory over
practicality like the Ford-Johnson merge-insertion sort.

Patch #4, without changing the algorithm, chops 32% off the code size and
removes the part[MAX_LIST_LENGTH_BITS+1] pointer array (and the corresponding
upper limit on efficiently sortable input size).

Patch #5 improves the algorithm.  The previous code is already optimal
for power-of-two (or slightly smaller) size inputs, but when the input
size is just over a power of 2, there's a very unbalanced final merge.

There are, in the literature, several algorithms which solve this, but
they all depend on the "breadth-first" merge order which was replaced
by commit 835cc0c8477f with a more cache-friendly "depth-first" order.
Some hard thinking came up with a depth-first algorithm which defers
merges as little as possible while avoiding bad merges.  This saves
0.2*n compares, averaged over all sizes.

The code size increase is minimal (80 bytes on x86-64, reducing the net
savings to 24%), but the comments expanded significantly to document
the clever algorithm.


TESTING NOTES: I have some ugly user-space benchmarking code
which I used for testing before moving this code into the kernel.
Shout if you want a copy.

I'm running this code right now, with CONFIG_TEST_SORT and
CONFIG_TEST_LIST_SORT, but I confess I haven't rebooted since
the last round of minor edits to quell checkpatch.  I figure there
will be at least one round of comments and final testing.

George Spelvin (5):
  lib/sort: Make swap functions more generic
  lib/sort: Use more efficient bottom-up heapsort variant
  lib/sort: Avoid indirect calls to built-in swap
  lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS
  lib/list_sort: Optimize number of calls to comparison function

 include/linux/list_sort.h |   1 +
 lib/list_sort.c   | 225 --
 lib/sort.c| 250 ++
 3 files changed, 365 insertions(+), 111 deletions(-)

-- 
2.20.1



Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)

2016-12-28 Thread George Spelvin
Hannes Frederic Sowa wrote:
> We call extract_crng when we run out of batched entropy and reseed. How
> often we call down to extract_crng depends on how much entropy we
> extracted by calls to get_random_int/long, so the number of calls into
> those functions matter.
> 
> In extract_crng we have a timer which reseeds every 300s the CPRNG and
> either uses completely new entropy from the CRNG or calls down into the
> CPRNG while also doing backtracing protection (which feeds chacha's
> block size / 2 back into chacha, if I read the code correctly, thus
> 1024 bits, which should be enough).

In the current code, _extract_crng checks to see if more than 300 s
have elapsed since last time it was reseeded, and if so, reseeds with
fresh entropy.

In addition, on every read (or get_random_bytes), if the request leaves
enough random bits in the last ChaCha block, it feeds back 256 bits
(the ChaCha block size is 16*32 = 512 bits) for anti-backtracking.

If the last read happened to not fit under that limit (size % 512 >
256), *and* there are no calls for RNG output for a long time, there is
no upper limit to how long the old ChaCha key can hang around.

> On Fri, 2016-12-23 at 20:17 -0500, George Spelvin wrote:
>> For example, two mix-backs of 64 bits gives you 65 bit security, not 128.
>> (Because each mixback can be guessed and verified separately.)

> Exactly, but the full reseed after running out of entropy is strong
> enough to not be defeated by your argumentation. Neither the reseed
> from the CRNG.

Yes, I was just reacting to your original statement:

>>>>> couldn't we simply use 8 bytes of the 64 byte
>>>>> return block to feed it directly back into the state chacha?

It's not the idea that's bad, just the proposed quantity.


>> If you want that, I have a pile of patches to prandom I really
>> should push upstream.  Shall I refresh them and send them to you?

> I would like to have a look at them in the new year, certainly! I can
> also take care about the core prandom patches, but don't know if I have
> time to submit the others to the different subsystems.
>
> Maybe, if David would be okay with that, we can submit all patches
> through his tree, as he is also the dedicated maintainer for prandom.

Amazing, thank you very much!  They're just minor cleanups, nothing
too exciting.  I'll put it in the queue to make sure they're up to
date.


Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)

2016-12-23 Thread George Spelvin
Hannes Frederic Sowa wrote:
> On 24.12.2016 00:39, George Spelvin wrote:
>> We just finished discussing why 8 bytes isn't enough.  If you only
>> feed back 8 bytes, an attacker who can do 2^64 computation can find it
>> (by guessing and computing forward to verify the guess) and recover the
>> previous state.  You need to feed back at least as much output as your
>> security targete.  For /dev/urandom's ChaCha20, that's 256 bits.

> I followed the discussion but it appeared to me that this has the
> additional constraint of time until the next reseeding event happenes,
> which is 300s (under the assumption that calls to get_random_int happen
> regularly, which I expect right now). After that the existing reseeding
> mechansim will ensure enough backtracking protection. The number of
> bytes can easily be increased here, given that reseeding was shown to be
> quite fast already and we produce enough output. But I am not sure if
> this is a bit overengineered in the end?

I'm not following your description of how the time-based and call-based
mechanisms interact, but for any mix-back, you should either do enough
or none at all.  (Also called "catastrophic reseeding".)

For example, two mix-backs of 64 bits gives you 65 bit security, not 128.
(Because each mixback can be guessed and verified separately.)

> Also agreed. Given your solution below to prandom_u32, I do think it
> might also work without the seqlock now.

It's not technically a seqlock; in particular the reader doesn't
spin.  But the write side, and general logic is so similar it's
a good mental model.

Basically, assume a 64-byte buffer.  The reader has gone through
32 bytes of it, and has 32 left, and when he reads another 8 bytes,
has to distinguish three cases:

1) No update; we read the old bytes and there are now 32 - 8 = 24 bytes left.
2) Update completed while we weren't looking.  There are now new
   bytes in the buffer, and we took 8 leaving 64 - 8 = 56.
3) Update in progress at the time of the read.  We don't know if we
   are seeing old bytes or new bytes, so we have to assume the worst
   and not proceed unless 32 >= 8, but assume at the end there are
   64 - 8 = 56 new bytes left.

> I wouldn't have added a disable irqs, but given that I really like your
> proposal, I would take it in my todo branch and submit it when net-next
> window opens.

If you want that, I have a pile of patches to prandom I really
should push upstream.  Shall I refresh them and send them to you?


commit 4cf1b3d9f4fbccc29ffc2fe4ca4ff52ea77253f1
Author: George Spelvin 
Date:   Mon Aug 31 00:05:00 2015 -0400

net: Use prandom_u32_max()

It's slightly more efficient than "prandom_u32() % range"

The net/802 code was already efficient, but prandom_u32_max() is simpler.

In net/batman-adv/bat_iv_ogm.c, batadv_iv_ogm_fwd_send_time() got changed
from picking a random number of milliseconds and converting to jiffies to
picking a random number of jiffies, since the number of milliseconds (and
thus the conversion to jiffies) is a compile-time constant.  The equivalent
code in batadv_iv_ogm_emit_send_time was not changed, because the number
of milliseconds is variable.

In net/ipv6/ip6_flowlabel.c, ip6_flowlabel had htonl(prandom_u32()),
which is silly.  Just cast to __be32 without permuting the bits.

net/sched/sch_netem.c got adjusted to only need one call to prandom_u32
instead of 2.  (Assuming skb_headlen can't exceed 512 MiB, which is
hopefully safe for some time yet.)
    
    Signed-off-by: George Spelvin 

commit 9c8fb80e1fd2be42c35cab1af27187d600fd85e3
Author: George Spelvin 
Date:   Sat May 24 15:20:47 2014 -0400

mm/swapfile.c: Use prandom_u32_max()
    
It's slightly more efficient than "prandom_u32() % range"

    Signed-off-by: George Spelvin 

commit 2743eb01e5c5958fd88ae78d19c5fea772d4b117
Author: George Spelvin 
Date:   Sat May 24 15:19:53 2014 -0400

lib: Use prandom_u32_max()

It's slightly more efficient than "prandom_u32() % range"

Signed-off-by: George Spelvin 

commit 6a5e91bf395060a3351bfe5efc40ac20ffba2c1b
Author: George Spelvin 
Date:   Sat May 24 15:18:50 2014 -0400

fs/xfs: Use prandom_u32_max()

It's slightly more efficient than "prandom_u32() % range".

Also changed the last argument of xfs_error_test() from "unsigned long"
to "unsigned", since the code never did support values > 2^32, and
the largest value ever passed is 100.

The code could be improved even further by passing in 2^32/rf rather
than rf, but I'll leave that to some XFS developers.

Signed-off-by: George Spelvin 

commit 6f6d485d9179ca6ec4e30caa06ade0e0c6931810
Author: George Spelvin 
Date:   Sat May 24 15:00:17 2014 -0400

fs/ubifs: Use prandom_u32_max()

It's sl

Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)

2016-12-23 Thread George Spelvin
Hannes Frederic Sowa wrote:
> In general this looks good, but bitbuffer needs to be protected from
> concurrent access, thus needing at least some atomic instruction and
> disabling of interrupts for the locking if done outside of
> get_random_long. Thus I liked your previous approach more where you
> simply embed this in the already necessary get_random_long and aliased
> get_random_long as get_random_bits(BITS_PER_LONG) accordingly, IMHO.

It's meant to be part of the same approach, and I didn't include locking
because that's a requirement for *any* solution, and isn't specific
to the part I was trying to illustrate.

(As for re-using the name "get_random_long", that was just so
I didn't have to explain it.  Call it "get_random_long_internal"
if you like.)

Possible locking implementations include:
1) Use the same locking as applies to get_random_long_internal(), or
2) Make bitbuffer a per-CPU variable (note that we currently have 128
   bits of per-CPU state in get_random_int_hash[]), and this is all a
   fast-path to bypass heavier locking in get_random_long_internal().

>> But, I just realized I've been overlooking something glaringly obvious...
>> there's no reason you can't compute multple blocks in advance.
>
> In the current code on the cost of per cpu allocations thus memory.

Yes, but on 4-core machines it's still not much, and 4096-core
behemoths have RAM to burn.

> In the extract_crng case, couldn't we simply use 8 bytes of the 64 byte
> return block to feed it directly back into the state chacha? So we pass
> on 56 bytes into the pcpu buffer, and consume 8 bytes for the next
> state. This would make the window max shorter than the anti
> backtracking protection right now from 300s to 14 get_random_int calls.
> Not sure if this is worth it.

We just finished discussing why 8 bytes isn't enough.  If you only
feed back 8 bytes, an attacker who can do 2^64 computation can find it
(by guessing and computing forward to verify the guess) and recover the
previous state.  You need to feed back at least as much output as your
security target.  For /dev/urandom's ChaCha20, that's 256 bits.

>> For example, suppose we gave each CPU a small pool to minimize locking.
>> When one runs out and calls the global refill, it could refill *all*
>> of the CPU pools.  (You don't even need locking; there may be a race to
>> determine *which* random numbers the reader sees, but they're equally
>> good either way.)

> Yes, but still disabled interrupts, otherwise the same state could be
> used twice on the same CPU. Uff, I think we have this problem in
> prandom_u32.

There are some locking gotchas, but it is doable lock-free.

Basically, it's a seqlock.  The writer increments it once (to an odd
number) before starting to overwrite the buffer, and a second time (to
an even number) after.  "Before" and "after" mean smp_wmb().

The reader can use this to figure out how much of the data in the buffer
is safely fresh.  The full sequence of checks is a bit intricate,
but straightforward.

I didn't discuss the locking because I'm confident it's solvable,
not because I wasn't aware it has to be solved.

As for prandom_u32(), what's the problem?  Are you worried that
get_cpu_var disables preemption but not interrupts, and so an
ISR might return the same value as process-level code?

First of all, that's not a problem because prandom_u32() doesn't
have security guarantees.  Occasionally returning a duplicate number
is okay.

Second, if you do care, that could be trivially fixed by throwing
a barrier() in the middle of the code.  (Patch appended; S-o-B
if anyone wants it.)


diff --git a/lib/random32.c b/lib/random32.c
index c750216d..6bee4a36 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -55,16 +55,29 @@ static DEFINE_PER_CPU(struct rnd_state, net_rand_state);
  *
  * This is used for pseudo-randomness with no outside seeding.
  * For more random results, use prandom_u32().
+ *
+ * The barrier() is to allow prandom_u32() to be called from interrupt
+ * context without locking.  An interrupt will run to completion,
+ * updating all four state variables.  The barrier() ensures that
+ * the interrupted code will compute a different result.  Either it
+ * will have written s1 and s2 (so the interrupt will start with
+ * the updated values), or it will use the values of s3 and s4
+ * updated by the interrupt.
+ *
+ * (The same logic applies recursively to nested interrupts, trap
+ * handlers, and NMIs.)
  */
 u32 prandom_u32_state(struct rnd_state *state)
 {
+   register u32 x;
 #define TAUSWORTHE(s, a, b, c, d) ((s & c) << d) ^ (((s << a) ^ s) >> b)
-   state->s1 = TAUSWORTHE(state->s1,  6, 13, 4294967294U, 18U);
-   state->s2 = TAUSWORTHE(state->s2,  2, 27, 4294967288U,  2U);
-   state->s3 = TAUSWORTHE(state->s3, 13, 21, 4294967280U,  7U);
-   state->s4 = TAUSWORTHE(state->s4,  3, 12, 4294967168U, 13U);
+   x  = state->s1 = TAUSWORTHE(state->s1,  6, 13,   


Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)

2016-12-23 Thread George Spelvin
(Cc: list trimmed slightly as the topic is wandering a bit.)

Hannes Frederic Sowa wrote:
> On Thu, 2016-12-22 at 19:07 -0500, George Spelvin wrote:
>> Adding might_lock() annotations will improve coverage a lot.
>
> Might be hard to find the correct lock we take later down the code
> path, but if that is possible, certainly.

The point of might_lock() is that you don't have to.  You find the
worst case (most global) lock that the code *might* take if all the
buffer-empty conditions are true, and tell lockdep "assume this lock is
taken all the time".

>> Hannes Frederic Sowa wrote:
>>> Yes, that does look nice indeed. Accounting for bits instead of bytes
>>> shouldn't be a huge problem either. Maybe it gets a bit more verbose in
>>> case you can't satisfy a request with one batched entropy block and have
>>> to consume randomness from two.

For example, here's a simple bit-buffer implementation that wraps around
a get_random_long.  The bitbuffer is of the form "1xx...x", where the
x bits are valid, and the position of the msbit indicates how many bits
are valid.

extern unsigned long get_random_long();
static unsigned long bitbuffer = 1; /* Holds 0..BITS_PER_LONG-1 bits */
unsigned long get_random_bits(unsigned char bits)
{
/* We handle bits == BITS_PER_LONG, and not bits == 0 */
unsigned long mask = -1ul >> (BITS_PER_LONG - bits);
unsigned long val;

if (bitbuffer > mask) {
/* Request can be satisfied out of the bit buffer */
val = bitbuffer;
bitbuffer >>= bits;
} else {
/*
 * Not enough bits, but enough room in bitbuffer for the
 * leftovers.  avail < bits, so avail + 64 <= bits + 63.
 */
val = get_random_long();
bitbuffer = bitbuffer << (BITS_PER_LONG - bits)
| val >> 1 >> (bits-1);
}
return val & mask;
}

> When we hit the chacha20 without doing a reseed we only mutate the
> state of chacha, but being an invertible function in its own, a
> proposal would be to mix parts of the chacha20 output back into the
> state, which, as a result, would cause slowdown because we couldn't
> propagate the complete output of the cipher back to the caller (looking
> at the function _extract_crng).

Basically, yes.  Half of the output goes to rekeying itself.

But, I just realized I've been overlooking something glaringly obvious...
there's no reason you can't compute multiple blocks in advance.

The standard assumption in antibacktracking is that you'll *notice* the
state capture and stop trusting the random numbers afterward; you just
want the output *before* to be secure.  In other words, cops busting
down the door can't find the session key used in the message you just sent.

So you can compute and store random numbers ahead of need.

This can amortize the antibacktracking as much as you'd like.

For example, suppose we gave each CPU a small pool to minimize locking.
When one runs out and calls the global refill, it could refill *all*
of the CPU pools.  (You don't even need locking; there may be a race to
determine *which* random numbers the reader sees, but they're equally
good either way.)

> Or are you referring that the anti-backtrack protection should happen
> in every call from get_random_int?

If you ask for anti-backtracking without qualification, that's the
goal, since you don't know how long will elapse until the next call.  

It's like fsync().  There are lots of more limited forms of "keep my
data safe in case of a crash", but the most basic one is "if we lost
power the very instant the call returned, the data would be safe."


Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)

2016-12-22 Thread George Spelvin
Hannes Frederic Sowa wrote:
> A lockdep test should still be done. ;)

Adding might_lock() annotations will improve coverage a lot.

> Yes, that does look nice indeed. Accounting for bits instead of bytes
> shouldn't be a huge problem either. Maybe it gets a bit more verbose in
> case you can't satisfy a request with one batched entropy block and have
> to consume randomness from two.

The bit granularity is also for the callers' convenience, so they don't
have to mask again.  Whether get_random_bits rounds up to byte boundaries
internally or not is something else.

When the current batch runs low, I was actually thinking of throwing
away the remaining bits and computing a new batch of 512.  But it's
whatever works best at implementation time.

>>> It could only mix the output back in every two calls, in which case
>>> you can backtrack up to one call but you need to do 2^128 work to
>>> backtrack farther.  But yes, this is getting excessively complicated.
> 
>> No, if you're willing to accept limited backtrack, this is a perfectly
>> acceptable solution, and not too complicated.  You could do it phase-less
>> if you like; store the previous output, then after generating the new
>> one, mix in both.  Then overwrite the previous output.  (But doing two
>> rounds of a crypto primtive to avoid one conditional jump is stupid,
>> so forget that.)

> Can you quickly explain why we lose the backtracking capability?

Sure.  An RNG is (state[i], output[i]) = f(state[i-1]).  The goal of
backtracking is to compute output[i], or better yet state[i-1], given
state[i].

For example, consider an OFB or CTR mode generator.  The state is a key
and an IV, and you encrypt the IV with the key to produce output, then
either replace the IV with the output, or increment it.  Either way,
since you still have the key, you can invert the transformation and
recover the previous IV.

The standard way around this is to use the Davies-Meyer construction:

IV[i] = IV[i-1] + E(IV[i-1], key)

This is the standard way to make a non-invertible random function
out of an invertible random permutation.
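
As a sketch of that shape (toy_E below is just a stand-in keyed
permutation, not a real cipher or any kernel function):

#include <stdint.h>

/* An invertible mixing of "block" under "key"; NOT a real cipher. */
uint64_t toy_E(uint64_t block, uint64_t key)
{
	uint64_t x = block ^ key;

	x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27; x *= 0x94d049bb133111ebULL;
	x ^= x >> 31;
	return x;
}

uint64_t davies_meyer_step(uint64_t iv, uint64_t key)
{
	/* The "+ iv" feed-forward is what makes the step one-way:
	 * given the result, recovering iv means guessing it and
	 * re-running E to check. */
	return iv + toy_E(iv, key);
}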

From the sum, there's no easy way to find the ciphertext *or* the
plaintext that was encrypted.  Assuming the encryption is secure,
the only way to reverse it is brute force: guess IV[i-1] and run the
operation forward to see if the resultant IV[i] matches.

There are a variety of ways to organize this computation, since the
guess gives you both IV[i-1] and E(IV[i-1], key) = IV[i] - IV[i-1], including
running E forward, backward, or starting from both ends to see if you
meet in the middle.

The way you add the encryption output to the IV is not very important.
It can be addition, xor, or some more complex invertible transformation.
In the case of SipHash, the "encryption" output is smaller than the
input, so we have to get a bit more creative, but it's still basically
the same thing.

The problem is that the output which is combined with the IV is too small.
With only 64 bits, trying all possible values is practical.  (The world's
Bitcoin miners are collectively computing SHA-256(SHA-256(input)) 1.7 * 2^64
times per second.)

By basically doing two iterations at once and mixing in 128 bits of
output, the guessing attack is rendered impractical.  The only downside
is that you need to remember and store one result between when it's
computed and last used.  This is part of the state, so an attack can
find output[i-1], but not anything farther back.

> ChaCha as a block cipher gives a "perfect" permutation from the output
> of either the CRNG or the CPRNG, which actually itself has backtracking
> protection.

I'm not quite understanding.  The /dev/random implementation uses some
of the ChaCha output as a new ChaCha key (that's another way to mix output
back into the state) to prevent backtracking.  But this slows it down, and
again if you want to be efficient, you're generating and storing large batches
of entropy and storing it in the RNG state.



Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)

2016-12-22 Thread George Spelvin
> I do tend to like Ted's version in which we use batched
> get_random_bytes() output.  If it's fast enough, it's simpler and lets
> us get the full strength of a CSPRNG.

With the ChaCha20 generator, that's fine, although note that this abandons
anti-backtracking entirely.

It also takes locks, something the previous get_random_int code
path avoided.  Do we need to audit the call sites to ensure that's safe?

And there is the issue that the existing callers assume that there's a
fixed cost per word.  A good half of get_random_long calls are followed by
"& ~PAGE_MASK" to extract the low 12 bits.  Or "& ((1ul << mmap_rnd_bits)
- 1)" to extract the low 28.  If we have a buffer we're going to have to
pay to refill, it would be nice to use less than 8 bytes to satisfy those.

But that can be a followup patch.  I'm thinking

unsigned long get_random_bits(unsigned bits)
E.g. get_random_bits(PAGE_SHIFT),
 get_random_bits(mmap_rnd_bits),
u32 imm_rnd = get_random_bits(32)

unsigned get_random_mod(unsigned modulus)
E.g. get_random_mod(hole) & ~(alignment - 1);
 get_random_mod(port_scan_backoff)
(Although probably drivers/s390/scsi/zfcp_fc.c should be changed
to prandom.)

with, until the audit is completed:
#define get_random_int() get_random_bits(32)
#define get_random_long() get_random_bits(BITS_PER_LONG)

> It could only mix the output back in every two calls, in which case
> you can backtrack up to one call but you need to do 2^128 work to
> backtrack farther.  But yes, this is getting excessively complicated.

No, if you're willing to accept limited backtrack, this is a perfectly
acceptable solution, and not too complicated.  You could do it phase-less
if you like; store the previous output, then after generating the new
one, mix in both.  Then overwrite the previous output.  (But doing two
rounds of a crypto primitive to avoid one conditional jump is stupid,
so forget that.)

>> Hmm, interesting.  Although, for ASLR, we could use get_random_bytes()
>> directly and be done with it.  It won't be a bottleneck.

Isn't that what you already suggested?

I don't mind fewer primitives; I got a bit fixated on "Replace MD5 with
SipHash".  It's just the locking that I want to check isn't a problem.


Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)

2016-12-22 Thread George Spelvin
> Having slept on this, I like it less.  The problem is that a
> backtracking attacker doesn't just learn H(random seed || entropy_0 ||
> secret || ...) -- they learn the internal state of the hash function
> that generates that value.  This probably breaks any attempt to apply
> security properties of the hash function.  For example, the internal
> state could easily contain a whole bunch of prior outputs it in
> verbatim.

The problem is that anti-backtracking is in severe tension with your
desire to use unmodified SipHash.

First of all, I'd like to repeat that it isn't a design goal of the current
generator and isn't necessary.

The current generator just returns hash[0] from the MD5 state, then
leaves the state stored.  The fact that it conceals earlier outputs is
an accident of the Davies-Meyer structure of md5_transform.

It isn't necessary, because every secret generated is stored unencrypted
for as long as it's of value.  A few values are used for retransmit
backoffs and random MAC addresses.  Those are revealed to the world as
soon as they're used.

Most values are used for ASLR.  These addresses are of interest to an
attacker trying to mount a buffer-overflow attack, but that only lasts
as long as the process is running to receive buffers.  After the process
exits, knowledge of its layout is worthless.

And this is stored as long as it's running in kernel-accessible data,
so storing a *second* copy in less conveniently kernel-accessible data
(the RNG state) doesn't make the situation any worse.


In addition to the above, if you're assuming a state capture, then
since we have (for necessary efficiency reasons) a negligible amount of
fresh entropy, an attacker has the secret key and can predict *future*
outputs very easily.

Given that information, an attacker doesn't need to learn the layout of
vulnerable server X.  If they have a buffer overflow, they can crash
the current instance and wait for a fresh image to be started (with a
known address space) at which to launch their attack.


Kernel state capture attacks are a very unlikely attack, mostly because
it's a narrow target a hair's breadth away from the much more interesting
outright kernel compromise (attacker gains write access as well as read)
which renders all this fancy cryptanalysis moot.


Now, the main point:  it's not likely to be solvable.

The problem with unmodified SipHash is that it has only 64 bits of
output.  No mix-back mechanism can get around the fundamental problem
that that's too small to prevent a brute-force guessing attack.  You need
wider mix-back.  And getting more output from unmodified SipHash requires
more finalization rounds, which is expensive.

(Aside: 64 bits does have the advantage that it can't be brute-forced on
the attacked machine.  It must be exfiltrated to the attacker, and the
solution returned to the attack code.  But doing this is getting easier
all the time.)

Adding antibacktracking to SipHash is trivial: just add a Davies-Meyer
structure around its internal state.  Remember the internal state before
hashing in the entropy and secret, generate the output, then add the
previous and final states together for storage.
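
A minimal sketch of that construction, with assumed siphash_mix() and
siphash_final() helpers (absorb one word with SipRounds, and finalize a
copy of the state); the real SipHash code is not split into pieces
like this:

#include <stdint.h>

extern void siphash_mix(uint64_t v[4], uint64_t input);
extern uint64_t siphash_final(const uint64_t v[4]);

static uint64_t v[4];			/* per-CPU in the real thing */

uint64_t gen_one_output(uint64_t entropy, const uint64_t secret[2])
{
	uint64_t prev[4] = { v[0], v[1], v[2], v[3] };	/* remember old state */
	uint64_t out;
	int i;

	siphash_mix(v, entropy);	/* hash in the per-call entropy... */
	siphash_mix(v, secret[0]);	/* ...and the secret */
	siphash_mix(v, secret[1]);

	out = siphash_final(v);		/* output from a finalized copy */

	for (i = 0; i < 4; i++)		/* feed-forward: old + new is one-way */
		v[i] += prev[i];

	return out;
}

Recovering the pre-call state from the stored sum requires inverting the
addition without knowing either addend, which is the usual Davies-Meyer
argument.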

This is a standard textbook construction, very cheap, and doesn't touch
the compression function which is the target of analysis and attacks,
but it requires poking around in SipHash's internal state.  (And people
who read the textbooks without understanding them will get upset because
the textbooks all talk about using this construction with block ciphers,
and SipHash's compression function is not advertised as a block cipher.)

Alternative designs exist; you could extract additional output from
earlier rounds of SipHash, using the duplex sponge construction you
mentioned earlier.  That output would be used for mixback purposes *only*,
so wouldn't affect the security proof of the "primary" output.
But this is also getting creative with SipHash's internals.


Now, you could use a completely *different* cryptographic primitive
to enforce one-way-ness, and apply SipHash as a strong output transform,
but that doesn't feel like good design, and is probably more expensive.


Finally, your discomfort about an attacker learning the internal state...
if an attacker knows the key and the input, they can construct the
internal state.  Yes, we could discard the internal state and construct
a fresh one on the next call to get_random_int, but what are you going
to key it with?  What are you going to feed it?  What keeps *that*
internal state any more secret from an attacker than one that's explicitly
stored?

Keeping the internal state around is a cacheing optimization, that's all.

*If* you're assuming a state capture, the only thing secret from the
attacker is any fresh entropy collected between the time of capture
and the time of generation.  Due to mandatory efficiency requirements,
this is very small.  

I really think you're wishing for the impossible here.


A final note: although I'm disagreeing with you, 


Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)

2016-12-22 Thread George Spelvin
> True, but it's called get_random_int(), and it seems like making it
> stronger, especially if the performance cost is low to zero, is a good
> thing.

If it's cheap enough, I don't mind.  But it's documented as being
marginal-quality, used where speed is more important.

In particular, it's *not* used for key material, only for values that
matter only as long as they are in use.  While they're in use, they can't
be concealed from an attacker with kernel access, and when they're done
being used, they're worthless.

>> If you want anti-backtracking, though, it's easy to add.  What we
>> hash is:
>>
>> entropy_0 || secret || output_0 || entropy_1 || secret || output_1 || ...
>>
>> You mix the output word right back in to the (unfinalized) state after
>> generating it.  This is still equivalent to unmodified back-box SipHash,
>> you're just using a (conceptually independent) SipHash invocation to
>> produce some of its input.

> Ah, cute.  This could probably be sped up by doing something like:
>
> entropy_0 || secret || output_0 ^ entropy_1 || secret || ...

I'm disinclined to do that because that requires deferring the mixing
until the *next* time you generate something.  Storing the value you
don't want revealed by a state capture defeats the purpose.

I'd rather mix it in immediately, so you have anti-backtracking from
the moment of creation.

Also, I don't think it's particularly "cute" or clever; mixing output back
in is the standard way all antibacktracking is accomplished.  It's how
the Davies-Meyer hash construction makes a reversible function one-way.

(There is a second way to do it by throwing away state, but that's
expensive in seed entropy.)

> It's a little weak because the output is only 64 bits, so you could
> plausibly backtrack it on a GPU or FPGA cluster or on an ASIC if the
> old entropy is guessable.  I suspect there are sneaky ways around it
> like using output_n-1 ^ output_n-2 or similar.  I'll sleep on it.

Ah, yes, I see.  Given the final state, you guess the output word, go
back one round, then forward the finalization rounds.   Is the output
equal to the guessed output?  You'll find the true value, plus
Poisson(1 - 2^-64) additional.  (Since you have 2^64-1 chances at
something with probability 1 in 2^64.)

And this can be iterated as often as you like to get earlier output words,
as long as you can guess the entropy.  *That's* the part that hurts;
you'd like something that peters out.

You could use the double-length-output SipHash variant (which requires
a second set of finalization rounds) and mix more output back, but
that's expensive.

The challenge is coming up with more unpredictable data to mix in than one
invocation of SipHash returns.  And without storing previous output
anywhere, because that is exactly wrong.

A running sum or xor or whatever of the outputs doesn't help, because
once you've guessed the last output, you can backtrack that for no
additional effort.

State capture is incredibly difficult, and our application doesn't require
resistance anyway... unless you can think of something cheap, I think
we can just live with this.

>> I'd *like* to persuade you that skipping the padding byte wouldn't
>> invalidate any security proofs, because it's true and would simplify
>> the code.  But if you want 100% stock, I'm willing to cater to that.

> I lean toward stock in the absence of a particularly good reason.  At
> the very least I'd want to read that paper carefully.

Er... adding the length is standard Merkle-Damgaard strengthening.
Why you do this is described in the original papers by Merkle and Damgaard.

The lay summary is at
https://en.wikipedia.org/wiki/Merkle-Damgard_construction

The original sources are:
http://www.merkle.com/papers/Thesis1979.pdf
http://saluc.engr.uconn.edu/refs/algorithms/hashalg/damgard89adesign.pdf

Merkle describes the construction; Damgaard proves it secure.  Basically,
appending the length is required to handle variable-length input if the
input is not itself self-delimiting.

The proof of security is theorem 3.1 in the latter.  (The first, more
detailed explanation involves the use of an extra bit, which the second
then explains how to do without.)

In particular, see the top of page 420, which notes that the security
proof only requires encoding *how much padding is added* in the final
block, not the overall length of the message, and the second remark on
p. 421 which notes that no such suffix is required if it's not necessary
to distinguish messages with different numbers of trailing null bytes.

The rules are alluded to in the "Choice of padding rule" part of the
"Rationale" section of the SipHash paper (p. 7), but the description is
very brief because it assumes the reader has the background.

That's why they say "We could have chosen a slightly simpler padding rule,
such as appending a 80 byte followed by zeroes."

The thing is, if the amount of the last block that is used is fixed
(within the domain of a particular key), you don't 


Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)

2016-12-21 Thread George Spelvin
Andy Lutomirski wrote:
> I don't even think it needs that.  This is just adding a
> non-destructive final operation, right?

It is, but the problem is that SipHash is intended for *small* inputs,
so the standard implementations aren't broken into init/update/final
functions.

There's just one big function that keeps the state variables in
registers and never stores them anywhere.

If we *had* init/update/final functions, then it would be trivial.

> Just to clarify, if we replace SipHash with a black box, I think this
> effectively means, where "entropy" is random_get_entropy() || jiffies
> || current->pid:

> The first call returns H(random seed || entropy_0 || secret).  The
> second call returns H(random seed || entropy_0 || secret || entropy_1
> || secret).  Etc.

Basically, yes.  I was skipping the padding byte and keying the
finalization rounds on the grounds of "can't hurt and might help",
but we could do it a more standard way.

> If not, then I have a fairly strong preference to keep whatever
> construction we come up with consistent with something that could
> actually happen with invocations of unmodified SipHash -- then all the
> security analysis on SipHash goes through.

Okay.  I don't think it makes a difference, but it's not a *big* waste
of time.  If we have finalization rounds, we can reduce the secret
to 128 bits.

If we include the padding byte, we can do one of two things:
1) Make the secret 184 bits, to fill up the final partial word as
   much as possible, or
2) Make the entropy 1 byte smaller and conceptually misalign the
   secret.  What we'd actually do is remove the last byte of
   the secret and include it in the entropy words, but that's
   just a rotation of the secret between storage and hashing.

Also, I assume you'd like SipHash-2-4, since you want to rely
on a security analysis.

(Regarding the padding byte, getting it right might be annoying
to do exactly.  All of the security analysis depends *only* on
its low 3 bits indicating how much of the final block is used.
As it says in the SipHash paper, they included 8 bits just because
it was easy.  But if you want it exact, it's just one more byte of
state.)

> The one thing I don't like is
> that I don't see how to prove that you can't run it backwards if you
> manage to acquire a memory dump.  In fact, I that that there exist, at
> least in theory, hash functions that are secure in the random oracle
> model but that *can* be run backwards given the full state.  From
> memory, SHA-3 has exactly that property, and it would be a bit sad for
> a CSPRNG to be reversible.

Er...  get_random_int() is specifically *not* designed to be resistant
to state capture, and I didn't try.  Remember, what it's used for
is ASLR; what we're worried about is someone learning the layouts
of still-running processes, and if you get a memory dump, you have
the memory layout!

If you want anti-backtracking, though, it's easy to add.  What we
hash is:

entropy_0 || secret || output_0 || entropy_1 || secret || output_1 || ...

You mix the output word right back in to the (unfinalized) state after
generating it.  This is still equivalent to unmodified black-box SipHash,
you're just using a (conceptually independent) SipHash invocation to
produce some of its input.

Each output is produced by copying the state, padding & finalizing after the
secret.


In fact, to make our lives easier, let's define the secret to end with
a counter byte that happens to be equal to the padding byte.  The input
stream will be:

Previous output: 8 (or 4 for HalfSipHash) bytes
Entropy: 15 bytes (8 bytes timer, 4 bytes jiffies, 3 bytes pid)
Secret: 16 bytes
Counter: 1 byte
...repeat...
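
Written out as a structure, that per-call block is exactly five 64-bit
words in the full-width case (illustrative field names only, not taken
from any existing header):

#include <stdint.h>

struct hash_input {
	uint64_t prev_output;	/* 8 bytes (4 for HalfSipHash) */
	uint8_t  entropy[15];	/* 8 bytes timer, 4 bytes jiffies, 3 bytes pid */
	uint8_t  secret[16];	/* global key material */
	uint8_t  counter;	/* doubles as the padding byte */
} __attribute__((packed));	/* 8 + 15 + 16 + 1 = 40 bytes = five 64-bit words */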

> We could also periodically mix in a big (128-bit?) chunk of fresh
> urandom output to keep the bad guys guessing.

Simpler and faster to just update the global master secret.
The state is per-CPU, so mixing in has to be repeated per CPU.


With these changes, I'm satisfied that it's secure, cheap, has a
sufficiently wide state size, *and* all standard SipHash analysis applies.

The only remaining issues are:
1) How many rounds, and
2) May we use HalfSipHash?

I'd *like* to persuade you that skipping the padding byte wouldn't
invalidate any security proofs, because it's true and would simplify
the code.  But if you want 100% stock, I'm willing to cater to that.

Ted, what do you think?


Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage

2016-12-21 Thread George Spelvin
> Plus the benchmark was bogus anyway, and when I built a more specific
> harness -- actually comparing the TCP sequence number functions --
> SipHash was faster than MD5, even on register starved x86. So I think
> we're fine and this chapter of the discussion can come to a close, in
> order to move on to more interesting things.

Do we have to go through this?  No, the benchmark was *not* bogus.

Here are my results from *your* benchmark.  I can't reboot some of my test
machines, so I took net/core/secure_seq.c, lib/siphash.c, lib/md5.c and
include/linux/siphash.h straight out of your test tree.

Then I replaced the kernel #includes with the necessary typedefs
and #defines to make it compile in user-space.  (Voluminous but
straightforward.)  E.g.

#define __aligned(x) __attribute__((__aligned__(x)))
#define cacheline_aligned __aligned(64)
#define CONFIG_INET 1
#define IS_ENABLED(x) 1
#define ktime_get_real_ns() 0
#define sysctl_tcp_timestamps 0

... etc.

Then I modified your benchmark code into the appended code.  The
differences are:
* I didn't iterate 100K times, I timed the functions *once*.
* I saved the times in a buffer and printed them all at the end
  so printf() wouldn't pollute the caches.
* Before every even-numbered iteration, I flushed the I-cache
  of everything from _init to _fini (i.e. all the non-library code).
  This cold-cache case is what is going to happen in the kernel.

In the results below, note that I did *not* re-flush between phases
of the test.  The effects of cacheing are clearly apparent in the tcpv4
results, where the tcpv6 code loaded the cache.

You can also see that the SipHash code benefits more from cacheing when
entered with a cold cache, as it iterates over the input words, while
the MD5 code is one big unrolled blob.

Order of computation is down the columns first, across second.

The P4 results were:
tcpv6 md5 cold:     4084 3488 3584 3584 3568
tcpv4 md5 cold:     1052  996  996 1060  996
tcpv6 siphash cold: 4080 3296 3312 3296 3312
tcpv4 siphash cold: 2968 2748 2972 2716 2716
tcpv6 md5 hot:       900  712  712  712  712
tcpv4 md5 hot:       632  672  672  672  672
tcpv6 siphash hot:  2484 2292 2340 2340 2340
tcpv4 siphash hot:  1660 1560 1564 2340 1564

SipHash actually wins slightly in the cold-cache case, because
it iterates more.  In the hot-cache case, it loses horribly.

Core 2 duo:
tcpv6 md5 cold:     3396 2868 2964 3012 2832
tcpv4 md5 cold:     1368 1044 1320 1332 1308
tcpv6 siphash cold: 2940 2952 2916 2448 2604
tcpv4 siphash cold: 3192 2988 3576 3504 3624
tcpv6 md5 hot:      1116 1032  996 1008 1008
tcpv4 md5 hot:       936  936  936  936  936
tcpv6 siphash hot:  1200 1236 1236 1188 1188
tcpv4 siphash hot:   936  804  804  804  804

Pretty much a tie, honestly.

Ivy Bridge:
tcpv6 md5 cold:     6086 6136 6962 6358 6060
tcpv4 md5 cold:      816  732 1046 1054 1012
tcpv6 siphash cold: 3756 1886 2152 2390 2566
tcpv4 siphash cold: 3264 2108 3026 3120 3526
tcpv6 md5 hot:      1062  808  824  824  832
tcpv4 md5 hot:       730  730  740  748  748
tcpv6 siphash hot:   960  952  936 1112  926
tcpv4 siphash hot:   638  544  562  552  560

Modern processors *hate* cold caches.  But notice how md5 is *faster*
than SipHash on hot-cache IPv6.

Ivy Bridge, -m64:
tcpv6 md5 cold:     4680 3672 3956 3616 3525
tcpv4 md5 cold:     1066 1416 1179 1179 1134
tcpv6 siphash cold:  940 1258 1995 1609 2255
tcpv4 siphash cold: 1440 1269 1292 1870 1621
tcpv6 md5 hot:      1372 1122 1088 1088
tcpv4 md5 hot:       997  997  997  997  998
tcpv6 siphash hot:   340  340  340  352  340
tcpv4 siphash hot:   227  238  238  238  238

Of course, when you compile -m64, SipHash is unbeatable.


Here's the modified benchmark() code.  The entire package is
a bit voluminous for the mailing list, but anyone is welcome to it.

static void clflush(void)
{
extern char const _init, _fini;
char const *p = &_init;

while (p < &_fini) {
asm("clflush %0" : : "m" (*p));
p += 64;
}
}

typedef uint32_t cycles_t;
static cycles_t get_cycles(void)
{
uint32_t eax, edx;
asm volatile("rdtsc" : "=a" (eax), "=d" (edx));
return eax;
}

static int benchmark(void)
{
cycles_t start, finish;
int i;
u32 seq_number = 0;
__be32 saddr6[4] = { 1, 4, 182, 393 }, daddr6[4] = { 9192, 18288, 
222, 0xff10 };
__be32 saddr4 = 2, daddr4 = 182112;
__be16 sport = 22, dport = 41992;


Re: HalfSipHash Acceptable Usage

2016-12-21 Thread George Spelvin
As a separate message, to disentangle the threads, I'd like to
talk about get_random_long().

After some thinking, I still like the "state-preserving" construct
that's equivalent to the current MD5 code.  Yes, we could just do
siphash(current_cpu || per_cpu_counter, global_key), but it's nice to
preserve a bit more.

It requires library support from the SipHash code to return the full
SipHash state, but I hope that's a fair thing to ask for.

Here's my current straw man design for comment.  It's very similar to
the current MD5-based design, but feeds all the seed material in the
"correct" way, as opposed to Xring directly into the MD5 state.

* Each CPU has a (Half)SipHash state vector,
  "unsigned long get_random_int_hash[4]".  Unlike the current
  MD5 code, we take care to initialize it to an asymmetric state.

* There's a global 256-bit random_int_secret (which we could
  reseed periodically).

To generate a random number:
* If get_random_int_hash is all-zero, seed it with a fresh half-sized
  SipHash key and the appropriate XOR constants.
* Generate three words of random_get_entropy(), jiffies, and current->pid.
  (This is arbitrary seed material, copied from the current code.)
* Crank through that with (Half)SipHash-1-0.
* Crank through the random_int_secret with (Half)SipHash-1-0.
* Return v1 ^ v3.  (A code sketch of this procedure follows.)
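
Roughly, in code (sketch only: per-CPU plumbing, seeding, and the
all-zero check are omitted; the rotation constants are the HalfSipHash
reference ones; the entropy words are passed in as parameters for
simplicity):

#include <stdint.h>

static inline uint32_t rol32(uint32_t x, int r)
{
	return (x << r) | (x >> (32 - r));
}

static uint32_t hash[4];	/* per-CPU get_random_int_hash[] in the real thing */
static uint32_t secret[8];	/* global 256-bit random_int_secret */

/* Absorb one 32-bit word: XOR before, one HalfSipHash round, XOR after. */
static void hsip_mix(uint32_t v[4], uint32_t m)
{
	v[3] ^= m;
	v[0] += v[1]; v[1] = rol32(v[1], 5);  v[1] ^= v[0]; v[0] = rol32(v[0], 16);
	v[2] += v[3]; v[3] = rol32(v[3], 8);  v[3] ^= v[2];
	v[0] += v[3]; v[3] = rol32(v[3], 7);  v[3] ^= v[0];
	v[2] += v[1]; v[1] = rol32(v[1], 13); v[1] ^= v[2]; v[2] = rol32(v[2], 16);
	v[0] ^= m;
}

uint32_t get_random_int_sketch(uint32_t timer, uint32_t jiffies, uint32_t pid)
{
	int i;

	/* Per-call seed material first: the part that might be fresh. */
	hsip_mix(hash, timer);
	hsip_mix(hash, jiffies);
	hsip_mix(hash, pid);

	/* Then the global secret, which doubles as keyed finalization. */
	for (i = 0; i < 8; i++)
		hsip_mix(hash, secret[i]);

	return hash[1] ^ hash[3];	/* v1 ^ v3 */
}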

Here are the reasons:
* The first step is just paranoia, but SipHash's security promise depends
  on starting with an asymmetric state, we want unique per-CPU states,
  and it's a one-time cost.
* When the input words are themselves secret, there's no security
  advantage, and almost no speed advantage, to doing two rounds for one
  input word versus two words with one round each.  Thus, SipHash-1.
* The above is not exactly true, due to the before+after XOR pattern
  that SipHash uses, but I think it's true anyway.
* Likewise, there's no benefit to unkeyed finalization rounds over keyed
  ones.  That's why I just enlarged the global secret.
* The per-call seed material is hashed first on general principles,
  because that's the novel part that might have fresh entropy.
* To the extent the initial state is secret, the rounds processing the
  global secret are 4 finalization rounds for the initial state and
  the per-call entropy.
* The final word(s) of the global secret might be vulnerable to analysis,
  due to incomplete mixing, but since the global secret is always hashed
  in the same order, and is larger than the desired security level, the
  initial words should be secure.
* By carrying forward the full internal state, we ensure that repeated
  calls return different results, and to the extent that the per-call
  seed material has entropy, it's preserved.
* The final return is all that's needed, since the last steps in the 
  SipRound are "v1 ^= v2" and "v3 ^= v0".  It's no security loss,
  and a very minor speedup.
* Also, this avoids directly "exposing" the final XOR with the last
  word of the global secret (which is XORed into v0).

If I'm allowed to use full SipHash, some shortcuts can be taken,
but I believe the above would be secure with HalfSipHash.

If additional performance is required, I'd consider shrinking the
global secret to 192 bits on 32-bit machines, but I want more than
128 bits of key material, and enough rounds to be equivalent to 4
finalization rounds.


Re: HalfSipHash Acceptable Usage

2016-12-21 Thread George Spelvin
Theodore Ts'o wrote:
> On Wed, Dec 21, 2016 at 01:37:51PM -0500, George Spelvin wrote:
>> SipHash annihilates the competition on 64-bit superscalar hardware.
>> SipHash dominates the field on 64-bit in-order hardware.
>> SipHash wins easily on 32-bit hardware *with enough registers*.
>> On register-starved 32-bit machines, it really struggles.

> And "with enough registers" includes ARM and MIPS, right?

Yes.  As a matter of fact, 32-bit ARM does particularly well
on 64-bit SipHash due to its shift+op instructions.

There is a noticeable performance drop, but nothing catastrophic.

The main thing I've been worried about is all the flow tracking
and NAT done by small home routers, and that's addressed by using
HalfSipHash for the hash tables.  They don't *initiate* a lot of
TCP sessions.

> So the only
> real problem is 32-bit x86, and you're right, at that point, only
> people who might care are people who are using a space-radiation
> hardened 386 --- and they're not likely to be doing high throughput
> TCP connections.  :-)

The only requirement on performance is "don't make DaveM angry." :-)

I was just trying to answer the question of why we *worried* about the
performance, not specifically argue that we *should* use HalfSipHash.


Re: HalfSipHash Acceptable Usage

2016-12-21 Thread George Spelvin
Eric Dumazet wrote:
> Now I am quite confused.
>
> George said :
>> Cycles per byte on 1024 bytes of data:
>>                  Pentium  Core 2  Ivy
>>                  4        Duo     Bridge
>> SipHash-2-4      38.9     8.3     5.8
>> HalfSipHash-2-4  12.7     4.5     3.2
>> MD5               8.3     5.7     4.7
>
> That really was for 1024 bytes blocks, so pretty much useless for our
> discussion ?

No, they're actually quite relevant, but you have to interpret them
correctly.  I thought I explained in the text following that table,
but let me make it clearer:

To find the time to compute the SipHash of N bytes, round (N+17) up to
the next multiple of 8 bytes and multiply by the numbers above.

To find the time to compute the HalfSipHash of N bytes, round (N+9) up to
the next multiple of 4 bytes and multiply by the numbers above.

To find the time to compute the MD5 of N bytes, round (N+9) up to the
next multiple of 64 bytes and multiply by the numbers above.

It's the different rounding rules that make all the difference.  For small
input blocks, SipHash can be slower per byte yet still faster because
it hashes fewer bytes.
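
To make the rounding concrete with a hypothetical 12-byte input, using
the Ivy Bridge column from the table above:

SipHash:     12+17 = 29, rounds up to 32 bytes; 32 * 5.8 ~= 186 cycles
HalfSipHash: 12+9  = 21, rounds up to 24 bytes; 24 * 3.2 ~=  77 cycles
MD5:         12+9  = 21, rounds up to 64 bytes; 64 * 4.7 ~= 301 cycles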

> Reading your numbers last week, I thought SipHash was faster, but George
> numbers are giving the opposite impression.

SipHash annihilates the competition on 64-bit superscalar hardware.
SipHash dominates the field on 64-bit in-order hardware.
SipHash wins easily on 32-bit hardware *with enough registers*.
On register-starved 32-bit machines, it really struggles.

As I explained, in that last case, SipHash barely wins at all.
(On a P4, it actually *loses* to MD5, not that anyone cares.  Running
on a P4 and caring about performance are mutually exclusive.)


Re: HalfSipHash Acceptable Usage

2016-12-21 Thread George Spelvin
Linus wrote:
>> How much does kernel_fpu_begin()/kernel_fpu_end() cost?
>
> It's now better than it used to be, but it's absolutely disastrous
> still. We're talking easily many hundreds of cycles. Under some loads,
> thousands.

I think I've been thoroughly dissuaded, but just to clarify one thing
that resembles a misunderstanding:

> In contrast, in reality, especially with things like "do it once or
> twice per incoming packet", you'll easily hit the absolute worst
> cases, where not only does it take a few hundred cycles to save the FP
> state, you'll then return to user space in between packets, which
> triggers the slow-path return code and reloads the FP state, which is
> another few hundred cycles plus.

Everything being discussed is per-TCP-connection overhead, *not* per
packet.  (Twice for outgoing connections, because one is to generate
the ephemeral port number.)

I know you know this, but I don't want anyone spectating to be confused
about it.


Re: HalfSipHash Acceptable Usage

2016-12-21 Thread George Spelvin
Actually, DJB just made a very relevant suggestion.

As I've mentioned, the 32-bit performance problems are an x86-specific
problem.  ARM does very well, and other processors aren't bad at all.

SipHash fits very nicely (and runs very fast) in the MMX registers.

They're 64 bits, and there are 8 of them, so the integer registers can
be reserved for pointers and loop counters and all that.  And there's
reference code available.

How much does kernel_fpu_begin()/kernel_fpu_end() cost?

Although there are a lot of pre-MMX x86es in embedded control applications,
I don't think anyone is worried about their networking performance.
(Specifically, all of this affects only connection setup, not throughput 
on established connections.)


Re: HalfSipHash Acceptable Usage

2016-12-20 Thread George Spelvin
Eric Dumazet wrote:
> On Tue, 2016-12-20 at 22:28 -0500, George Spelvin wrote:
>> Cycles per byte on 1024 bytes of data:
>>  Pentium Core 2  Ivy
>>  4   Duo Bridge
>> SipHash-2-4  38.9 8.3 5.8
>> HalfSipHash-2-4  12.7 4.5 3.2
>> MD5   8.3 5.7 4.7
>
> So definitely not faster.
> 
> 38 cycles per byte is a problem, considering IPV6 is ramping up.

As I said earlier, SipHash performance on 32-bit x86 really sucks,
because it wants an absolute minimum of 9 32-bit registers (8 for the
state plus one temporary for the rotates), and x86 has 7.

> What about SHA performance (syncookies) on P4 ?

I recompiled with -mtune=pentium4 and re-ran.  MD5 time went *up* by
0.3 cycles/byte, HalfSipHash went down by 1 cycle, and SipHash didn't
change:

Cycles per byte on 1024 bytes of data:
Pentium Core 2  Ivy
4   Duo Bridge
SipHash-2-4 38.9 8.3 5.8
HalfSipHash-2-4 11.5 4.5 3.2
MD5  8.6 5.7 4.7
SHA-1   19.0 8.0 6.8

(This is with a verbatim copy of the lib/sha1.c code; I might be
able to optimize it with some asm hackery.)

Anyway, you see why we were looking longingly at HalfSipHash.


In fact, I have an idea.  Allow me to make the following concrete
suggestion for using HalfSipHash with 128 bits of key material:

- 64 bits are used as the key.
- The other 64 bits are used as an IV which is prepended to
  the message to be hashed.

As a matter of practical implementation, we precompute the effect
of hashing the IV and store the 128-bit HalfSipHash state, which
is used just like a 128-bit key.

Because of the way it is constructed, it is obviously no weaker than
standard HalfSipHash's 64-bit security claim.

I don't know the security of this, and it's almost certainly weaker than
128 bits, but I *hope* it's at least a few bits stronger than 64 bits.
80 would be enough to dissuade any attacker without a six-figure budget
(that's per attack, not a one-time capital investment).  96 would be
ample for our purposes.

What I do know is that it makes a brute-force attack without
significant cryptanalytic effort impossible.

To match the spec exactly, we'd need to add the 8-byte IV length to
the length byte which pads the final block, but from a security point
of view, it does not matter.  As long as we are consistent within any
single key, any unique mapping between padding byte and message length
(mod 256) is equally good.

We may choose based on implementation convenience.
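
To pin down the key-scheduling step (precomputing the effect of hashing
the IV), here is an untested sketch.  It assumes the initialization
constants and round function of the reference HalfSipHash-2-4, and
hsiphash128_setup() is a name I just made up.

#include <stdint.h>

static inline uint32_t rol32(uint32_t w, unsigned s)
{
        return w << s | w >> (32 - s);
}

#define HSIPROUND(v0, v1, v2, v3) do {                                  \
        v0 += v1; v1 = rol32(v1, 5);  v1 ^= v0; v0 = rol32(v0, 16);     \
        v2 += v3; v3 = rol32(v3, 8);  v3 ^= v2;                         \
        v0 += v3; v3 = rol32(v3, 7);  v3 ^= v0;                         \
        v2 += v1; v1 = rol32(v1, 13); v1 ^= v2; v2 = rol32(v2, 16);     \
} while (0)

/*
 * Absorb the 64-bit IV (two words, two rounds each) starting from the
 * standard 64-bit-keyed initial state, and store the resulting 128-bit
 * state.  Message hashing later resumes from state[], exactly as if
 * the IV had been prepended to the message.
 */
void hsiphash128_setup(uint32_t state[4], const uint32_t key[2],
                       const uint32_t iv[2])
{
        uint32_t v0 = key[0];
        uint32_t v1 = key[1];
        uint32_t v2 = 0x6c796765 ^ key[0];
        uint32_t v3 = 0x74656462 ^ key[1];
        int i;

        for (i = 0; i < 2; i++) {
                v3 ^= iv[i];
                HSIPROUND(v0, v1, v2, v3);
                HSIPROUND(v0, v1, v2, v3);
                v0 ^= iv[i];
        }
        state[0] = v0; state[1] = v1; state[2] = v2; state[3] = v3;
}

(If we want to match the spec's padding exactly, the final length byte
would then count the IV's 8 bytes as well, per the note just above.)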

(Also note my earlier comments about when it is okay to omit the padding
length byte entirely: any time all the data to be hashed with a given
key is fixed in format or self-delimiting (e.g. null-terminated).
This applies to many of the networking uses.)


Re: HalfSipHash Acceptable Usage

2016-12-20 Thread George Spelvin
> I do not see why SipHash, if faster than MD5 and more secure, would be a
> problem.

Because on 32-bit x86, it's slower.

Cycles per byte on 1024 bytes of data:
Pentium Core 2  Ivy
4   Duo Bridge
SipHash-2-4 38.9 8.3 5.8
HalfSipHash-2-4 12.7 4.5 3.2
MD5  8.3 5.7 4.7

SipHash is more parallelizable and runs faster on superscalar processors,
but MD5 is optimized for 2000-era processors, and is faster on them than
HalfSipHash even.

Now, in the applications we care about, we're hashing short blocks, and
SipHash has the advantage that it can hash less than 64 bytes.  But it
also pays a penalty on short blocks for the finalization, equivalent to
two words (16 bytes) of input.

It turns out that on both Ivy Bridge and Core 2 Duo, the crossover happens
between 23 (SipHash is faster) and 24 (MD5 is faster) bytes of input.

This is assuming you're adding the 1 byte of length padding to SipHash's
input, so 24 bytes pads to 4 64-bit words, which makes 2*4+4 = 12 rounds,
vs. one block for MD5.  (MD5 takes a similar jump between 55 and 56 bytes.)
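
For anyone checking the arithmetic, the SipHash-2-4 round count behind
that crossover is just:

/* SipHash-2-4 over an n-byte message: the 1-byte length pad brings it
 * to floor(n/8)+1 words, at 2 rounds each, plus 4 finalization rounds. */
static unsigned siphash24_rounds(unsigned n)
{
        return 2 * (n / 8 + 1) + 4;
}

siphash24_rounds(23) == 10 while siphash24_rounds(24) == 12, which is
where the extra word shows up; MD5 stays at one block until 56 bytes.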

On a P4, SipHash is *never* faster; it takes 2.5x longer than MD5 on a
12-byte block (an IPv4 address/port pair).

This is why there was discussion of using HalfSipHash on these machines.
(On a P4, the HalfSipHash/MD5 crossover is somewhere between 24 and 31
bytes; I haven't benchmarked every possible size.)


Re: HalfSipHash Acceptable Usage

2016-12-20 Thread George Spelvin
Theodore Ts'o wrote:
> On Mon, Dec 19, 2016 at 06:32:44PM +0100, Jason A. Donenfeld wrote:
>> 1) Anything that requires actual long-term security will use
>> SipHash2-4, with the 64-bit output and the 128-bit key. This includes
>> things like TCP sequence numbers. This seems pretty uncontroversial to
>> me. Seem okay to you?

> Um, why do TCP sequence numbers need long-term security?  So long as
> you rekey every 5 minutes or so, TCP sequence numbers don't need any
> more security than that, since even if you break the key used to
> generate initial sequence numbers even a minute or two later, any
> pending TCP connections will have timed out long before.
> 
> See the security analysis done in RFC 6528[1], where among other
> things, it points out why MD5 is acceptable with periodic rekeying,
> although there is the concern that this could break certain heuristics
> used when establishing new connections during the TIME-WAIT state.

Because we don't rekey TCP sequence numbers, ever.  See commit
6e5714eaf77d79ae1c8b47e3e040ff5411b717ec

To rekey them requires dividing the sequence number base into a "random"
part and some "generation" msbits.  While we can do better than the
previous 8+24 split (I'd suggest 4+28 or 3+29), a 2-bit generation field is tricky, and
1 generation bit isn't enough.

So while it helps in the long term, it reduces the security offered by
the random part in the short term.  (If I know 4 bits of your ISN,
I only need to send 256 MB to hit your TCP window.)

At the time, I objected, and suggested doing two hashes, with a fixed
32-bit base plus a split rekeyed portion, but that was vetoed on the
grounds of performance.

On further consideration, the fixed base doesn't help much.
(Details below for anyone that cares.)



Suppose we let the TCP initial sequence number be:

(Hash(<saddr, daddr, sport, dport>, fixed_key) & 0xffffffff) +
(i << 28) + (Hash(<saddr, daddr, sport, dport>, key[i]) & 0x0fffffff) +
(current_time_in_nanoseconds / 64)
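
Or, rendered as code.  hash32() below is only a placeholder mixer so the
sketch compiles; the real proposal would use SipHash truncated to 32 bits.

#include <stdint.h>

struct conn_tuple { uint32_t saddr, daddr; uint16_t sport, dport; };

/* Placeholder for a keyed hash of the endpoints; NOT a secure function. */
static uint32_t hash32(const struct conn_tuple *t, uint64_t key)
{
        uint64_t h = key ^ t->saddr ^ ((uint64_t)t->daddr << 32);

        h ^= ((uint64_t)t->sport << 16) ^ t->dport;
        h *= 0x9e3779b97f4a7c15ULL;
        return (uint32_t)(h >> 32);
}

static uint32_t example_isn(const struct conn_tuple *t, uint64_t fixed_key,
                            const uint64_t *key, unsigned i, uint64_t now_ns)
{
        return hash32(t, fixed_key) +                   /* fixed 32-bit base */
               ((uint32_t)i << 28) +                    /* 4-bit generation  */
               (hash32(t, key[i]) & 0x0fffffff) +       /* 28 bits, rekeyed  */
               (uint32_t)(now_ns / 64);                 /* timestamp         */
}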

It's not hugely difficult to mount an effective attack against a
64-bit fixed_key.

As an attacker, I can ask the target to send me these numbers for dstPort
values I control and other values I know.  I can (with high probability)
detect the large jumps when the generation changes, so I can make a
significant number of queries with the same generation.  After 23-ish
queries, I have enough information to identify a 64-bit fixed_key.

I don't know the current generation counter "i", but I know it's the
same for all my queries, so for any two queries, the maximum difference
between the 28-bit hash values is 29 bits.  (We can also add a small
margin to allow for timing uncertainty, but that's even less.)

So if I guess a fixed key, hash my known plaintexts with that guess,
subtract the ciphertexts from the observed sequence numbers, and the
difference between the remaining (unknown) 28-bit hash values plus
timestamps exceeds what's possible, my guess is wrong.
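
A sketch of the consistency test just described (ignoring wraparound
mod 2^32 for brevity; all names are mine):

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * residual[k] = observed_ISN[k] - Hash(query[k], guessed_fixed_key).
 * If the guess is right, every residual is (i << 28) + a 28-bit hash +
 * a small timestamp, with the same i, so any two of them differ by less
 * than about 2^29 (plus a small timing margin); a wider spread rules
 * the guessed key out.
 */
static bool fixed_key_guess_consistent(const uint32_t *residual, size_t n)
{
        uint32_t lo = residual[0], hi = residual[0];
        size_t k;

        for (k = 1; k < n; k++) {
                if (residual[k] < lo)
                        lo = residual[k];
                if (residual[k] > hi)
                        hi = residual[k];
        }
        return hi - lo < (UINT32_C(1) << 29);
}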

I can then repeat with additional known plaintexts, reducing the space
of admissible keys by about 3 bits each time.

Assuming I can rent GPU horsepower from a bitcoin miner to do this in a
reasonable period of time, after 22 known plaintext differences, I have
uniquely identified the key.

Of course, in practice I'd do a first pass with maybe 6 plaintexts
on the GPU, and then deal with the candidates found in a second pass.
But either way, it's about 2.3 SipHash evaluations per key tested.
As I noted earlier, a bitcoin blockchain block, worth 25 bitcoins,
currently costs 2^71 evaluations of SHA-2 (2^70 evaluations of double
SHA-2), and that's accomplished every 10 minutes, so this is definitely
practical.


RE: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-19 Thread George Spelvin
David Laight wrote:
> From: George Spelvin
...
>> uint32_t
>> hsiphash24(char const *in, size_t len, uint32_t const key[2])
>> {
>>  uint32_t c = key[0];
>>  uint32_t d = key[1];
>>  uint32_t a = 0x6c796765 ^ 0x736f6d65;
>>  uint32_t b = d ^ 0x74656462 ^ 0x646f7261;

> I've not looked closely, but is that (in some sense) duplicating
> the key length?
> So you could set a = key[2] and b = key[3] and still have an
> working hash - albeit not exactly the one specified.

That's tempting, but not necessarily effective.  (A similar unsuccessful
idea can be found in discussions of "DES with independent round keys".
Or see the design discussion of Salsa20 and the constants in its input.)

You can increase the key size, but that might not increase the *security*
any.

The big issue is that there are a *lot* of square root attacks in
cryptanalysis.  Because SipHash's state is twice the size of the key,
such an attack will have the same complexity as key exhaustion and need
not be considered.  To make a stronger security claim, you need to start
working through them all and show that they don't apply.

For SipHash in particular, an important property is asymmetry of the
internal state.  That's what duplicating the key with XORs guarantees.
If the two halves of the state end up identical, the mixing is much
weaker.

Now the probability of ending up in a "mirror state" is the square
root of the state size (1 in 2^64 for HalfSipHash's 128-bit state),
which is the same probability as guessing a key, so it's not a
problem that has to be considered when making a 64-bit security claim.

But if you want a higher security level, you have to think about
what can happen.

That said, I have been thinking very hard about

a = c ^ 0x48536970; /* 'HSip' */
d = key[2];

By guaranteeing that a and c are different, we get the desired
asymmetry, and the XOR of b and d is determined by the first word of
the message anyway, so this isn't weakening anything.

96 bits is far beyond the reach of any brute-force attack, and if a
more sophisticated 64-bit attack exists, it's at least out of the reach
of the script kiddies, and will almost certainly have a non-negligible
constant factor and more limits in when it can be applied.
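
Spelled out against the hsiphash24() prologue quoted above, that would
initialize the state like this (my reading of where key[1] ends up; a
thought experiment, not a spec):

#include <stdint.h>

void hsiphash96_init(uint32_t v[4], const uint32_t key[3])
{
        uint32_t c = key[0];
        uint32_t d = key[2];                    /* third key word */
        uint32_t a = c ^ 0x48536970;            /* 'HSip': forces a != c */
        uint32_t b = key[1] ^ 0x74656462 ^ 0x646f7261;

        v[0] = a; v[1] = b; v[2] = c; v[3] = d;
}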

> Is it worth using the 32bit hash for IP addresses on 64bit systems that
> can't do misaligned accesses?

Not a good idea.  To hash 64 bits of input:

* Full SipHash has to do two loads, a shift, an or, and two rounds of mixing.
* HalfSipHash has to do a load, two rounds, another load, and two more rounds.

In other words, in addition to being less secure, it's half the speed.  

Also, what systems are you thinking about?  x86, ARMv8, PowerPC, and
S390 (and ia64, if anyone cares) all handle unaligned loads.  MIPS has
efficient support.  Alpha and HPPA are for retrocomputing fans, not
people who care about performance.

So you're down to SPARC.  Which conveniently has the same maintainer as
the networking code, so I figure DaveM can take care of that himself. :-)


Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-17 Thread George Spelvin
To follow up on my comments that your benchmark results were peculiar,
here's my benchmark code.

It just computes the hash of all n*(n+1)/2 possible non-empty substrings
of a buffer of n (called "max" below) bytes.  "cpb" is "cycles per byte".

(The average length is (n+2)/3, c.f. https://oeis.org/A000292)

On x86-32, HSipHash is asymptotically twice the speed of SipHash,
rising to 2.5x for short strings:

SipHash/HSipHash benchmark, sizeof(long) = 4
 SipHash: max=   4 cycles= 10495 cpb=524.7500 (sum=47a4f5554869fa97)
HSipHash: max=   4 cycles=  3400 cpb=170. (sum=146a863e)
 SipHash: max=   8 cycles= 24468 cpb=203.9000 (sum=21c41a86355affcc)
HSipHash: max=   8 cycles=  9237 cpb= 76.9750 (sum=d3b5e0cd)
 SipHash: max=  16 cycles= 94622 cpb=115.9583 (sum=26d816b72721e48f)
HSipHash: max=  16 cycles= 34499 cpb= 42.2782 (sum=16bb7475)
 SipHash: max=  32 cycles=418767 cpb= 69.9811 (sum=dd5a97694b8a832d)
HSipHash: max=  32 cycles=156695 cpb= 26.1857 (sum=eed00fcb)
 SipHash: max=  64 cycles=   2119152 cpb= 46.3101 (sum=a2a725aecc09ed00)
HSipHash: max=  64 cycles=   1008678 cpb= 22.0428 (sum=99b9f4f)
 SipHash: max= 128 cycles=  12728659 cpb= 35.5788 (sum=420878cd20272817)
HSipHash: max= 128 cycles=   5452931 cpb= 15.2419 (sum=f1f4ad18)
 SipHash: max= 256 cycles=  38931946 cpb= 13.7615 (sum=e05dfb28b90dfd98)
HSipHash: max= 256 cycles=  13807312 cpb=  4.8805 (sum=ceeafcc1)
 SipHash: max= 512 cycles= 205537380 cpb=  9.1346 (sum=7d129d4de145fbea)
HSipHash: max= 512 cycles= 103420960 cpb=  4.5963 (sum=7f15a313)
 SipHash: max=1024 cycles=1540259472 cpb=  8.5817 (sum=cca7cbdc778ca8af)
HSipHash: max=1024 cycles= 796090824 cpb=  4.4355 (sum=d8f3374f)

On x86-64, SipHash is consistently faster, asymptotically approaching 2x
for long strings:

SipHash/HSipHash benchmark, sizeof(long) = 8
 SipHash: max=   4 cycles=  2642 cpb=132.1000 (sum=47a4f5554869fa97)
HSipHash: max=   4 cycles=  2498 cpb=124.9000 (sum=146a863e)
 SipHash: max=   8 cycles=  5270 cpb= 43.9167 (sum=21c41a86355affcc)
HSipHash: max=   8 cycles=  7140 cpb= 59.5000 (sum=d3b5e0cd)
 SipHash: max=  16 cycles= 19950 cpb= 24.4485 (sum=26d816b72721e48f)
HSipHash: max=  16 cycles= 23546 cpb= 28.8554 (sum=16bb7475)
 SipHash: max=  32 cycles= 80188 cpb= 13.4004 (sum=dd5a97694b8a832d)
HSipHash: max=  32 cycles=101218 cpb= 16.9148 (sum=eed00fcb)
 SipHash: max=  64 cycles=373286 cpb=  8.1575 (sum=a2a725aecc09ed00)
HSipHash: max=  64 cycles=535568 cpb= 11.7038 (sum=99b9f4f)
 SipHash: max= 128 cycles=   2075224 cpb=  5.8006 (sum=420878cd20272817)
HSipHash: max= 128 cycles=   3336820 cpb=  9.3270 (sum=f1f4ad18)
 SipHash: max= 256 cycles=  14276278 cpb=  5.0463 (sum=e05dfb28b90dfd98)
HSipHash: max= 256 cycles=  28847880 cpb= 10.1970 (sum=ceeafcc1)
 SipHash: max= 512 cycles=  50135180 cpb=  2.2281 (sum=7d129d4de145fbea)
HSipHash: max= 512 cycles=  86145916 cpb=  3.8286 (sum=7f15a313)
 SipHash: max=1024 cycles= 334111900 cpb=  1.8615 (sum=cca7cbdc778ca8af)
HSipHash: max=1024 cycles= 640432452 cpb=  3.5682 (sum=d8f3374f)


Here's the code; compile with -DSELFTEST.  (The main purpose of
printing the sum is to prevent dead code elimination.)


#if SELFTEST
#include <stdint.h>
#include <stdio.h>

static inline uint64_t rol64(uint64_t word, unsigned int shift)
{
        return word << shift | word >> (64 - shift);
}

static inline uint32_t rol32(uint32_t word, unsigned int shift)
{
        return word << shift | word >> (32 - shift);
}

static inline uint64_t get_unaligned_le64(void const *p)
{
        return *(uint64_t const *)p;
}

static inline uint32_t get_unaligned_le32(void const *p)
{
        return *(uint32_t const *)p;
}

static inline uint64_t le64_to_cpup(uint64_t const *p)
{
        return *p;
}

static inline uint32_t le32_to_cpup(uint32_t const *p)
{
        return *p;
}


#else
#include <linux/bitops.h>       /* For rol64 */
#include <linux/types.h>
#include <asm/byteorder.h>
#include <asm/unaligned.h>
#endif

/* The basic ARX mixing function, taken from Skein */
#define SIP_MIX(a, b, s) ((a) += (b), (b) = rol64(b, s), (b) ^= (a))

/*
 * The complete SipRound.  Note that, when unrolled twice like below,
 * the 32-bit rotates drop out on 32-bit machines.
 */
#define SIP_ROUND(a, b, c, d) \
(SIP_MIX(a, b, 13), SIP_MIX(c, d, 16), (a) = rol64(a, 32), \
 SIP_MIX(c, b, 17), SIP_MIX(a, d, 21), (c) = rol64(c, 32))

/*
 * This is rolled up more than most implementations, resulting in about
 * 55% of the code size.  Speed is a few percent slower.  A crude benchmark
 * (for (i=1; i <= max; i++) for (j = 0; j < 4096-i; j++) hash(buf+j, i);)
 * produces the following timings (in usec):
 *
 *              i386    i386    i386    x86_64  x86_64  x86_64  x86_64
 * Length       small   unroll  halfmd4 small   unroll  halfmd4 teahash
 * 1..4         1069    1029    1608    195     160     399     690
 * 1..8         2483    2381    3851    410     360     988     1659
 * 1..12        4303    4152    6207    690     618     1642    2690
 * 1..16        6122    5931

Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-17 Thread George Spelvin
BTW, here's some SipHash code I wrote for Linux a while ago.

My target application was ext4 directory hashing, resulting in different
implementation choices, although I still think that a rolled-up
implementation like this is reasonable.  Reducing I-cache impact speeds
up the calling code.

One thing I'd like to suggest you steal is the way it handles the
fetch of the final partial word.  It's a lot smaller and faster than
an 8-way case statement.


#include <linux/bitops.h>       /* For rol64 */
#include <linux/types.h>
#include <asm/byteorder.h>
#include <asm/unaligned.h>

/* The basic ARX mixing function, taken from Skein */
#define SIP_MIX(a, b, s) ((a) += (b), (b) = rol64(b, s), (b) ^= (a))

/*
 * The complete SipRound.  Note that, when unrolled twice like below,
 * the 32-bit rotates drop out on 32-bit machines.
 */
#define SIP_ROUND(a, b, c, d) \
(SIP_MIX(a, b, 13), SIP_MIX(c, d, 16), (a) = rol64(a, 32), \
 SIP_MIX(c, b, 17), SIP_MIX(a, d, 21), (c) = rol64(c, 32))

/*
 * This is rolled up more than most implementations, resulting in about
 * 55% of the code size.  Speed is a few percent slower.  A crude benchmark
 * (for (i=1; i <= max; i++) for (j = 0; j < 4096-i; j++) hash(buf+j, i);)
 * produces the following timings (in usec):
 *
 *              i386    i386    i386    x86_64  x86_64  x86_64  x86_64
 * Length       small   unroll  halfmd4 small   unroll  halfmd4 teahash
 * 1..4         1069    1029    1608    195     160     399     690
 * 1..8         2483    2381    3851    410     360     988     1659
 * 1..12        4303    4152    6207    690     618     1642    2690
 * 1..16        6122    5931    8668    968     876     2363    3786
 * 1..20        8348    8137    11245   1323    1185    3162    5567
 * 1..24        10580   10327   13935   1657    1504    4066    7635
 * 1..28        13211   12956   16803   2069    1871    5028    9759
 * 1..32        15843   15572   19725   2470    2260    6084    11932
 * 1..36        18864   18609   24259   2934    2678    7566    14794
 * 1..1024      5890194 6130242 10264816 881933 881244  3617392 7589036
 *
 * The performance penalty is quite minor, decreasing for long strings,
 * and it's significantly faster than half_md4, so I'm going for the
 * I-cache win.
 */
uint64_t
siphash24(char const *in, size_t len, uint32_t const seed[4])
{
        uint64_t a = 0x736f6d6570736575;        /* somepseu */
        uint64_t b = 0x646f72616e646f6d;        /* dorandom */
        uint64_t c = 0x6c7967656e657261;        /* lygenera */
        uint64_t d = 0x7465646279746573;        /* tedbytes */
        uint64_t m = 0;
        uint8_t padbyte = len;

        /*
         * Mix in the 128-bit hash seed.  This is in a format convenient
         * to the ext3/ext4 code.  Please feel free to adapt the
         * */
        if (seed) {
                m = seed[2] | (uint64_t)seed[3] << 32;
                b ^= m;
                d ^= m;
                m = seed[0] | (uint64_t)seed[1] << 32;
                /* a ^= m; is done in loop below */
                c ^= m;
        }

        /*
         * By using the same SipRound code for all iterations, we
         * save space, at the expense of some branch prediction.  But
         * branch prediction is hard because of variable length anyway.
         */
        len = len/8 + 3;        /* Now number of rounds to perform */
        do {
                a ^= m;

                switch (--len) {
                unsigned bytes;

                default:        /* Full words */
                        d ^= m = get_unaligned_le64(in);
                        in += 8;
                        break;
                case 2:         /* Final partial word */
                        /*
                         * We'd like to do one 64-bit fetch rather than
                         * mess around with bytes, but reading past the end
                         * might hit a protection boundary.  Fortunately,
                         * we know that protection boundaries are aligned,
                         * so we can consider only three cases:
                         * - The remainder occupies zero words
                         * - The remainder fits into one word
                         * - The remainder straddles two words
                         */
                        bytes = padbyte & 7;

                        if (bytes == 0) {
                                m = 0;
                        } else {
                                unsigned offset = (unsigned)(uintptr_t)in & 7;

                                if (offset + bytes <= 8) {
                                        m = le64_to_cpup((uint64_t const *)
                                                         (in - offset));
                                        m >>= 8*offset;
                                } else {
                                        m = get_unaligned_le64(in);
                                }
                                m &= ((uint64_t)1 << 

Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-16 Thread George Spelvin
> I already did this. Check my branch.

Do you think it should return "u32" (as you currently have it) or
"unsigned long"?  I thought the latter, since it doesn't cost any
more and makes more sense.

> I wonder if this could also lead to a similar aliasing
> with arch_get_random_int, since I'm pretty sure all rdrand-like
> instructions return native word size anyway.

Well, Intel's can return 16, 32 or 64 bits, and it makes a
small difference with reseed scheduling.

>> - Ted, Andy Lutorminski and I will try to figure out a construction of
>>   get_random_long() that we all like.

> And me, I hope... No need to make this exclusive.

Gaah, engage brain before fingers.  That was so obvious I didn't say
it, and the result came out sounding extremely rude.

A better (but longer) way to write it would be "I'm sorry that I, Ted,
and Andy are all arguing with you and each other about how to do this
and we can't finalize this part yet".


Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-16 Thread George Spelvin
> 64-bit security for an RNG is not reasonable even with rekeying. No no
> no. Considering we already have a massive speed-up here with the
> secure version, there's zero reason to start weakening the security
> because we're trigger happy with our benchmarks. No no no.

Just to clarify, I was discussing the idea with Ted (who's in charge of
the whole thing, not me), not trying to make any sort of final decision
on the subject.  I need to look at the various users (46 non-trivial ones
for get_random_int, 15 for get_random_long) and see what their security
requirements actually are.

I'm also trying to see if HalfSipHash can be used in a way that gives
slightly more than 64 bits of effective security.

The problem is that the old MD5-based transform had unclear, but
obviously ample, security.  There were 64 bytes of global secret and
16 chaining bytes per CPU.  Adapting SipHash (even the full version)
takes more thinking.

An actual HalfSipHash-based equivalent to the existing code would be:

#define RANDOM_INT_WORDS (64 / sizeof(long))    /* 16 or 8 */

static u32 random_int_secret[RANDOM_INT_WORDS]
        cacheline_aligned __read_mostly;
static DEFINE_PER_CPU(unsigned long[4], get_random_int_hash)
        __aligned(sizeof(unsigned long));

unsigned long get_random_long(void)
{
        unsigned long *hash = get_cpu_var(get_random_int_hash);
        unsigned long v0 = hash[0], v1 = hash[1], v2 = hash[2], v3 = hash[3];
        int i;

        /* This could be improved, but it's equivalent */
        v0 += current->pid + jiffies + random_get_entropy();

        for (i = 0; i < RANDOM_INT_WORDS; i++) {
                v3 ^= random_int_secret[i];
                HSIPROUND;
                HSIPROUND;
                v0 ^= random_int_secret[i];
        }
        /* To be equivalent, we *don't* finalize the transform */

        hash[0] = v0; hash[1] = v1; hash[2] = v2; hash[3] = v3;
        put_cpu_var(get_random_int_hash);

        return v0 ^ v1 ^ v2 ^ v3;
}

I don't think there's a 2^64 attack on that.

But 64 bytes of global secret is ridiculous if the hash function
doesn't require that minimum block size.  It'll take some thinking.


The advice I'd give now is:
- Implement
unsigned long hsiphash(const void *data, size_t len, const unsigned long key[2])
  .. as SipHash on 64-bit (maybe SipHash-1-3, still being discussed) and
  HalfSipHash on 32-bit.
- Document when it may or may not be used carefully.
- #define get_random_int (unsigned)get_random_long
- Ted, Andy Lutorminski and I will try to figure out a construction of
  get_random_long() that we all like.


('scuse me for a few hours, I have some unrelated things I really *should*
be working on...)


Re: [kernel-hardening] Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-16 Thread George Spelvin
An idea I had which might be useful:

You could perhaps save two rounds in siphash_*u64.

The final word with the length (called "b" in your implementation)
only needs to be there if the input is variable-sized.

If every use of a given key is of a fixed-size input, you don't need
a length suffix.  When the input is an even number of words, that can
save you two rounds.

This requires an audit of callers (e.g. you have to use different
keys for IPv4 and IPv6 ISNs), but can save time.

(This is crypto 101; search "MD-strengthening" or see the remark on
p. 101 on Damgaard's 1989 paper "A design principle for hash functions" at
http://saluc.engr.uconn.edu/refs/algorithms/hashalg/damgard89adesign.pdf
but I'm sure that Ted, Jean-Philippe, and/or DJB will confirm if you'd
like.)
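
To make the saving concrete, here is a fixed-two-word SipHash-2-4 with
the length block dropped (standard SipHash keying, untested; SIP_MIX and
SIP_ROUND are copied from the siphash24() code quoted elsewhere in this
thread):

#include <stdint.h>

static inline uint64_t rol64(uint64_t word, unsigned int shift)
{
        return word << shift | word >> (64 - shift);
}

#define SIP_MIX(a, b, s) ((a) += (b), (b) = rol64(b, s), (b) ^= (a))
#define SIP_ROUND(a, b, c, d) \
        (SIP_MIX(a, b, 13), SIP_MIX(c, d, 16), (a) = rol64(a, 32), \
         SIP_MIX(c, b, 17), SIP_MIX(a, d, 21), (c) = rol64(c, 32))

/*
 * SipHash-2-4 of exactly two 64-bit words, with the trailing length
 * block omitted: two fewer SipRounds than the general-purpose version.
 * Only usable when *every* input hashed under this key is two words.
 */
uint64_t siphash24_fixed_2u64(uint64_t m0, uint64_t m1, const uint64_t key[2])
{
        uint64_t a = 0x736f6d6570736575ULL ^ key[0];
        uint64_t b = 0x646f72616e646f6dULL ^ key[1];
        uint64_t c = 0x6c7967656e657261ULL ^ key[0];
        uint64_t d = 0x7465646279746573ULL ^ key[1];

        d ^= m0; SIP_ROUND(a, b, c, d); SIP_ROUND(a, b, c, d); a ^= m0;
        d ^= m1; SIP_ROUND(a, b, c, d); SIP_ROUND(a, b, c, d); a ^= m1;

        /* No length word here: that's the two rounds saved. */
        c ^= 0xff;
        SIP_ROUND(a, b, c, d); SIP_ROUND(a, b, c, d);
        SIP_ROUND(a, b, c, d); SIP_ROUND(a, b, c, d);

        return a ^ b ^ c ^ d;
}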

Jason A. Donenfeld wrote:
> Oh, okay, that is exactly what I thought was going on. I just thought
> you were implying that jiffies could be moved inside the hash, which
> then confused my understanding of how things should be. In any case,
> thanks for the explanation.

No, the rekeying procedure is cleverer.

The thing is, all that matters is that the ISN increments fast enough,
but not wrap too soon.

It *is* permitted to change the random base, as long as it only
increases, and slower than the timestamp does.

So what you do is every few minutes, you increment the high 4 bits of the
random base and change the key used to generate the low 28 bits.

The base used for any particular host might change from 0x10000000
to 0x2fffffff, or from 0x1fffffff to 0x20000000, but either way, it's
increasing, and not too fast.

This has the downside that an attacker can see 4 bits of the base,
so only needs to send 2^28 = 256 MB to flood the connection,
but the upside that the key used to generate the low bits changes
faster than it can be broken.



Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-16 Thread George Spelvin
> What should we do with get_random_int() and get_random_long()?  In
> some cases it's being used in performance sensitive areas, and where
> anti-DoS protection might be enough.  In others, maybe not so much.

This is tricky.  The entire get_random_int() structure is an abuse of
the hash function and will need to be thoroughly rethought to convert
it to SipHash.  Remember, SipHash's security goals are very different
from MD5, so there's no obvious way to do the conversion.

(It's *documented* as "not cryptographically secure", but we know
where that goes.)

> If we rekeyed the secret used by get_random_int() and
> get_random_long() frequently (say, every minute or every 5 minutes),
> would that be sufficient for current and future users of these
> interfaces?

Remembering that on "real" machines it's full SipHash, then I'd say that
64-bit security + rekeying seems reasonable.

The question is, the idea has recently been floated to make hsiphash =
SipHash-1-3 on 64-bit machines.  Is *that* okay?


The annoying thing about the currently proposed patch is that the *only*
chaining is the returned value.  What I'd *like* to do is the same
pattern as we do with md5, and remember v[0..3] between invocations.
But there's no partial SipHash primitive; we only get one word back.

Even
*chaining += ret = siphash_3u64(...)

would be an improvement.

Although we could do something like

c0 = chaining[0];
chaining[0] = c1 = chaining[1];

ret = hsiphash(c0, c1, ...)

chaining[1] = c0 + ret;


Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-16 Thread George Spelvin
Jason A. Donenfeld wrote:
> I saw that jiffies addition in there and was wondering what it was all
> about. It's currently added _after_ the siphash input, not before, to
> keep with how the old algorithm worked. I'm not sure if this is
> correct or if there's something wrong with that, as I haven't studied
> how it works. If that jiffies should be part of the siphash input and
> not added to the result, please tell me. Otherwise I'll keep things
> how they are to avoid breaking something that seems to be working.

Oh, geez, I didn't realize you didn't understand this code.

Full details at
https://en.wikipedia.org/wiki/TCP_sequence_prediction_attack

But yes, the sequence number is supposed to be (random base) + (timestamp).
In the old days before Canter & Siegel when the internet was a nice place,
people just used a counter that started at boot time.

But then someone observed that I can start a connection to host X,
see the sequence number it gives back to me, and thereby learn the
sequence number it's using on its connections to host Y.

And I can use that to inject forged data into an X-to-Y connection,
without ever seeing a single byte of the traffic!  (If I *can* observe
the traffic, of course, none of this makes the slightest difference.)

So the random base was made a keyed hash of the endpoint identifiers.
(Practically only the hosts matter, but generally the ports are thrown
in for good measure.)  That way, the ISN that host X sends to me
tells me nothing about the ISN it's using to talk to host Y.  Now the
only way to inject forged data into the X-to-Y connection is to
send 2^32 bytes, which is a little less practical.


Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-16 Thread George Spelvin
Tom Herbert wrote:
> Tested this. Distribution and avalanche effect are still good. Speed
> wise I see about a 33% improvement over siphash (20 nsecs/op versus 32
> nsecs). That's about 3x of jhash speed (7 nsecs). So that might be closer
> to a more palatable replacement for jhash. Do we lose any security
> advantages with halfsiphash?

What are you testing on?  And what input size?  And does "33% improvement"
mean 4/3 the rate and 3/4 the time?  Or 2/3 the time and 3/2 the rate?

These are very odd results.  On a 64-bit machine, SipHash should be the
same speed per round, and faster because it hashes more data per round.
(Unless you're hitting some unexpected cache/decode effect due to REX
prefixes.)

On a 32-bit machine (other than ARM, where your results might make sense,
or maybe if you're hashing large amounts of data), the difference should
be larger.

And yes, there is a *significant* security loss.  SipHash is 128 bits
("don't worry about it").  hsiphash is 64 bits, which is known breakable
("worry about it"), so we have to do a careful analysis of the cost of
a successful attack.

As mentioned in the e-mails that just flew by, hsiphash is intended
*only* for 32-bit machines which bog down on full SipHash.  On all 64-bit
machines, it will be implemented as an alias for SipHash and the security
concerns will Just Go Away.

The place where hsiphash is expected to make a big difference is 32-bit
x86.  If you only see 33% difference with "gcc -m32", I'm going to be
very confused.


Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-16 Thread George Spelvin
>> On a 64-bit machine, 64-bit SipHash is *always* faster than 32-bit, and
>> should be used always.  Don't even compile the 32-bit code, to prevent
>> anyone accidentally using it, and make hsiphash an alias for siphash.

> Fascinating! Okay. So I'll alias hsiphash to siphash on 64-bit then. I
> like this arrangement.

This is a basic assumption I make in the security analysis below:
on most machines, it's 128-bit-key SipHash everywhere and we can
consider security solved.

Our analysis *only* has to consider 32-bit machines.  My big concern
is home routers, with IoT appliances coming second.  The routers have
severe hardware cost constraints (so limited CPU power), but see a lot
of traffic and need to process (NAT) it.

> That's a nice analysis. Might one conclude from that that hsiphash is
> not useful for our purposes? Or does it still remain useful for
> network facing code?

I think for attacks where the threat is a DoS, it's usable.  The point
is you only have to raise the cost to equal that of a packet flood.
(Just like in electronic warfare, the best you can possibly do is force
the enemy to use broadband jamming.)

Hash collision attacks just aren't that powerful.  The original PoC
was against an application that implemented a hard limit on hash chain
length as a DoS defense, which the attack then exploited to turn it into
a hard DoS.

>> Let me consider your second example above, "secure against local users".
>> I should dig through your patchset and find the details, but what exactly
>> are the consequences of such an attack?  Hasn't a local user already
>> got much better ways to DoS the system?

> For example, an unpriv'd user putting lots of entries in one hash
> bucket for a shared resource that's used by root, like filesystems or
> other lookup tables. If he can cause root to use more of root's cpu
> schedule budget than otherwise in a directed way, then that's a bad
> DoS.

This issue was recently discussed when we redesigned the dcache hash.
Even a successful attack doesn't slow things down all *that* much.

Before you overkill every hash table in the kernel, think about whether
it's a bigger problem than the dcache.  (Hint: it's probably not.)
There's no point armor-plating the side door when the front door was
just upgraded from screen to wood.

>> These days, 32-bit CPUs are for embedded applications: network appliances,
>> TVs, etc.  That means basically single-user.  Even phones are 64 bit.
>> Is this really a threat that needs to be defended against?

> I interpret this to indicate all the more reason to alias hsiphash to
> siphash on 64-bit, and then the problem space collapses in a clear
> way.

Yes, exactly.  

> Right. Hence the need for always using full siphash and not hsiphash
> for sequence numbers, per my earlier email to David.
>
>> I wish we could get away with 64-bit security, but given that the
>> modern internet involves attacks from NSA/Spetssvyaz/3PLA, I agree
>> it's just not enough.
>
> I take this comment to be relavent for the sequence number case.

Yes.

> For hashtables and hashtable flooding, is it still your opinion that
> we will benefit from hsiphash? Or is this final conclusion a rejection
> of hsiphash for that too? We're talking about two different use cases,
> and your email kind of interleaved both into your analysis, so I'm not
> certain so to precisely what your conclusion is for each use case. Can
> you clear up the ambiguity?

My (speaking generally; I should walk through every hash table you've
converted) opinion is that:

- Hash tables, even network-facing ones, can all use hsiphash as long
  as an attacker can only see collisions, i.e. ((H(x) ^ H(y)) & bits) ==
  0, and the consequences of a successful attack are only more collisions
  (timing); see the sketch after this list.  While the attack is only 2x
  the cost (two hashes rather than one to test a key), the knowledge of
  the collision is statistical, especially for network attackers, which
  raises the cost of guessing beyond that of an even more brute-force
  attack.
- When the hash value is directly visible (e.g. included in a network
  packet), full SipHash should be the default.
- Syncookies *could* use hsiphash, especially as there are
  two keys in there.  Not sure if we need the performance.
- For TCP ISNs, I'd prefer to use full SipHash.  I know this is
  a very hot path, and if that's a performance bottleneck,
  we can work harder on it.
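
To illustrate the first item: "can only see collisions" means the strongest
predicate an attacker can evaluate (and only indirectly, via lookup timing)
is whether two keys land in the same bucket.  A hypothetical helper,
assuming the hsiphash() interface from this patchset:

    #include <linux/siphash.h>

    /* True iff a and b fall into the same hash bucket.  An attacker never
     * sees the hash values themselves, only (statistically) this one bit. */
    static bool same_bucket(const void *a, size_t alen,
                            const void *b, size_t blen,
                            const hsiphash_key_t *key, u32 bucket_mask)
    {
            return ((hsiphash(a, alen, key) ^ hsiphash(b, blen, key))
                    & bucket_mask) == 0;
    }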

In particular, TCP ISNs *used* to rotate the key periodically,
limiting the time available to an attacker to perform an
attack before the secret goes stale and is useless.  commit
6e5714eaf77d79ae1c8b47e3e040ff5411b717ec upgraded to md5 and dropped
the key rotation.

If 2x hsiphash is faster than siphash, we could use a double-hashing
system like syncookies.  One 32-bit hash with a permanent key, summed
with a k-bit counter and a (32-k)-bit hash, where the key is rotated
(and the counter incremented) periodically.
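
A purely illustrative sketch of that double-hash scheme (every name, and
the value of k, is invented here rather than existing kernel API): the top
k bits of the ISN carry the rotation counter, and the key behind the second
hash is replaced whenever the counter is bumped.

    #include <linux/siphash.h>

    #define K_BITS 8    /* k: width of the rotation counter */

    static u32 example_isn2(const void *tuple, size_t len,
                            const hsiphash_key_t *permanent_key,
                            const hsiphash_key_t *rotating_key,
                            u32 rotations)
    {
            u32 base = hsiphash(tuple, len, permanent_key);
            u32 rot  = hsiphash(tuple, len, rotating_key) >> K_BITS;

            /* permanent hash + k-bit counter + (32-k)-bit rotating hash */
            return base + (rotations << (32 - K_BITS)) + rot;
    }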

The requirement is that the increment rate of the counter hash doesn't

Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-16 Thread George Spelvin
> It appears that hsiphash can produce either 32-bit output or 64-bit
> output, with the output length parameter as part of the hash algorithm
> in there. When I code this for my kernel patchset, I very likely will
> only implement one output length size. Right now I'm leaning toward
> 32-bit.

A 128-bit output option was added to SipHash after the initial publication;
this is just the equivalent in 32-bit.

> - Is this a reasonable choice?

Yes.

> - Are there reasons why hsiphash with 64-bit output would be
>   reasonable? Or will we be fine sticking with 32-bit output only?

Personally, I'd put in a comment saying that "there's a 64-bit output
variant that's not implemented" and punt until someone finds a need.

> With both hsiphash and siphash, the division of usage will probably become:
> - Use 64-bit output 128-bit key siphash for keyed RNG-like things,
>   such as syncookies and sequence numbers
> - Use 64-bit output 128-bit key siphash for hashtables that must
>   absolutely be secure to an extremely high bandwidth attacker, such as
>   userspace directly DoSing a kernel hashtable
> - Use 32-bit output 64-bit key hsiphash for quick hashtable functions
>   that still must be secure but do not require as large of a security
>   margin.

On a 64-bit machine, 64-bit SipHash is *always* faster than 32-bit, and
should be used always.  Don't even compile the 32-bit code, to prevent
anyone accidentally using it, and make hsiphash an alias for siphash.
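
As a sketch, the 64-bit half of the header could reduce to something like
the following (the type alias and the truncation choice here are
illustrative assumptions, not a statement of what the final
<linux/siphash.h> does):

    #if BITS_PER_LONG == 64
    /* On 64-bit, "hsiphash" is just full 128-bit-key SipHash with the
     * output truncated to 32 bits, so the security caveats below only
     * ever apply to 32-bit machines. */
    typedef siphash_key_t hsiphash_key_t;

    static inline u32 hsiphash(const void *data, size_t len,
                               const hsiphash_key_t *key)
    {
            return (u32)siphash(data, len, key);
    }
    #endif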

On a 32-bit machine, it's a much trickier case.  I'd be tempted to
use the 32-bit code always, but it needs examination.

Fortunately, the cost of brute-forcing hash functions can be fairly
exactly quantified, thanks to bitcoin miners.  It currently takes 2^70
hashes to create one bitcoin block, worth 25 bitcoins ($19,500).  Thus,
2^63 hashes cost $152.

Now, there are two factors that must be considered:
- That's a very very "wholesale" rate.  That's assuming you're doing
  large numbers of these and can put in the up-front effort designing
  silicon ASICs to do the attack.
- That's for a more difficult hash (double sha-256) than SipHash.
  That's a constant factor, but a pretty significant one.  If the wholesale
  assumption holds, that might bring the cost down another 6 or 7 bits,
  to $1-2 per break.

If you're not the NSA and limited to general-purpose silicon, let's
assume a state-of-the-art GPU (Radeon HD 7970; AMD GPUs seem to do better
than nVidia).  The bitcoin mining rate for those is about 700M hashes/second
(29.4 bits).  So 63 bits is 152502 GPU-days, divided by some factor
to account for SipHash's high speed compared to two rounds of SHA-2.
Call it 1000 GPU-days.
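
Spelling out that arithmetic with the figures above:

    \[
      \frac{\$19{,}500}{2^{70-63}} = \frac{\$19{,}500}{128} \approx \$152,
      \qquad
      \frac{2^{63}}{7\times10^{8}\,\mathrm{hash/s} \times 86{,}400\,\mathrm{s/day}}
      \approx 1.5\times10^{5}\ \text{GPU-days}.
    \]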

It's very doable, but also very non-trivial.  The question is, wouldn't
it be cheaper and easier just to do a brute-force flooding DDoS?

(This is why I wish the key size could be tweaked up to 80 bits.
That would take all these numbers out of the reasonable range.)


Let me consider your second example above, "secure against local users".
I should dig through your patchset and find the details, but what exactly
are the consequences of such an attack?  Hasn't a local user already
got much better ways to DoS the system?

The thing to remember is that we're worried only about the combination
of a *new* Linux kernel (new build or under active maintenance) and a
32-bit host.  You'd be hard-pressed to find a *single* machine fitting
that description which is hosting multiple users or VMs and is not 64-bit.

These days, 32-bit CPUs are for embedded applications: network appliances,
TVs, etc.  That means basically single-user.  Even phones are 64 bit.
Is this really a threat that needs to be defended against?


For your first case, network applications, the additional security
is definitely attractive.  Syncookies are only a DoS, but sequence
numbers are a real security issue; they can let you inject data into a
TCP connection.

Hash tables are much harder to attack.  The information you get back from
timing probes is statistical, and thus testing a key is more expensive.
With sequence numbers, a large amount (32 bits) of the hash output is
directly observable.

I wish we could get away with 64-bit security, but given that the
modern internet involves attacks from NSA/Spetssvyaz/3PLA, I agree
it's just not enough.


RE: [PATCH v5 2/4] siphash: add Nu{32,64} helpers

2016-12-16 Thread George Spelvin
Jason A. Donenfeld wrote:
> Isn't that equivalent to:
>   v0 = key[0];
>   v1 = key[1];
>   v2 = key[0] ^ (0x736f6d6570736575ULL ^ 0x646f72616e646f6dULL);
>   v3 = key[1] ^ (0x646f72616e646f6dULL ^ 0x7465646279746573ULL);

(Pre-XORing key[] with the first two constants which, if the constants
are random in the first place, can be a no-op.)  Other than the typo
in the v2 line, yes.  If the key is non-public, then you can XOR an
arbitrary constant into both halves to slightly speed up the startup.

(Other nits: you don't need to parenthesize associative operators like
xor, and the "ull" suffix is redundant here.)

> Those constants also look like ASCII strings.

They are.  The ASCII is "somepseudorandomlygeneratedbytes".

> What cryptographic analysis has been done on the values?

They're "nothing up my sleeve numbers".

They're arbitrary numbers, and almost any other values would do exactly
as well.  The main properties are:

1) They're different (particularly v0 != v2 and v1 != v3), and
2) Neither they, nor their xor, is rotationally symmetric like 0x.
   (Because SipHash is mostly rotationally symmetric, broken only by the
   interruption of the carry chain at the msbit, it helps slightly
   to break this up at the beginning.)

Those exact values only matter for portability.  If you don't need anyone
else to be able to compute matching outputs, then you could use any other
convenient constants (like the MD5 round constants).
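
For reference, a sketch of the canonical SipHash state setup being
discussed (this is the published algorithm's initialization, not a quote
of any particular kernel file); the four constants are
"somepseudorandomlygeneratedbytes" split into 8-byte words, and the third
one, "lygenera", is what the quoted v2 line should have XORed in:

    #include <linux/types.h>

    static void siphash_init_state(u64 v[4], const u64 key[2])
    {
            v[0] = key[0] ^ 0x736f6d6570736575;     /* "somepseu" */
            v[1] = key[1] ^ 0x646f72616e646f6d;     /* "dorandom" */
            v[2] = key[0] ^ 0x6c7967656e657261;     /* "lygenera" */
            v[3] = key[1] ^ 0x7465646279746573;     /* "tedbytes" */
    }

Folding the first two constants into the stored key makes the v[0] and v[1]
XORs disappear, which is the startup shortcut described above.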

