Re: [PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-15 Thread Geert Uytterhoeven
Hi George,

On Fri, Mar 15, 2019 at 4:36 AM George Spelvin  wrote:
> >> swap_bytes / swap_4byte_words / swap_8byte_words
> >> swap_bytes / swap_ints / swap_longs
> >> swap_1 / swap_4 / swap_8
> >> Pistols at dawn?
>
> On Thu, 14 Mar 2019 at 22:59:55 +0300, Andrey Abramov wrote:
> > Yes, in my opinion, swap_bytes / swap_ints / swap_longs are the
> > most readable because we have both swap_ints and swap_longs functions
> > (in one file near each other), so I don't think that there will be
> > any confusion about size.
>
> Yes, that's what I thought.  They're three related but different
> functions, suffixed _bytes, _ints, and _longs.  What could the
> difference possibly be?  And if anyone has any lingering doubts,
> the functions are right there, with exquisitely clear comments.
>
> Not to mention where they're used.  Is "is_aligned(base, size, 8)"
> remotely obscure?  Especially in context:
>
> if (is_aligned(base, size, 8))
> swap_func = swap_longs;
> else if (is_aligned(base, size, 4))
> swap_func = swap_ints;
> else
> swap_func = swap_bytes;
>
> What subtle and mysterious code.
>
> > But actually, it doesn't matter which name you take, because
> > the meaning of each, in my opinion, is obvious enough, so I don't
> > mind any of these options.
>
> I'm just amazed that this piece of bikeshedding is the most
> contentious thing about the patch series.
>
> I mean, if I'd named them:
> llanfairpwllgwyngyll()
> shravanabelagola()
> zheleznodorozhny()
> or
> peckish()
> esuriant()
> hungry()
> then yes, those would be bad names.
>
> I prefer the shorter _ints and _longs names, but this is just
> not a hill I want to die on.

Argument of least surprise: don't call something a duck if it's not
guaranteed to behave like a duck.

If I read "long", this triggers a warning flag in my head: be careful, this is
32-bit on 32-bit platforms, and 64-bit on 64-bit platforms.

There's a reason the newer I/O ioread{8,16,32} accessors use explicit
sizes, unlike the ancient x86-centric read[bwl]().

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-14 Thread George Spelvin
>> swap_bytes / swap_4byte_words / swap_8byte_words
>> swap_bytes / swap_ints / swap_longs
>> swap_1 / swap_4 / swap_8
>> Pistols at dawn?

On Thu, 14 Mar 2019 at 22:59:55 +0300, Andrey Abramov wrote:
> Yes, in my opinion, swap_bytes / swap_ints / swap_longs are the
> most readable because we have both swap_ints and swap_longs functions
> (in one file near each other), so I don't think that there will be
> any confusion about size.

Yes, that's what I thought.  They're three related but different
functions, suffixed _bytes, _ints, and _longs.  What could the
difference possibly be?  And if anyone has any lingering doubts,
the functions are right there, with exquisitely clear comments.

Not to mention where they're used.  Is "is_aligned(base, size, 8)"
remotely obscure?  Especially in context:

if (is_aligned(base, size, 8))
swap_func = swap_longs;
else if (is_aligned(base, size, 4))
swap_func = swap_ints;
else
swap_func = swap_bytes;

What subtle and mysterious code.

> But actually, it doesn't matter which name you take, because
> the meaning of each, in my opinion, is obvious enough, so I don't
> mind any of these options.

I'm just amazed that this piece of bikeshedding is the most
contentious thing about the patch series.

I mean, if I'd named them:
llanfairpwllgwyngyll()
shravanabelagola()
zheleznodorozhny()
or
peckish()
esuriant()
hungry()
then yes, those would be bad names.

I prefer the shorter _ints and _longs names, but this is just
not a hill I want to die on.


Re: [PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-14 Thread Andrey Abramov
> Pistols at dawn?

> swap_bytes
> swap_4byte_words
> swap_8byte_words
or
>  swap_bytes / swap_ints / swap_longs
>  swap_1 / swap_4 / swap_8

Yes, in my opinion, swap_bytes / swap_ints / swap_longs are the most readable 
because we have both swap_ints and swap_longs functions (in one file near each 
other), so I don't think that there will be any confusion about size. 
But actually, it doesn't matter which name you take, because the meaning
of each, in my opinion, is obvious enough, so I don't mind any of these
options.

-- 
With Best Regards,
Andrey Abramov


Re: [PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-14 Thread Andy Shevchenko
On Thu, Mar 14, 2019 at 11:53:55AM +, George Spelvin wrote:
> On Thu, 14 Mar 2019 at 11:41:26 +0100, Geert Uytterhoeven wrote:
> > On Thu, Mar 14, 2019 at 11:10 AM George Spelvin  wrote:
> >> On Sat, 09 Mar 2019 at 23:19:49 +0300, Andrey Abramov wrote:
>  How about one of:
>  swap_bytes / swap_ints / swap_longs
>  swap_1 / swap_4 / swap_8
> >>>
> >>> longs are ambiguous, so I would prefer bit-sized types.
> >>
> >> I already implemented Andrey's suggestions, which were the exact
> >> opposite of yours.
> >>
> >> Pistols at dawn?

> I just want to pick some names and move on.  Since nothing so far seems
> to satisfy everyone, I'll go with the ponderous but utterly unambiguous:

> swap_bytes
> swap_4byte_words
> swap_8byte_words

Works for me, or if someone insists, the sizes could be given in bits instead.

-- 
With Best Regards,
Andy Shevchenko




Re: [PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-14 Thread George Spelvin
On Thu, 14 Mar 2019 at 11:41:26 +0100, Geert Uytterhoeven wrote:
> On Thu, Mar 14, 2019 at 11:10 AM George Spelvin  wrote:
>> On Sat, 09 Mar 2019 at 23:19:49 +0300, Andrey Abramov wrote:
 How about one of:
 swap_bytes / swap_ints / swap_longs
 swap_1 / swap_4 / swap_8
>>>
>>> longs are ambiguous, so I would prefer bit-sized types.
>>
>> I already implemented Andrey's suggestions, which were the exact
>> opposite of yours.
>>
>> Pistols at dawn?

I didn't explain the joke because jokes aren't funny if explained,
but just in case, by suggesting a clearly ridiculous method, I was
saying "I have no idea how to resolve this conflict."

> Prepared to fix all future long vs. int bugs?

In the entire kernel?  Or just the one small source file
where these statically scoped helper functions are visible?

In case of uncertainty, the comments and the code are right there
just a few lines away from the one and only call site, so I
don't expect much confusion.

I care a lot about function names when they are exported, but in
this case we're talking about what colour to paint the *inside* of
the bike shed.

I just want to pick some names and move on.  Since nothing so far seems
to satisfy everyone, I'll go with the ponderous but utterly unambiguous:

swap_bytes
swap_4byte_words
swap_8byte_words


Re: [PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-14 Thread Geert Uytterhoeven
Hi George,

On Thu, Mar 14, 2019 at 11:10 AM George Spelvin  wrote:
> On Sat, 09 Mar 2019 at 23:19:49 +0300, Andrey Abramov wrote:
> >> How about one of:
> >> swap_bytes / swap_ints / swap_longs
> >> swap_1 / swap_4 / swap_8
> >
> > longs are ambiguous, so I would prefer bit-sized types.
>
> I already implemented Andrey's suggestions, which were the exact
> opposite of yours.
>
> Pistols at dawn?

Prepared to fix all future long vs. int bugs?

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-14 Thread George Spelvin
On Sat, 09 Mar 2019 at 23:19:49 +0300, Andrey Abramov wrote:
>> Although I'm thinking of:
>>
>> static bool __attribute_const__
>> is_aligned(const void *base, size_t size, unsigned char align)
>> {
>>  unsigned char lsbits = (unsigned char)size;
>>
>>  (void)base;
>> #ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>>  lsbits |= (unsigned char)(uintptr_t)base;
>> #endif
>>  return (lsbits & (align - 1)) == 0;
>> }
>>
>> Any preference?
> I think it would be better.

>> I find "u32s" confusing; I keep reading the "s" as "signed" rather
>> than a plural.
>>
>> How about one of:
>> swap_bytes / swap_ints / swap_longs
>> swap_1 / swap_4 / swap_8
>
> In my opinion "swap_bytes / swap_ints / swap_longs" are the most readable.


On Thu, 14 Mar 2019 at 11:29:58 +0200, Andy Shevchenko wrote:
> On Sat, Mar 09, 2019 at 03:53:41PM +, l...@sdf.org wrote:
>> static bool __attribute_const__
>> is_aligned(const void *base, size_t size, unsigned char align)
>> {
>>  unsigned char lsbits = (unsigned char)size;
>> #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>>  (void)base;
>> #else
>>  lsbits |= (unsigned char)(uintptr_t)base;
>> #endif
>>  return (lsbits & (align - 1)) == 0;
>> }
>
>> Any preference?
>
> This one looks better in the sense that we don't suppress the warnings when
> it's not needed.

>>> For such primitives that operate on top of arrays we usually
>>> append 's' to the name. Currently the name is misleading.
>>> 
>>> Perhaps u32s_swap().
>> 
>> I don't worry much about the naming of static helper functions.
>> If they were exported, it would be a whole lot more important!
>> 
>> I find "u32s" confusing; I keep reading the "s" as "signed" rather
>> than a plural.
>
> For signedness we use prefixes; for plural, suffixes. I don't see any potential
> for confusion. And this is in use in the kernel a lot.
>
>> How about one of:
>> swap_bytes / swap_ints / swap_longs
>> swap_1 / swap_4 / swap_8
>
> longs are ambiguous, so I would prefer bit-sized types.

I already implemented Andrey's suggestions, which were the exact
opposite of yours.

Pistols at dawn?


 +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>>> 
>>> Why #ifdef is better than if (IS_ENABLED()) ?
>> 
>> Because CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is bool and not
>> tristate.  IS_ENABLED tests for 'y' or 'm' but we don't need it
>> for something that's only on or off.
>
> There is IS_BUILTIN(), though it's a common practice to use IS_ENABLED()
> even for boolean options (I think because of the naming of the macro).

Well, as I said earlier, #ifdef is the most common form in the kernel.
It's also the shortest to write, and I like the fact that it's slightly
simpler.  (Admittedly, "IS_ENABLED" does not take a lot of brain power
to interpret, but it *is* one more macro that might be hiding magic.)

So I'm not inclined to change it without a substantial reason.


Re: [PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-14 Thread Andy Shevchenko
On Sat, Mar 09, 2019 at 03:53:41PM +, l...@sdf.org wrote:
> Andy Shevchenko wrote:
> > On Thu, Feb 21, 2019 at 06:30:28AM +, George Spelvin wrote:
> >> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > 
> > Why #ifdef is better than if (IS_ENABLED()) ?
> 
> Because CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is bool and not
> tristate.  IS_ENABLED tests for 'y' or 'm' but we don't need it
> for something that's only on or off.

There is IS_BUILTIN(), though it's a common practice to use IS_ENABLED() even
for boolean options (I think because of the naming of the macro).

> Looking through the kernel, I see both, but #ifdef or #if defined()
> are definitely in the majority:
> 
> lib/lzo/lzo1x_compress.c:#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) 
> && defined(LZO_USE_CTZ64)
> lib/siphash.c:#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> lib/string.c:#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> lib/strncpy_from_user.c:#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> lib/zlib_inflate/inffast.c:#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> 
> I see a few IS_ENABLED uses in include/crypto/ and kernel/bpf/.
> 
> It makes no real difference; #ifdef is simpler to me.


> static bool __attribute_const__
> is_aligned(const void *base, size_t size, unsigned char align)
> {
>   unsigned char lsbits = (unsigned char)size;
> #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>   (void)base;
> #else
>   lsbits |= (unsigned char)(uintptr_t)base;
> #endif
>   return (lsbits & (align - 1)) == 0;
> }

> Any preference?

This one looks better in the sense that we don't suppress the warnings when
it's not needed.

> > For such primitives that operate on top of arrays we usually append 's' to
> > the name. Currently the name is misleading.
> > 
> > Perhaps u32s_swap().
> 
> I don't worry much about the naming of static helper functions.
> If they were exported, it would be a whole lot more important!
> 
> I find "u32s" confusing; I keep reading the "s" as "signed" rather
> than a plural.

For signedness we use prefixes; for plural, suffixes. I don't see any potential
for confusion. And this is in use in the kernel a lot.

> How about one of:
> swap_bytes / swap_ints / swap_longs
> swap_1 / swap_4 / swap_8

longs are ambiguous, so I would prefer bit-sized types.

> > Shouldn't simple memcpy cover these cases for both 32- and 64-bit
> > architectures?
> 
> This isn't a memcpy, it's a memory *swap*.  To do it with memcpy
> requires:
>   memcpy(temp_buffer, a, size);
>   memcpy(a, b, size);
>   memcpy(b, temp_buffer, size);
> 
> This is 1.5x as much memory access, and you have to find a large
> enough temp_buffer.  (I didn't think a variable-length array on
> the stack would make people happy.)
> 
> Also, although it is a predictable branch, memcpy() has to check the
> alignment of its inputs each call.  The reason for these helpers is
> to factor that out.

Makes sense.

-- 
With Best Regards,
Andy Shevchenko




Re: [PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-13 Thread George Spelvin
Thank you for your thoughtful comments!

On Wed, 13 Mar 2019 at 23:23:44 +0100, Rasmus Villemoes wrote:
> On 21/02/2019 07.30, George Spelvin wrote:
> + * @align: required aignment (typically 4 or 8)
>
> typo aLignment

Thanks; fixed!

>> + * Returns true if elements can be copied using word loads and stores.
>> + * The size must be a multiple of the alignment, and the base address must
>> + * be if we do not have CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.
>
> I wonder if we shouldn't unconditionally require the base to be aligned.
> It of course depends on what exactly 'efficient' means, but if the base
> is not aligned, we know that every single load and store will be
> unaligned, and we're doing lots of them. IIRC even on x86 it's usually
> two cycles instead of one, and even more when we cross a cache line
> boundary (which we do rather often - e.g. in your 40 byte example,
> almost every swap will hit at least one). One could also have some data
> that is naturally 4-byte aligned with an element size that happens to be
> a multiple of 8 bytes; it would be a bit sad to use 8-byte accesses when
> 4-byte would all have been aligned.

Well, a 2-cycle unaligned access is still much faster than four separate byte accesses.
I think that's a decent interpretation of "EFFICIENT_UNALIGNED_ACCESS":
faster than doing it a byte at a time.  So the change is a net win;
we're just wondering if it could be optimized more.

The 4/8 issue is similar, but not as extreme.  Is one unaligned 64-bit
access faster than two aligned 32-bit accesses?  As you note, it's usually
only twice the cost, so it's no slower, and there's less loop overhead.
So I don't think it's a bad thing here, either.

Re your comments on cache lines, ISTR that x86 can do an unaligned
load with minimal penalty as long as it doesn't cross a cache line:
https://software.intel.com/en-us/articles/reducing-the-impact-of-misaligned-memory-accesses/
So you win on most of the accesses, hopefully enough to pay for the
one unaligned access.  Apparently in processors more recent than
the P4 example Intel used above, it's even less:
https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/

On 32-bit machines, it's actually a 4-byte swap, unrolled twice;
there are no 64-bit memory accesses.  So the concern is only about
8-byte alignment on 64-bit machines.

The great majority of call sites sort structures with pointer or
long members, so are aligned and the question is moot.  I don't
think it's worth overthinking the issue on behalf of the performance
of some rare corner cases.

I have considered doing away with the two word-copy loops and just
having one machine-word-sized loop plus a byte fallback.
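
As a rough sketch of that idea (not the patch code; the helper names and the
shape of the selection below are illustrative only):

/* One machine-word-sized swap loop... */
static void swap_words(void *a, void *b, int size)
{
	size_t n = size;

	do {
		unsigned long t;

		n -= sizeof(unsigned long);
		t = *(unsigned long *)(a + n);
		*(unsigned long *)(a + n) = *(unsigned long *)(b + n);
		*(unsigned long *)(b + n) = t;
	} while (n);
}

/* ...plus the byte-at-a-time fallback. */
static void swap_bytes(void *a, void *b, int size)
{
	size_t n = size;

	do {
		char t = ((char *)a)[--n];

		((char *)a)[n] = ((char *)b)[n];
		((char *)b)[n] = t;
	} while (n);
}

The selection in sort() would then collapse to a two-way choice:

	if (is_aligned(base, size, sizeof(unsigned long)))
		swap_func = swap_words;
	else
		swap_func = swap_bytes;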

>> +unsigned int lsbits = (unsigned int)size;
>
> Drop that cast.

Actually, in response to an earlier comment, I changed it (and the
type of lsbits) to (unsigned char), to emphasize that I *really* only
care about a handful of low bits.

I know gcc doesn't warn about implicit narrowing casts, but
I prefer to make them explicit for documentation reasons.
"Yes, I really mean to throw away the high bits."

Compilers on machines without native byte operations are very good
at using larger registers and optimizing away mask operations.
(Remember that this is inlined, and "align" is a constant 4 or 8.)
>
>> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>> +(void)base;
>> +#else
>> +lsbits |= (unsigned int)(size_t)base;
>
> The kernel usually casts pointers to long or unsigned long. If some
> oddball arch has size_t something other than unsigned long, this would
> give a "warning: cast from pointer to integer of different size". So
> just cast to unsigned long, and drop the cast to unsigned int.

I changed this to uintptr_t and called it good.

>>  static void u32_swap(void *a, void *b, int size)
>>  {
>> -u32 t = *(u32 *)a;
>> -*(u32 *)a = *(u32 *)b;
>> -*(u32 *)b = t;
>> +size_t n = size;
>> +
>
> Since the callback has int in its prototype (we should fix that globally
> at some point...), might as well use a four-byte local variable, to
> avoid rex prefix on x86-64. So just make that "unsigned" or "u32".

Agreed about the eventual global fix.  If you care, the only places
a non-NULL swap function is passed in are:
- arch/arc/kernel/unwind.c
- arch/powerpc/kernel/module_32.c
- arch/powerpc/kernel/module_64.c
! arch/x86/kernel/unwind_orc.c (2x)
- fs/ocfs2/dir.c
- fs/ocfs2/refcounttree.c (3x)
- fs/ocfs2/xattr.c (3x)
- fs/ubifs/find.c
! kernel/jump_label.c
! lib/extable.c

The ones marked with "-" are simple memory swaps that could (should!)
be replaced with NULL.  The ones marked with "!" actually do
something non-trivial.
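
As a hedged example (the struct, field, and variable names below are made up,
not taken from any of the files above), a "-" call site reduces to:

/* Hypothetical element type and comparison callback. */
struct item {
	unsigned long key;
	unsigned long value;
};

static int cmp_items(const void *a, const void *b)
{
	const struct item *ia = a, *ib = b;

	if (ia->key != ib->key)
		return ia->key < ib->key ? -1 : 1;
	return 0;
}

	/* Passing NULL lets sort() pick its own word-wide or byte-wide swap. */
	sort(items, count, sizeof(*items), cmp_items, NULL);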


Actually, I deliberately used a pointer-sized index, since the loop
uses (a + n) addressing.  Hardware indexed addressing on 64-bit
machines generally (I vaguely recall Aarch64 is an exception)
requires a 64-bit index; using a smaller index requires some masking
or sign-extending.

I thought it was safer to bet that the c

Re: [PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-13 Thread Geert Uytterhoeven
On Wed, Mar 13, 2019 at 10:23 PM Rasmus Villemoes
 wrote:
> On 21/02/2019 07.30, George Spelvin wrote:
> > +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > + (void)base;
> > +#else
> > + lsbits |= (unsigned int)(size_t)base;
>
> The kernel usually casts pointers to long or unsigned long. If some
> oddball arch has size_t something other than unsigned long, this would
> give a "warning: cast from pointer to integer of different size". So
> just cast to unsigned long, and drop the cast to unsigned int.

The proper integer equivalent of a pointer is uintptr_t, so please use
that instead of size_t.

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-13 Thread Rasmus Villemoes
On 21/02/2019 07.30, George Spelvin wrote:
> Rather than u32_swap and u64_swap working on 4- and 8-byte objects
> directly, let them handle any multiple of 4 or 8 bytes.  This speeds
> up most users of sort() by avoiding fallback to the byte copy loop.
> 
> Despite what commit ca96ab859ab4 ("lib/sort: Add 64 bit swap function")
> claims, very few users of sort() sort pointers (or pointer-sized
> objects); most sort structures containing at least two words.
> (E.g. drivers/acpi/fan.c:acpi_fan_get_fps() sorts an array of 40-byte
> struct acpi_fan_fps.)
> 
> x86-64 code size 872 -> 885 bytes (+8)
> 
> Signed-off-by: George Spelvin 
> ---
>  lib/sort.c | 117 +++--
>  1 file changed, 96 insertions(+), 21 deletions(-)
> 
> diff --git a/lib/sort.c b/lib/sort.c
> index d6b7a202b0b6..dff2ab2e196e 100644
> --- a/lib/sort.c
> +++ b/lib/sort.c
> @@ -11,35 +11,110 @@
>  #include 
>  #include 
>  
> -static int alignment_ok(const void *base, int align)
> +/**
> + * alignment_ok - is this pointer & size okay for word-wide copying?
> + * @base: pointer to data
> + * @size: size of each element
> + * @align: required aignment (typically 4 or 8)
> + *

typo aLignment

> + * Returns true if elements can be copied using word loads and stores.
> + * The size must be a multiple of the alignment, and the base address must
> + * be if we do not have CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.
> + *

I wonder if we shouldn't unconditionally require the base to be aligned.
It of course depends on what exactly 'efficient' means, but if the base
is not aligned, we know that every single load and store will be
unaligned, and we're doing lots of them. IIRC even on x86 it's usually
two cycles instead of one, and even more when we cross a cache line
boundary (which we do rather often - e.g. in your 40 byte example,
almost every swap will hit at least one). One could also have some data
that is naturally 4-byte aligned with an element size that happens to be
a multiple of 8 bytes; it would be a bit sad to use 8-byte accesses when
4-byte would all have been aligned.

> + * For some reason, gcc doesn't know to optimize "if (a & mask || b & mask)"
> + * to "if ((a | b) & mask)", so we do that by hand.
> + */
> +static bool __attribute_const__
> +alignment_ok(const void *base, size_t size, unsigned int align)
>  {
> - return IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
> - ((unsigned long)base & (align - 1)) == 0;
> + unsigned int lsbits = (unsigned int)size;

Drop that cast.

> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> + (void)base;
> +#else
> + lsbits |= (unsigned int)(size_t)base;

The kernel usually casts pointers to long or unsigned long. If some
oddball arch has size_t something other than unsigned long, this would
give a "warning: cast from pointer to integer of different size". So
just cast to unsigned long, and drop the cast to unsigned int.

If you do drop CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS, this all becomes
much simpler, of course

lsbits = size | (unsigned long)base;

> +#endif
> + lsbits &= align - 1;
> + return lsbits == 0;
>  }
>  
> +/**
> + * u32_swap - swap two elements in 32-bit chunks
> + * @a, @b: pointers to the elements
> + * @size: element size (must be a multiple of 4)
> + *
> + * Exchange the two objects in memory.  This exploits base+index addressing,
> + * which basically all CPUs have, to minimize loop overhead computations.
> + *
> + * For some reason, on x86 gcc 7.3.0 adds a redundant test of n at the
> + * bottom of the loop, even though the zero flag is stil valid from the
> + * subtract (since the intervening mov instructions don't alter the flags).
> + * Gcc 8.1.0 doesn't have that problem.
> + */
>  static void u32_swap(void *a, void *b, int size)
>  {
> - u32 t = *(u32 *)a;
> - *(u32 *)a = *(u32 *)b;
> - *(u32 *)b = t;
> + size_t n = size;
> +

Since the callback has int in its prototype (we should fix that globally
at some point...), might as well use a four-byte local variable, to
avoid rex prefix on x86-64. So just make that "unsigned" or "u32".

> + do {
> + u32 t = *(u32 *)(a + (n -= 4));
> + *(u32 *)(a + n) = *(u32 *)(b + n);
> + *(u32 *)(b + n) = t;
> + } while (n);
>  }

Hm, are we absolutely sure there's no weird .config where some struct
ends up being empty, yet we still sort an array of them? It's far
fetched, of course, but perhaps an "if (unlikely(!size)) return;" at the
start of sort() would be in order. Dunno, I'll leave it to you.

Rasmus


Re: [PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-09 Thread George Spelvin
Andy Shevchenko wrote:
> Shouldn't simple memcpy cover these cases for both 32- and 64-bit
> architectures?

Speaking of replacing swap with copying via temporary buffers, one
idea that did come to mind was avoiding swap for sufficiently small
objects.

Every sift-down is actually a circular shift.  Once the target
position has been found, we rotate the path from the root to the
target up one, with the target filled in by the previous root.

(When breaking down the heap, there's one additional link in the
cycle, where the previous root goes to the end of the heap and the
end of the heap goes to the target.)

If we had a temporary buffer (128 bytes would handle most things),
we could copy the previous root there and *copy*, rather than swap,
the elements on the path to the target up, then finally copy the
previous root to the target.

However, to rotate up, this must be done in top-down order.
The way the code currently works with premultiplied offsets, it's
easy to traverse bottom-up, but very awkward to retrace the
top-down path.

The only solution I can think of is to build a bitmap of the
left/right turnings from the root to the leaf, and then back it up
to the target.  There are a few ways to do this:

1) The obvious big-endian order.  The bitmap is simply the 1-based
   position of the leaf. To add a level, shift left and add the new
   bit at the bottom.  To back up a step, shift right.

   To retrace, create a 1-bit mask equal to the msbit of the index
   ((smear_right(x) >> 1) + 1) and repeatedly shift the mask right.

2) The reverse order.  We use a 1-bit mask while building the
   bitmap, and while retracing, just examine the lsbit while shifting
   the bitmap right.

3) As option 1, but don't build the bitmap as we're walking down;
   rather reconstruct it from the premultiplied offset using
   reciprocal_divide().
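
To make option 2 concrete, here is a rough sketch (purely illustrative helper
names, using the usual 0-based child = 2*parent + 1 indexing rather than the
premultiplied offsets the real code uses):

/*
 * Record one turn while walking down; equivalent to the shifting 1-bit
 * mask, with the first turn landing in the lsbit.
 */
static unsigned long record_turn(unsigned long path, unsigned int depth,
				 bool went_right)
{
	return path | ((unsigned long)went_right << depth);
}

/* Retrace one step from the top, consuming one bit per level. */
static size_t retrace_step(size_t node, unsigned long *path)
{
	size_t child = 2 * node + 1 + (*path & 1);

	*path >>= 1;
	return child;
}

The copy-up rotation would then call retrace_step() from the root until it
reaches the target, copying each visited element over its parent.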

Nothing really jumps out to me as The Right Way to do it.

I don't want to bloat the code to the point that it would be
easier to implement a different algorithm entirely.


Re: [PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-09 Thread Andrey Abramov
> Although I'm thinking of:
>
> static bool __attribute_const__
> is_aligned(const void *base, size_t size, unsigned char align)
> {
> unsigned char lsbits = (unsigned char)size;
>
> (void)base;
> #ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> lsbits |= (unsigned char)(uintptr_t)base;
> #endif
> return (lsbits & (align - 1)) == 0;
> }
>
> Any preference?
I think it would be better.

> I find "u32s" confusing; I keep reading the "s" as "signed" rather than a
> plural.  How about one of:
> swap_bytes / swap_ints / swap_longs
> swap_1 / swap_4 / swap_8
In my opinion "swap_bytes / swap_ints / swap_longs" are the most readable.

(Good job)


Re: [PATCH 1/5] lib/sort: Make swap functions more generic

2019-03-09 Thread lkml
Thank you for the feedback!

Andy Shevchenko wrote:
> On Thu, Feb 21, 2019 at 06:30:28AM +, George Spelvin wrote:
>> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> 
> Why #ifdef is better than if (IS_ENABLED()) ?

Because CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is bool and not
tristate.  IS_ENABLED tests for 'y' or 'm' but we don't need it
for something that's only on or off.

Looking through the kernel, I see both, but #ifdef or #if defined()
are definitely in the majority:

lib/lzo/lzo1x_compress.c:#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && 
defined(LZO_USE_CTZ64)
lib/siphash.c:#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
lib/string.c:#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
lib/strncpy_from_user.c:#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
lib/zlib_inflate/inffast.c:#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS

I see a few IS_ENABLED uses in include/crypto/ and kernel/bpf/.

It makes no real difference; #ifdef is simpler to me.

>> +(void)base;
> 
> I understand the possible weirdness of the case, but 0 pointer is also good, 
> no?

I don't quite understand the question.  A NULL pointer is aligned
as far as alignment_ok is concerned.  In other words, word accesses to
*NULL will (not) work just as well as byte accesses.

None of this matters because the callers never pass in NULL pointers.

>> +#else
>> +lsbits |= (unsigned int)(size_t)base;
> 
> Perhaps you meant simple (uintptr_t) here and maybe above as well.

No, I meant to truncate it to the same type as "align".  We only
care about the low few bits; I could have used (unsigned char) just
as well.

My reason for choosing unsigned int rather than unsigned char was
that both x86 and ARM are a tiny bit more efficient at 32-bit
operations (x86 avoids a REX prefix; ARM gates off the high half
of the ALU to save power), but only x86 does byte operations
natively.

Still, compilers know how to promote to word operations, so
maybe using unsigned char would make it clearer to the
reader that "yes, I meant to do that!".

I've changed it to:

static bool __attribute_const__
is_aligned(const void *base, size_t size, unsigned char align)
{
	unsigned char lsbits = (unsigned char)size;
#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
	(void)base;
#else
	lsbits |= (unsigned char)(uintptr_t)base;
#endif
	return (lsbits & (align - 1)) == 0;
}

Although I'm thinking of:

static bool __attribute_const__
is_aligned(const void *base, size_t size, unsigned char align)
{
	unsigned char lsbits = (unsigned char)size;

	(void)base;
#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
	lsbits |= (unsigned char)(uintptr_t)base;
#endif
	return (lsbits & (align - 1)) == 0;
}

Any preference?

>> +#endif
>> +lsbits &= align - 1;
>> +return lsbits == 0;
> 
> and this can be one return operation.

It was just to avoid some annoying parentheses because C's precedence
rules and GCC's warnings are fighting me here.  If others prefer
"return (lsbits & (align - 1)) == 0;" I'm happy to change.

>>  static void u32_swap(void *a, void *b, int size)
> 
> For such primitives that operate on top of arrays we usually append 's' to
> the name. Currently the name is misleading.
> 
> Perhaps u32s_swap().

I don't worry much about the naming of static helper functions.
If they were exported, it would be a whole lot more important!

I find "u32s" confusing; I keep reading the "s" as "signed" rather
than a plural.

How about one of:
swap_bytes / swap_ints / swap_longs
swap_1 / swap_4 / swap_8

> Shouldn't simple memcpy cover these cases for both 32- and 64-bit
> architectures?

This isn't a memcpy, it's a memory *swap*.  To do it with memcpy
requires:
	memcpy(temp_buffer, a, size);
	memcpy(a, b, size);
	memcpy(b, temp_buffer, size);

This is 1.5x as much memory access, and you have to find a large
enough temp_buffer.  (I didn't think a variable-length array on
the stack would make people happy.)

Also, although it is a predictable branch, memcpy() has to check the
alignment of its inputs each call.  The reason for these helpers is
to factor that out.
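
For comparison, a rough sketch of the two approaches (not the patch code; the
names are illustrative).  The direct swap loads and stores each byte of a and
b exactly once; the memcpy route needs a scratch buffer and 1.5x the memory
traffic:

/* Direct byte-at-a-time swap: 4*size memory accesses in total. */
static void swap_bytes_direct(void *a, void *b, int size)
{
	size_t n = size;

	do {
		char t = ((char *)a)[--n];

		((char *)a)[n] = ((char *)b)[n];
		((char *)b)[n] = t;
	} while (n);
}

/* Swap via memcpy: 6*size accesses, plus a temp buffer of >= size bytes. */
static void swap_via_memcpy(void *a, void *b, int size, void *tmp)
{
	memcpy(tmp, a, size);
	memcpy(a, b, size);
	memcpy(b, tmp, size);
}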