Re: [PATCH 0/2] Introduce gcc_qsort

2018-05-11 Thread Eric Botcazou
> I meant "anti-commutativity"; sorry about the mixup. 

symmetry/antisymmetry is more appropriate for binary relations (commutativity 
is for binary operations).

-- 
Eric Botcazou


Re: [PATCH 0/2] Introduce gcc_qsort

2018-05-11 Thread Segher Boessenkool
On Fri, May 11, 2018 at 03:03:09PM +0300, Alexander Monakov wrote:
> On Fri, 11 May 2018, Segher Boessenkool wrote:
> > > In general such address-based comparisons steps are invalid; they lack
> > > anti-reflexivity. So comparators are not actually allowed to do that.
> > 
> > I don't know what you mean here?  Every comparator is required to be
> > *reflexive*!  Subtracting two pointers to elements of the same array gives
> > a perfectly fine total order as far as I see (it is the same as the
> > difference between the array indices of those elements).
> 
> I meant "anti-commutativity"; sorry about the mixup. Such strategy does not
> give a valid total order because the order changes while in-progress sorting
> reorders elements:

Ah.  Right.  C11 7.22.5/4.  Sorry for being dense, I'll drink more coffee.


Segher


Re: [PATCH 0/2] Introduce gcc_qsort

2018-05-11 Thread Alexander Monakov
On Fri, 11 May 2018, Segher Boessenkool wrote:
> > In general such address-based comparisons steps are invalid; they lack
> > anti-reflexivity. So comparators are not actually allowed to do that.
> 
> I don't know what you mean here?  Every comparator is required to be
> *reflexive*!  Subtracting two pointers to elements of the same array gives
> a perfectly fine total order as far as I see (it is the same as the
> difference between the array indices of those elements).

I meant "anti-commutativity"; sorry about the mixup. Such strategy does not
give a valid total order because the order changes while in-progress sorting
reorders elements:

suppose you have elements A and B such that A precedes B and compare A < B
via address test. At some point, the sort routine may swap A with some
unrelated element C, moving A past B. Now B < A. This is a contradiction.

> In any case, every qsort call that is converted to one with this weaker
> contract will need to be checked.  Or we can wait for bug reports ;-)

I think such comparators have a good chance of failing existing qsort_chk
validation.

Alexander


Re: [PATCH 0/2] Introduce gcc_qsort

2018-05-11 Thread Segher Boessenkool
On Fri, May 11, 2018 at 01:35:00PM +0300, Alexander Monakov wrote:
> On Fri, 11 May 2018, Segher Boessenkool wrote:
> > For example from reload1.c:
> 
> No, this example is not relevant, because ...
> 
> > static int
> > reload_reg_class_lower (const void *r1p, const void *r2p)
> > {
> >   int r1 = *(const short *) r1p, r2 = *(const short *) r2p;
> > 
> > // ...
> > 
> >   return r1 - r2;
> 
> this subtracts integers, not pointers.

Oh whoops, I read that too fast.  Sorry.

> In general such address-based comparisons steps are invalid; they lack
> anti-reflexivity. So comparators are not actually allowed to do that.

I don't know what you mean here?  Every comparator is required to be
*reflexive*!  Subtracting two pointers to elements of the same array gives
a perfectly fine total order as far as I see (it is the same as the
difference between the array indices of those elements).

In any case, every qsort call that is converted to one with this weaker
contract will need to be checked.  Or we can wait for bug reports ;-)


Segher


Re: [PATCH 0/2] Introduce gcc_qsort

2018-05-11 Thread Jakub Jelinek
On Fri, May 11, 2018 at 05:29:05AM -0500, Segher Boessenkool wrote:
> Such a comparator will still work with temporary arrays if both elements
> are always in the same temp array (in deterministic order, etc.) array; but
> it won't work if the comparator did e.g.
> 
>   return (r1 - reload_order) - (r2 - reload_order);
> 
> (it would be undefined behaviour, to start with).

Yeah, UB, but likely would still "work".
What wouldn't work is if a comparator assumes certain sizes of the array
and e.g. stores the difference r1 - reload_order in char/short/int
because all callers guarantee that nmemb * size fits into such types.

Jakub


Re: [PATCH 0/2] Introduce gcc_qsort

2018-05-11 Thread Alexander Monakov
On Fri, 11 May 2018, Segher Boessenkool wrote:
> For example from reload1.c:

No, this example is not relevant, because ...

> static int
> reload_reg_class_lower (const void *r1p, const void *r2p)
> {
>   int r1 = *(const short *) r1p, r2 = *(const short *) r2p;
> 
> // ...
> 
>   return r1 - r2;

this subtracts integers, not pointers.

In general such address-based comparisons steps are invalid; they lack
anti-reflexivity. So comparators are not actually allowed to do that.

Alexander


Re: [PATCH 0/2] Introduce gcc_qsort

2018-05-11 Thread Segher Boessenkool
On Thu, May 10, 2018 at 07:40:57PM +0200, Richard Biener wrote:
> On May 10, 2018 5:56:39 PM GMT+02:00, Alexander Monakov  
> wrote:
> >/* This implements a sort function suitable for GCC use cases:
> >   - signature-compatible to C qsort, but relaxed contract:
> > - may apply the comparator to elements in a temporary buffer
> 
> What consequences has this or rather how is this observable and makes 
> comparators behave differently? 

A comparison function is allowed to compare the addresses of the elements,
or compute what should be the index into the array that is sorted from the
pointers to the elements, etc.  That will not work if the pointers to the
elements do not actually point into the original array.

For example from reload1.c:

static int
reload_reg_class_lower (const void *r1p, const void *r2p)
{
  int r1 = *(const short *) r1p, r2 = *(const short *) r2p;

// ...

  return r1 - r2;
}

(called as
static short reload_order[MAX_RELOADS];
// ...

  qsort (reload_order, n_reloads, sizeof (short), reload_reg_class_lower);
).

Such a comparator will still work with temporary arrays if both elements
are always in the same temp array (in deterministic order, etc.) array; but
it won't work if the comparator did e.g.

  return (r1 - reload_order) - (r2 - reload_order);

(it would be undefined behaviour, to start with).


Segher


Re: [PATCH 0/2] Introduce gcc_qsort

2018-05-10 Thread Alexander Monakov
On Thu, 10 May 2018, Richard Biener wrote:
> >   - signature-compatible to C qsort, but relaxed contract:
> > - may apply the comparator to elements in a temporary buffer
> 
> What consequences has this or rather how is this observable and makes 
> comparators behave differently? 

The only serious consequence I'm aware of is this:

the buffer must be sufficiently aligned to match the alignment requirement
of elements being sorted

Alexander


Re: [PATCH 0/2] Introduce gcc_qsort

2018-05-10 Thread Alexander Monakov
On Thu, 10 May 2018, Jakub Jelinek wrote:
> Have you gathered some statistics on the element sizes and how often they
> appear in qsort calls (perhaps weighted by n*log(n) of the element count)
> across bootstrap+regtest?

No, but Adhemerval Zanella collected stats on bootstrap, and they are similar
to my observations: https://sourceware.org/ml/libc-alpha/2018-01/msg00629.html

> glibc uses indirect sorting (sorts pointers to the elements or indexes and
> just reshuffles at the end) and has special case for the most commonly used
> small element size (4/8).  With C++ templates you could achieve that even
> without macros, just by instantiating the mergesort and its helpers for the
> few common cases.  Or is that not worth it (as in, we never sort really
> large (say > 32 bytes) element sizes and the 4 or 8 bytes long element sizes
> aren't common enough to see benefit by using constant size memcpy for those
> cases?

I think it's not worth it as branches by element size are not too frequent,
off the critical path, and easy for predictors. So doing it via templates
would cause significant code growth for no speed gain.

As for indirect sorting, we rarely sort elements of size other than 4/8, so
I believe that's not worth it either.

Alexander


Re: [PATCH 0/2] Introduce gcc_qsort

2018-05-10 Thread Richard Biener
On May 10, 2018 5:56:39 PM GMT+02:00, Alexander Monakov  
wrote:
>Hello.
>
>This introduces a replacement for qsort() in GCC. The main selling
>point is
>reproducibility (currently compiler output may change depending on how
>libc
>qsort reorders not-bitwise-identical elements that compare equal) with
>a 
>small improvement speed-wise and small code growth (under 2K on
>x86-64).
>
>The opening comment in sort.cc gives a brief implementation overview:
>
>/* This implements a sort function suitable for GCC use cases:
>   - signature-compatible to C qsort, but relaxed contract:
> - may apply the comparator to elements in a temporary buffer

What consequences has this or rather how is this observable and makes 
comparators behave differently? 

Otherwise thanks for doing this. Will review tomorrow. 

Richard. 

> - may abort on allocation failure
>   - deterministic (but not necessarily stable)
>   - fast, especially for common cases (0-5 elements of size 8 or 4)
>
>   The implementation uses a network sort for up to 5 elements and
>  a merge sort on top of that.  Neither stage has branches depending on
>comparator result, trading extra arithmetic for branch mispredictions. 
>*/
>
>I used a Sandy Bridge CPU to collect statistics on tramp3d -O2
>compilation.
>
>Overall the new implementation is roughly 30% faster compared to Glibc
>qsort,
>with 2x or more speedup for cases with tiny element count. I see one
>instance
>where the new approach is significantly (1.5x) slower: it is ipa-icf.c:
>sort_congruence_class_groups_by_decl_uid. It sorts a big array (1500
>entries)
>and needs 14 indirect loads just to reach values to compare, so when
>branch
>prediction manages to guess correctly, it allows to overlap execution
>of two
>comparators and better hide their cache miss latency.
>
>Overall GCC spends about 0.3% time under qsort, but this doesn't
>automatically
>mean that this brings a 0.1% speed improvement: it may be larger or
>smaller
>depending on how new code affects cache behavior and branch predictors
>in
>other code, and it's not trivial to measure precisely.
>
>I can go into more detail about measured stats if there's interest :)
>
>Makefile.in changes are separated to patch 2 in the hope it'd make
>review
>easier, but the two patches will need to be applied together.
>
>Bootstrapped/regtested on x86-64, OK for trunk?
>
>Alexander Monakov (2):
>  gcc_qsort: source code changes
>  gcc_qsort: build system changes
>
> gcc/Makefile.in |   9 ++-
>gcc/sort.cc | 232
>
> gcc/system.h|   7 +-
> gcc/vec.c   |   2 +-
> 4 files changed, 243 insertions(+), 7 deletions(-)
> create mode 100644 gcc/sort.cc



Re: [PATCH 0/2] Introduce gcc_qsort

2018-05-10 Thread Jakub Jelinek
On Thu, May 10, 2018 at 06:56:39PM +0300, Alexander Monakov wrote:
> Overall the new implementation is roughly 30% faster compared to Glibc qsort,
> with 2x or more speedup for cases with tiny element count. I see one instance
> where the new approach is significantly (1.5x) slower: it is ipa-icf.c:
> sort_congruence_class_groups_by_decl_uid. It sorts a big array (1500 entries)
> and needs 14 indirect loads just to reach values to compare, so when branch
> prediction manages to guess correctly, it allows to overlap execution of two
> comparators and better hide their cache miss latency.
> 
> Overall GCC spends about 0.3% time under qsort, but this doesn't automatically
> mean that this brings a 0.1% speed improvement: it may be larger or smaller
> depending on how new code affects cache behavior and branch predictors in
> other code, and it's not trivial to measure precisely.
> 
> I can go into more detail about measured stats if there's interest :)
> 
> Makefile.in changes are separated to patch 2 in the hope it'd make review
> easier, but the two patches will need to be applied together.
> 
> Bootstrapped/regtested on x86-64, OK for trunk?

Have you gathered some statistics on the element sizes and how often they
appear in qsort calls (perhaps weighted by n*log(n) of the element count)
across bootstrap+regtest?

glibc uses indirect sorting (sorts pointers to the elements or indexes and
just reshuffles at the end) and has special case for the most commonly used
small element size (4/8).  With C++ templates you could achieve that even
without macros, just by instantiating the mergesort and its helpers for the
few common cases.  Or is that not worth it (as in, we never sort really
large (say > 32 bytes) element sizes and the 4 or 8 bytes long element sizes
aren't common enough to see benefit by using constant size memcpy for those
cases?

Jakub


[PATCH 0/2] Introduce gcc_qsort

2018-05-10 Thread Alexander Monakov
Hello.

This introduces a replacement for qsort() in GCC. The main selling point is
reproducibility (currently compiler output may change depending on how libc
qsort reorders not-bitwise-identical elements that compare equal) with a 
small improvement speed-wise and small code growth (under 2K on x86-64).

The opening comment in sort.cc gives a brief implementation overview:

/* This implements a sort function suitable for GCC use cases:
   - signature-compatible to C qsort, but relaxed contract:
 - may apply the comparator to elements in a temporary buffer
 - may abort on allocation failure
   - deterministic (but not necessarily stable)
   - fast, especially for common cases (0-5 elements of size 8 or 4)

   The implementation uses a network sort for up to 5 elements and
   a merge sort on top of that.  Neither stage has branches depending on
   comparator result, trading extra arithmetic for branch mispredictions.  */

I used a Sandy Bridge CPU to collect statistics on tramp3d -O2 compilation.

Overall the new implementation is roughly 30% faster compared to Glibc qsort,
with 2x or more speedup for cases with tiny element count. I see one instance
where the new approach is significantly (1.5x) slower: it is ipa-icf.c:
sort_congruence_class_groups_by_decl_uid. It sorts a big array (1500 entries)
and needs 14 indirect loads just to reach values to compare, so when branch
prediction manages to guess correctly, it allows to overlap execution of two
comparators and better hide their cache miss latency.

Overall GCC spends about 0.3% time under qsort, but this doesn't automatically
mean that this brings a 0.1% speed improvement: it may be larger or smaller
depending on how new code affects cache behavior and branch predictors in
other code, and it's not trivial to measure precisely.

I can go into more detail about measured stats if there's interest :)

Makefile.in changes are separated to patch 2 in the hope it'd make review
easier, but the two patches will need to be applied together.

Bootstrapped/regtested on x86-64, OK for trunk?

Alexander Monakov (2):
  gcc_qsort: source code changes
  gcc_qsort: build system changes

 gcc/Makefile.in |   9 ++-
 gcc/sort.cc | 232 
 gcc/system.h|   7 +-
 gcc/vec.c   |   2 +-
 4 files changed, 243 insertions(+), 7 deletions(-)
 create mode 100644 gcc/sort.cc

-- 
2.13.3