Re: [PATCH v9] Practical improvement to libgcc complex divide

2021-04-01 Thread Patrick McGehearty via Gcc-patches

I would very much like to get all the details right on my next submission.
I'm submitting this query for feedback before I do the full submission.

Most changes requested require obvious fixes. One has me confused about
the best action to take.

- - - -

On the issue of alloca vs XALLOCAVEC, I discovered that XALLOCAVEC has
been around and recommended since at least 2010, but I could not find
any discussion of why XALLOCAVEC is preferred to alloca. Understanding
the purpose of making the change would help me resolve the following
decision.

Currently, in gcc/c-family/c-cppbuiltin.c, XALLOCAVEC does not appear
at all. alloca appears 13 times. Both appear in other files in the
same directory. In the routine I'm modifying, four current allocations
to macro_name use alloca.
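
For reference, XALLOCAVEC (from include/libiberty.h) is just a typed
wrapper around alloca, roughly
#define XALLOCAVEC(T, N)  ((T *) alloca (sizeof (T) * (N))),
so the first macro_name allocation in the patch would read something
like:

  char *macro_name
    = XALLOCAVEC (char, name_len + sizeof ("__LIBGCC__MANT_DIG__"));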

I see four options:

1) Use alloca as that is consistent with the existing style within
the source file. That is my current choice, but disliked by the
reviewer.

2) If I use XALLOCAVEC just for my three new uses of alloca, which
also assign to macro_name, we will have an inconsistent coding
pattern. That violates the principle of maintaining existing style
within a source file or routine. This choice seems ugly.

3) If I use XALLOCAVEC for my three new uses of alloca and
change the other four allocations to macro_name in the same
routine but leave the other 9 uses of alloca alone, we have
consistency in the modified routine but not the source file.
If this is the recommended procedure, I'm willing.

4) If I change all uses of alloca to XALLOCAVEC in the file, that makes
the file consistent but adds source file changes to the patch in
routines that are completely unrelated to my goal of improving complex
divide. It increases the risk of causing subtle problems, either
through my editing errors or through my missing some difference
between alloca and XALLOCAVEC.  I'm less comfortable with this option.

- - - -
- - - -
The rest appear to be editing issues on my part.
I will make the following changes.

- - - -

I will fix the type of name_len to const size_t.

- - - -


I will fix abort/exit in the three test files to be:
extern void abort(void);
extern void exit(int status);

- - - -
I will proofread and fix the comments in the three test files
(than/that, the missing "to", etc.).

- - - -
In libgcc/config/rs6000/_divkc3.c

I see one additional blank line that will be removed.

- - - -
In libgcc/libgcc2.c

- - - -

I admit the names XCTYPE and XMTYPE are not very imaginative. My
intention was to represent "extra precision", but the potential
confusion with extended precision did not occur to me at the time.
XCTYPE and XMTYPE are intended to be the next larger precision so that
the same source code can be used for both the half precision and float
precision cases.

Maybe it would be better to think "additional precision",
and use ACTYPE and AMTYPE?
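
If that direction is acceptable, the intent could be sketched roughly
as below; the #ifdef guard names and exact types here are only
illustrative, not necessarily the patch's structure:

  /* Illustrative sketch only: the complex-divide body is written once
     in terms of AMTYPE, the next wider format, so the same source
     serves both the half- and single-precision builds.  */
  #if defined (L_divhc3)            /* hypothetical guard name */
  # define MTYPE  _Float16          /* operand precision */
  # define AMTYPE float             /* "additional precision" working type */
  #elif defined (L_divsc3)
  # define MTYPE  float
  # define AMTYPE double
  #endif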

- - - -
Comment "...errors than can..." should be "...errors that can..."
Will make the change.


On 3/31/2021 12:03 PM, Bernhard Reutner-Fischer wrote:

On Fri, 26 Mar 2021 23:14:41 +
Patrick McGehearty via Gcc-patches  wrote:


diff --git a/gcc/c-family/c-cppbuiltin.c b/gcc/c-family/c-cppbuiltin.c
index 9f993c4..c0d9e57 100644
--- a/gcc/c-family/c-cppbuiltin.c
+++ b/gcc/c-family/c-cppbuiltin.c
@@ -1277,8 +1277,10 @@ c_cpp_builtins (cpp_reader *pfile)
{
  scalar_float_mode mode = mode_iter.require ();
  const char *name = GET_MODE_NAME (mode);
+ const int name_len = strlen (name);

strlen returns a size_t

Funny that we do not have a pre-seeded mode_name_len array but call
strlen on modes over and over again.


+ char float_h_prefix[16] = "";
  char *macro_name
-   = (char *) alloca (strlen (name)
+   = (char *) alloca (name_len
   + sizeof ("__LIBGCC__MANT_DIG__"));
  sprintf (macro_name, "__LIBGCC_%s_MANT_DIG__", name);
  builtin_define_with_int_value (macro_name,
@@ -1286,20 +1288,29 @@ c_cpp_builtins (cpp_reader *pfile)
  if (!targetm.scalar_mode_supported_p (mode)
  || !targetm.libgcc_floating_mode_supported_p (mode))
continue;
- macro_name = (char *) alloca (strlen (name)
+ macro_name = (char *) alloca (name_len
+ sizeof ("__LIBGCC_HAS__MODE__"));
  sprintf (macro_name, "__LIBGCC_HAS_%s_MODE__", name);
  cpp_define (pfile, macro_name);
- macro_name = (char *) alloca (strlen (name)
+ macro_name = (char *) alloca (name_len
+ sizeof ("__LIBGCC__FUNC_EXT__"));
  sprintf (macro_name, "__LIBGCC_%s_FUNC_EXT__", name);

The above should have been split out as separate independent patchlet
to get it out of the way.

As noted already the use of alloca is a pre-existing (coding-style) bug
and we should use XALLOCAVEC or appropriately sized fixed buffers to
begin with. The amount of alloca in defining all these is amazing.



Re: [PATCH v9] Practical improvement to libgcc complex divide

2021-03-31 Thread Bernhard Reutner-Fischer via Gcc-patches
On Fri, 26 Mar 2021 23:14:41 +
Patrick McGehearty via Gcc-patches  wrote:

> diff --git a/gcc/c-family/c-cppbuiltin.c b/gcc/c-family/c-cppbuiltin.c
> index 9f993c4..c0d9e57 100644
> --- a/gcc/c-family/c-cppbuiltin.c
> +++ b/gcc/c-family/c-cppbuiltin.c
> @@ -1277,8 +1277,10 @@ c_cpp_builtins (cpp_reader *pfile)
>   {
> scalar_float_mode mode = mode_iter.require ();
> const char *name = GET_MODE_NAME (mode);
> +   const int name_len = strlen (name);

strlen returns a size_t

Funny that we do not have a pre-seeded mode_name_len array but call
strlen on modes over and over again.

> +   char float_h_prefix[16] = "";
> char *macro_name
> - = (char *) alloca (strlen (name)
> + = (char *) alloca (name_len
>  + sizeof ("__LIBGCC__MANT_DIG__"));
> sprintf (macro_name, "__LIBGCC_%s_MANT_DIG__", name);
> builtin_define_with_int_value (macro_name,
> @@ -1286,20 +1288,29 @@ c_cpp_builtins (cpp_reader *pfile)
> if (!targetm.scalar_mode_supported_p (mode)
> || !targetm.libgcc_floating_mode_supported_p (mode))
>   continue;
> -   macro_name = (char *) alloca (strlen (name)
> +   macro_name = (char *) alloca (name_len
>   + sizeof ("__LIBGCC_HAS__MODE__"));
> sprintf (macro_name, "__LIBGCC_HAS_%s_MODE__", name);
> cpp_define (pfile, macro_name);
> -   macro_name = (char *) alloca (strlen (name)
> +   macro_name = (char *) alloca (name_len
>   + sizeof ("__LIBGCC__FUNC_EXT__"));
> sprintf (macro_name, "__LIBGCC_%s_FUNC_EXT__", name);

The above should have been split out as separate independent patchlet
to get it out of the way.

As noted already the use of alloca is a pre-existing (coding-style) bug
and we should use XALLOCAVEC or appropriately sized fixed buffers to
begin with. The amount of alloca in defining all these is amazing.

> diff --git a/gcc/testsuite/gcc.c-torture/execute/ieee/cdivchkd.c 
> b/gcc/testsuite/gcc.c-torture/execute/ieee/cdivchkd.c
> new file mode 100644
> index 000..409123f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.c-torture/execute/ieee/cdivchkd.c
> @@ -0,0 +1,126 @@
> +/*
> +  Program to test complex divide for correct results on selected values.
> +  Checking known failure points.
> +*/
> +
> +#include 
> +
> +extern void abort ();
> +extern void exit ();

As Joseph pointed out in the v8 review already, please use prototyped
function declarations in all new tests.

> diff --git a/gcc/testsuite/gcc.c-torture/execute/ieee/cdivchkld.c 
> b/gcc/testsuite/gcc.c-torture/execute/ieee/cdivchkld.c
> new file mode 100644
> index 000..28d707d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.c-torture/execute/ieee/cdivchkld.c
> @@ -0,0 +1,167 @@
> +/*
> +  Program to test complex divide for correct results on selected values.
> +  Checking known failure points.
> +*/
> +
> +#include 
> +
> +extern void abort ();
> +extern void exit ();

Likewise.

> +#if (LDBL_MAX_EXP < 2048)
> +  /*
> +Test values when mantissa is 11 or fewer bits.  Either LDBL is
> +using DBL on this platform or we are using IBM extended double
> +precision Test values will be automatically trucated the available
> +precision.

s/trucated/truncated/g; # with an 'n'
I have trouble parsing the sentence nevertheless. There might be a
missing "which" somewhere and maybe a "to"?

> +#else
> +  /*
> +Test values intended for either IEEE128 or Intel80 formats.  In
> +either case, 15 bits of exponent are available.  Test values will
> +be automatically trucated the available precision.
> +  */

again, truNcated and a couple of words missing?

> diff --git a/libgcc/config/rs6000/_divkc3.c b/libgcc/config/rs6000/_divkc3.c
> index d261f40..f57e14f 100644
> --- a/libgcc/config/rs6000/_divkc3.c
> +++ b/libgcc/config/rs6000/_divkc3.c
> @@ -37,31 +37,118 @@ see the files COPYING3 and COPYING.RUNTIME respectively. 
>  If not, see
>  #define __divkc3 __divkc3_sw
>  #endif
>  
> +#define RBIG   (__LIBGCC_TF_MAX__ / 2)
> +#define RMIN   (__LIBGCC_TF_MIN__)
> +#define RMIN2  (__LIBGCC_TF_EPSILON__)
> +#define RMINSCAL (1 / __LIBGCC_TF_EPSILON__)
> +#define RMAX2  (RBIG * RMIN2)
> +
> +
>  TCtype
>  __divkc3 (TFtype a, TFtype b, TFtype c, TFtype d)
>  {
>TFtype denom, ratio, x, y;
>TCtype res;
>  
> -  /* ??? We can get better behavior from logarithmic scaling instead of
> - the division.  But that would mean starting to link libgcc against
> - libm.  We could implement something akin to ldexp/frexp as gcc builtins
> - fairly easily...  */
> +  /* long double has significant potential underflow/overflow errors than
> + can be greatly reduced with a limited number of tests and adjustments.
> +  */

"than" doesn't parse. that?

>else
>  {
> +  /* Prevent underflow when denominator is near max representable.  */
> +  if (FABS (c) >= RBIG)
> + {
> +   

Re: [PATCH v9] Practical improvement to libgcc complex divide

2021-03-30 Thread Patrick McGehearty via Gcc-patches

Thank you for your interest in this project.

On 3/27/2021 6:18 PM, Bernhard Reutner-Fischer wrote:

On Fri, 26 Mar 2021 23:14:41 +
Patrick McGehearty via Gcc-patches  wrote:


Changes in Version 9 since Version 8:

Revised code to meet gcc coding standards in all files, especially
 with respect to adding spaces around operations and removing
 excess () in #define macro definitions.

Did you gather cycle counter diffs for current against new for the
normal and some edge cases?

I don't see additional value in looking at cycle counters in addition
to the timing information I presented. I prefer wall clock time to
cycle counters, perhaps because cycle counters are less reliably
implemented on some platforms (that's a separate discussion, though).
Every platform will have different counters and different timing.

Let me run through a rough operation count analysis of the double
precision case. I will also postulate instruction latencies
that might be typical of a recent processor. Real processors
will differ somewhat but not by large amounts.

The current method uses Smith's method. Showing half of that method:
  if (fabs(c) < fabs(d))
    {
      ratio = c / d;
      denom = (c * ratio) + d;
      xx = ((a * ratio) + b) / denom;
      yy = ((b * ratio) - a) / denom;
    }
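
For completeness, the mirrored branch of Smith's method (reconstructed
here; the operation counts below apply equally to either branch):

  else
    {
      ratio = d / c;
      denom = (d * ratio) + c;
      xx = ((b * ratio) + a) / denom;
      yy = (b - (a * ratio)) / denom;
    }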


It requires:
  two fabs computations (typically 1 cycle each)
  one floating-point compare (typically 2 cycles)
  1 branch decision (50-50 prediction reliability)
    (1 cycle if predicted, 9 cycles if not; average = 5)
  3 multiplies, 3 divides, 3 add/subtract operations
    (assume 5-cycle latency for add/mul and 15 cycles for divide)
    (75 cycles)
Total: 84 cycles (typical average, old method)

I omit the NAN analysis as that is unchanged.

The cost of the new method is data dependent.
First, assume all scaling tests are false and ratio does not underflow.
Further, assume that data is consistently in those ranges, to allow
all branch predictions except the first to be predicted.
I'll also assume we have an excellent branch predictor mechanism
that can handle predicting that many branches correctly.

Then we have:
5 additional fabs, 4 additional compares and branches (all predicted),
and one || operation, for 18 additional cycles.
[This optimistic count assumes fabs(a) and fabs(b) > RMIN, allowing
us to bypass the fabs < RMAX2 tests.]

New total: 84+18 = 102 (optimistic average, new method)

If we can't bypass the fabs < RMAX2 tests, we need to add
four more fabs, and four && operations, and four compare
operations for 12 more cycles.
Possible total: 84+18+12 = 114

Finally, if we hit a scaling operation, then we have four
multiply operations for an additional 20 cycles plus
we've mispredicted the branch for 9 more.
114 + 29 = 143 cycles.
This analysis is not particularly accurate. It does not account for
some architectures having several floating point units or for
speculative execution hiding latency or many other variations in
computer architecture. Still, it's good enough for a ballpark estimate
that says the typical case will be around 15-25% slower and the most
extreme cases will be around 70% slower.
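
To make the counting concrete, here is a rough sketch of the shape of
the added pre-tests for the fabs(c) < fabs(d) branch, using the limit
macros shown in the _divkc3.c hunk quoted earlier. This is a
simplified illustration, not the literal patch code; the other branch
is mirrored:

  if (FABS (d) >= RBIG)
    {
      /* As in the quoted hunk: the denominator is near the maximum
         representable, so halve all operands before forming
         (c * ratio) + d.  */
      a = a / 2; b = b / 2; c = c / 2; d = d / 2;
    }
  if ((FABS (a) < RMIN && FABS (b) < RMAX2 && FABS (d) < RMAX2)
      || (FABS (b) < RMIN && FABS (a) < RMAX2 && FABS (d) < RMAX2))
    {
      /* One numerator component is tiny: scale everything up by
         RMINSCAL (1/EPSILON) so a * ratio and b * ratio do not
         underflow; these are the four scaling multiplies counted
         above.  */
      a = a * RMINSCAL; b = b * RMINSCAL;
      c = c * RMINSCAL; d = d * RMINSCAL;
    }
  ratio = c / d;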

In the patch writeup, I report that for double precision the
"moderate set", which represents the typical case, shows a measured
slowdown of 4% to 24%. The "full set", which includes a substantial
number of values that can cause branch predictors to go wrong, costs
38% to 56% overhead. Apparently my 'back of the envelope' analysis
above is not too far off.

Oh, I failed to cover the case where fabs(ratio) < RMIN.
That happens in the full set about 1-2% of the time.
It requires two additional divide operations for an extra
30 cycles, plus maybe 10 cycles for the mispredicted branch.
That might bump the worst case up to about double the
normal case. That's acceptable when you consider that the slow
path prevents the code from returning a completely wrong result.

As a quick aside, the test for ratio underflowing reduces the error
rate by a factor of 4 from 1.70% to 0.35% in the full exponent range
data set with about half the total overhead of the new method. Adding
all the other tests and scaling code reduces errors larger than 12
bits to roughly 1 in 10 million and 18 bits or larger to roughly 1 in
100 million. Those remaining errors are due to the issue of
subtraction cancellation, which is a harder problem than avoiding
underflow/overflow issues.




Can you please share data for -Os between current and your proposed new?

As for data for -Os for current and proposed new, it's not very interesting.
There are no differences as complex divide is not inlined by gcc.
Instead, there is a simple call to __divdc3.
I checked for inlining with gcc 4.8, gcc 8.3, and gcc 11.0.1 as well as with
a compiler that has my code changes.

If one uses -fcx-fortran-rules then Smith's method for complex divide is 
inlined with -Os.
If one uses -fcx-limited-range then the naive method for complex divide 
is inlined with -Os.
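
A minimal probe illustrates this; the function below is just an
example, compiled with -S to inspect the generated code:

  /* With plain -Os this compiles to a call to __divdc3; with
     -Os -fcx-fortran-rules the Smith-method sequence is inlined, and
     with -Os -fcx-limited-range the naive sequence is inlined.  */
  _Complex double
  probe (_Complex double x, _Complex double y)
  {
    return x / y;
  }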

The new code does not change 

Re: [PATCH v9] Practical improvement to libgcc complex divide

2021-03-27 Thread Bernhard Reutner-Fischer via Gcc-patches
On Fri, 26 Mar 2021 23:14:41 +
Patrick McGehearty via Gcc-patches  wrote:

> Changes in Version 9 since Version 8:
> 
> Revised code to meet gcc coding standards in all files, especially
> with respect to adding spaces around operations and removing
> excess () in #define macro definitions.

Did you gather cycle counter diffs for current against new for the
normal and some edge cases?

Can you please share data for -Os between current and your proposed new?

TIA,


[PATCH v9] Practical improvement to libgcc complex divide

2021-03-26 Thread Patrick McGehearty via Gcc-patches
Changes in Version 9 since Version 8:

Revised code to meet gcc coding standards in all files, especially
with respect to adding spaces around operations and removing
excess () in #define macro definitions.

Major revision to gcc/testsuite/gcc.c-torture/execute/ieee/cdivchkld.c
Prior code was focused on x86 80-bit implementation of long
doubles. The new version provides tests for: IEEE 128-bit format,
80-bit format, IBM extended precision format (11-bit exponent,
106-bit mantissa), and when long double is treated as IEEE 64-bit
doubles (also 11-bit exponent). The limits tested are now based on
LDBL_MIN and LDBL_MANT_DIG which automatically adapts to the
appropriate format. The IEEE 128-bit and 80-bit formats share
input test values due to having matching 15-bit exponent sizes, as
do the IBM extended double and normal 64-bit double, which have
matching 11-bit exponent sizes.  The input values are accurate to the lesser
mantissa with zero fill for the extended length mantissas. When
the test with the smaller mantissa is active, the expected result
will automatically be truncated to match the available mantissa.
The size of the exponent (LDBL_MAX_EXP) is used to determine which
values are tested. The program was tested using x86 (80-bit),
aarch64 (128-bit), IBM extended format and IEEE 64-bit format.
Some adjustments were made due to the IBM extended format having
slightly less range at the bottom end than IEEE 64-bit format in
spite of having the same size exponent field. For all four cases,
the cdivchkld.c program will abort() under the old complex divide
and exit() under the new complex divide.

Correctness and performance test programs used during development of
this project may be found in the attachment to:
https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg254210.html

Summary of Purpose

This patch to libgcc/libgcc2.c __divdc3 provides an
opportunity to gain important improvements to the quality of answers
for the default complex divide routine (half, float, double, extended,
long double precisions) when dealing with very large or very small exponents.

The current code correctly implements Smith's method (1962) [2]
further modified by C99's requirements for dealing with NaN (not a
number) results. When working with input values where the exponents
are greater than *_MAX_EXP/2 or less than -(*_MAX_EXP)/2, results are
substantially different from the answers provided by quad precision
more than 1% of the time. This error rate may be unacceptable for many
applications that cannot a priori restrict their computations to the
safe range. The proposed method reduces the frequency of
"substantially different" answers by more than 99% for double
precision at a modest cost in performance.
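
As a concrete illustration (a hypothetical input, not taken from the
patch's test programs), values in the upper half of the exponent range
can make the existing code's intermediate denominator overflow:

  /* Old __divdc3 (Smith's method only): denom = (d * ratio) + c
     overflows to +inf here, so the quotient collapses to 0 + 0i.
     The correctly scaled answer is roughly 6.7e-309 + 0i.  */
  #include <stdio.h>
  #include <complex.h>

  int
  main (void)
  {
    volatile double k = 1.5e308;        /* keep the compiler from folding */
    _Complex double num = 1.0 + 1.0 * I;
    _Complex double den = k + k * I;
    _Complex double q = num / den;      /* calls __divdc3 */
    printf ("%a %a\n", creal (q), cimag (q));
    return 0;
  }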

Differences between current gcc methods and the new method will be
described. Then accuracy and performance differences will be discussed.

Background

This project started with an investigation related to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59714.  Study of Beebe[1]
provided an overview of past and recent practice for computing complex
divide. The current glibc implementation is based on Robert Smith's
algorithm [2] from 1962.  A Google search found the paper by Baudin
and Smith [3] (same Robert Smith) published in 2012. Elen Kalda's
proposed patch [4] is based on that paper.

I developed two sets of test data by randomly distributing values over
a restricted range and the full range of input values. The current
complex divide handled the restricted range well enough, but failed on
the full range more than 1% of the time. Baudin and Smith's primary
test for "ratio" equals zero reduced the cases with 16 or more error
bits by a factor of 5, but still left too many flawed answers. Adding
debug print out to cases with substantial errors allowed me to see the
intermediate calculations for test values that failed. I noted that
for many of the failures, "ratio" was a subnormal. Changing the
"ratio" test from a check for zero to a check for subnormal reduced the
16-bit error rate by another factor of 12. This single modified test
provides the greatest benefit for the least cost, but the percentage
of cases with greater than 16 bit errors (double precision data) is
still greater than 0.027% (2.7 in 10,000).

Continued examination of remaining errors and their intermediate
computations led to the various input value tests and scaling
to avoid under/overflow. The current patch does not handle some of the
rare and most extreme combinations of input values, but the random
test data is only showing 1 case in 10 million that has an error of
greater than 12 bits. That case has 18 bits of error and is due to
subtraction cancellation. These results are significantly better
than the results reported by Baudin and Smith.

Support for half, float, double, extended, and long double precision
is included as all are handled with suitable preprocessor