It took some work, but I think I've responded to all the issues raised here.
Patch V5 coming right after this mail.
On 11/16/2020 8:34 PM, Joseph Myers wrote:
On Tue, 8 Sep 2020, Patrick McGehearty via Gcc-patches wrote:
This project started with an investigation related to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59714. Study of Beebe[1]
provided an overview of past and recent practice for computing complex
divide. The current glibc implementation is based on Robert Smith's
algorithm [2] from 1962. A google search found the paper by Baudin
and Smith [3] (same Robert Smith) published in 2012. Elen Kalda's
proposed patch [4] is based on that paper.
Thanks, I've now read Baudin and Smith so can review the patch properly.
I'm fine with the overall algorithm, so my comments generally relate to
how the code should best be integrated into libgcc while keeping it
properly machine-mode-generic as far as possible.
I developed two sets of tests by randomly distributing values over
a restricted range and the full range of input values. The current
Are these tests available somewhere?
After some polishing, the development tests are now ready to share.
I've got them in a single directory: a README, 47 mostly small .c files,
and various scripts for running tests, along with sample outputs from
all the tests; the tarball totals about 0.5 MB. These tests are intended
for developing and comparing different complex divide algorithms.
They are NOT intended or structured for routine compiler testing.
The complex divide compiler tests are in the accompanying patch
and are discussed later in this note.
Support for half, float, double, extended, and long double precision
is included as all are handled with suitable preprocessor symbols in a
single source routine. Since half precision is computed with float
precision as per current libgcc practice, the enhanced algorithm
provides no benefit for half precision and would cost performance.
Therefore half precision is left unchanged.
The existing constants for each precision:
float: FLT_MAX, FLT_MIN;
double: DBL_MAX, DBL_MIN;
extended and/or long double: LDBL_MAX, LDBL_MIN
are used for avoiding the more common overflow/underflow cases.
In general, libgcc code works with modes, not types; hardcoding references
to a particular mapping between modes and types is problematic. Rather,
the existing code in c-cppbuiltin.c that has a loop over modes should be
extended to provide whatever information is needed, as macros defined for
each machine mode.
/* For libgcc-internal use only. */
if (flag_building_libgcc)
{
/* Properties of floating-point modes for libgcc2.c. */
opt_scalar_float_mode mode_iter;
FOR_EACH_MODE_IN_CLASS (mode_iter, MODE_FLOAT)
{
[...]
For example, that defines macros such as __LIBGCC_DF_FUNC_EXT__ and
__LIBGCC_DF_MANT_DIG__. The _FUNC_EXT__ definition involves that code
computing a mapping to types.
I'd suggest defining additional macros such as __LIBGCC_DF_MAX__ in the
same code - for each supported floating-point mode. They can be defined
to __FLT_MAX__ etc. (the predefined macros rather than the ones in
float.h) - the existing code that computes a suffix for functions can be
adjusted so it also computes the string such as "FLT", "DBL", "LDBL",
"FLT128" etc.
(I suggest defining to __FLT_MAX__ rather than to the expansion of
__FLT_MAX__ because that avoids any tricky interactions with the logic to
compute such expansions lazily. I suggest __FLT_MAX__ rather than the
FLT_MAX name from float.h because that way you avoid any need to define
feature test macros to access names such as FLT128_MAX.)
After some study, I've done my best to follow your recommendations
for using modes. I've defined __LIBGCC_xx_yyy__, where xx is SF, DF,
XF, TF and yyy is MIN, MAX, and EPSILON in c-cppbuiltin.c. SF uses FLT,
DF uses DBL, and XF and TF use LDBL. There is no need for those values
in HF mode because the HF code always uses SF precision.
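For concreteness, the per-mode macros just described might expand along these
lines (a hypothetical sketch; the authoritative spellings are whatever
c-cppbuiltin.c actually emits when building libgcc):

```c
/* Hypothetical sketch of the per-mode macros predefined by the compiler
   when building libgcc.  Each libgcc-internal macro forwards to the
   corresponding predefined __FLT_*__/__DBL_*__ macro, so no float.h
   feature-test macros are needed.  */
#define __LIBGCC_SF_MAX__     __FLT_MAX__
#define __LIBGCC_SF_MIN__     __FLT_MIN__
#define __LIBGCC_SF_EPSILON__ __FLT_EPSILON__
#define __LIBGCC_DF_MAX__     __DBL_MAX__
#define __LIBGCC_DF_MIN__     __DBL_MIN__
#define __LIBGCC_DF_EPSILON__ __DBL_EPSILON__
```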
diff --git a/gcc/c-family/c-cppbuiltin.c b/gcc/c-family/c-cppbuiltin.c
index 74ecca8..02c06d8 100644
--- a/gcc/c-family/c-cppbuiltin.c
+++ b/gcc/c-family/c-cppbuiltin.c
@@ -1343,6 +1343,11 @@ c_cpp_builtins (cpp_reader *pfile)
builtin_define_with_value ("__LIBGCC_INIT_SECTION_ASM_OP__",
INIT_SECTION_ASM_OP, 1);
#endif
+ /* For libgcc float/double optimization */
+#ifdef HAVE_adddf3
+ builtin_define_with_int_value ("__LIBGCC_HAVE_HWDBL__",
+ HAVE_adddf3);
+#endif
This is another thing to handle more generically - possibly with something
like the mode_has_fma function, and defining a macro for each mode, named
after the mode, rather than only for DFmode. For an alternative, see the
discussion below.
Defining this value generically but not using or testing it seems
more likely to invite future issues when someone tries to use it,
especially since I have no knowledge of how to test for the
presence or absence of hardware support for modes other than
hardware double precision. I have taken no action on this comment.
diff --git a/libgcc/ChangeLog b/libgcc/ChangeLog
index ccfd6f6..8bd66c5 100644
--- a/libgcc/ChangeLog
+++ b/libgcc/ChangeLog
@@ -1,3 +1,10 @@
+2020-08-27 Patrick McGehearty <patrick.mcgehea...@oracle.com>
+
+ * libgcc2.c (__divsc3, __divdc3, __divxc3, __divtc3): Enhance
+ accuracy of complex divide by avoiding underflow/overflow when
+ ratio underflows or when arguments have very large or very
+ small exponents.
Note that diffs to ChangeLog files should not now be included in patches;
the ChangeLog content needs to be included in the proposed commit message
instead for automatic ChangeLog generation.
I have moved info for ChangeLog files into the commit message.
I believe I have the format for the ChangeLog entry correct.
+#if defined(L_divsc3)
+#define RBIG ((FLT_MAX)/2.0)
+#define RMIN (FLT_MIN)
+#define RMIN2 (0x1.0p-21)
+#define RMINSCAL (0x1.0p+19)
+#define RMAX2 ((RBIG)*(RMIN2))
+#endif
I'd expect all of these to use generic macros based on the mode. For the
division by 2.0, probably also divide by integer 2 not 2.0 to avoid
unwanted conversions to/from double.
DONE
+#if defined(L_divdc3)
+#define RBIG ((DBL_MAX)/2.0)
+#define RMIN (DBL_MIN)
+#define RMIN2 (0x1.0p-53)
+#define RMINSCAL (0x1.0p+51)
A plausible definition of RMIN2 might be based on the EPSILON macro for
the mode (with RMINSCAL then being defined in terms of RMIN2). But you're
defining RMIN2 for SCmode to a value equal to 4 *
FLT_EPSILON, while for DCmode you're using DBL_EPSILON / 2. Why the
difference?
In experimental testing, the number of remaining failures did not change
significantly when the RMIN2 value was adjusted up or down slightly.
Additional thought led me to realize that the primary value of this
optimization comes from scaling subnormals up to normal. Using
EPSILON is the correct way to achieve that goal.
Further testing showed that setting RMIN2 to FLT_EPSILON or DBL_EPSILON
as appropriate gave results very slightly better than those reported
earlier.
RMIN2 is now [FLT/DBL]_EPSILON and RMINSCAL is 1/EPSILON.
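Putting those choices together, the mode-generic constants for the SCmode
case might then read as follows (a sketch assuming the __LIBGCC_SF_*__
spellings described above, not the literal patch text):

```c
/* Sketch: mode-generic scaling constants for __divsc3, assuming
   __LIBGCC_SF_*__ macros keyed on the machine mode.  Dividing by
   integer 2 avoids an unwanted conversion to double.  */
#if defined(L_divsc3)
# define RBIG     (__LIBGCC_SF_MAX__ / 2)
# define RMIN     (__LIBGCC_SF_MIN__)
# define RMIN2    (__LIBGCC_SF_EPSILON__)
# define RMINSCAL (1 / __LIBGCC_SF_EPSILON__)
# define RMAX2    (RBIG * RMIN2)
#endif
```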
+#if defined(L_divhc3)
+ /* half precision is handled with float precision.
+ no extra measures are needed to avoid overflow/underflow */
Note for comment style that each sentence should start with an uppercase
letter (unless that would be incorrect, e.g. for "float" meaning the C
type) and comments should end with ". " (two spaces after '.').
DONE
You're duplicating essentially the same code for HCmode, and for SCmode
when hardware DFmode is available. I think that's a bad idea. Rather, in
one place you should conditionally define a macro for "wider-range
floating-point type to use for computations in these functions" (the
relevant thing being that it has at least twice (plus a bit) the exponent
range, rather than extra precision). This would use types such as SFtype
or DFtype, not float or double, as elsewhere in libgcc. And if that macro
is defined, then use this simple implementation instead of the more
complicated one.
That leaves the question of how to define the macro. The simple approach
is certainly to define it to SFtype in the HCmode case (if there is
hardware SFmode) and to DFtype in the SCmode case (if there is hardware
DFmode). A more complicated approach involves using various macros to
examine properties of various possible wider-range modes to attempt to
identify one that could be used (probably new macros from c-cppbuiltin.c
rather than following the existing code elsewhere in libgcc2.c using
AVOID_FP_TYPE_CONVERSION, given the incomplete coverage of definitions of
WIDEST_HARDWARE_FP_SIZE). But hardcoding two cases would reduce the risk
of unexpected results for now.
You are correct and the code for the two cases has been merged.
I defined XMTYPE and XCTYPE to be the next higher precision
for MTYPE and CTYPE. The code is cleaner and easier to read now.
I also wrote new tests to compare the simple implementation
(real = a*c + b*d, imag = b*c - a*d) to Smith's method. With the
higher precision, the simple method gives the correct answers with
modestly better performance. I've changed the code to use
the simple method.
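As a standalone illustration of the simple method in a wider-range type, the
following sketch uses double for float operands (the patch itself uses the
mode-based XMTYPE/XCTYPE names; the function name here is hypothetical):

```c
#include <assert.h>
#include <complex.h>
#include <math.h>

/* Sketch: divide (a + b*i) by (c + d*i) by promoting to double, whose
   exponent range is more than twice that of float, so the intermediate
   products and the denominator c*c + d*d cannot overflow or underflow
   into a wrong float result.  */
static float complex
divsc3_wide (float a, float b, float c, float d)
{
  double denom = (double) c * c + (double) d * d;
  double re = ((double) a * c + (double) b * d) / denom;
  double im = ((double) b * c - (double) a * d) / denom;
  return CMPLXF ((float) re, (float) im);
}
```

With c = 0x1p126 and d = 0x1p125, for example, c*c + d*d overflows in float
but is exactly representable in double, so the quotient rounds correctly.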
In any case, note libgcc/config/rs6000/_divkc3.c should have similar
changes to those in libgcc2.c (to implement the more complicated
algorithm, as no wider mode is available there).
DONE as copy of the libgcc2 code but not subjected to my extended tests.
+ /* Scale by max(c,d) to reduce chances of denominator overflowing */
+ if (FABS (c) < FABS (d))
The comment seems questionable in the code using the simpler algorithm.
And do you need this "if" and two cases at all there?
In the longer code segment, the comment is changed to:
"Prevent underflow when denominator is near max representable."
In the half-precision-using-float and float-using-double segments,
as noted above, I switched to using the simple implementation.
+ /* prevent underflow when denominator is near max representable */
+ if (FABS (d) >= RBIG)
+ {
+ a = a * 0.5;
+ b = b * 0.5;
+ c = c * 0.5;
+ d = d * 0.5;
This sort of thing might best use "/ 2" (integer 2) to avoid unwanted use
of double.
DONE
+ {
+ if(((FABS (a) < RMIN) && (FABS (b) < RMAX2) && (FABS (d) < RMAX2)) ||
+ ((FABS (b) < RMIN) && (FABS (a) < RMAX2) && (FABS (d) < RMAX2)))
GNU style breaks lines before operators such as "||", not after. Missing
space after "if".
DONE
I think the patch should add some testcases to the GCC testsuite that
verify expected results for particular inputs to within +/- a few ulp (and
that are checked to fail before and pass after the patch), to illustrate
the sort of cases the patch improves. Those will need to take the inputs
from volatile variables to ensure the evaluation is at runtime, and it may
be easiest to write such tests only for the cases of _Complex float and
_Complex double to avoid complications making them portable when testing
other types.
I created three tests: cdivchkf.c, cdivchkd.c, and cdivchkld.c. They exit
with 0 if they pass and abort if they fail. Each tests a small number
of values. The long double tests work correctly with either 80-bit or
128-bit IEEE formats. Maybe I should omit the long double test,
but it was easy to add.
I was not sure where to place them, as there is no testsuite for libgcc.
The next most likely spot seemed to be
gcc/testsuite/gcc.c-torture/execute/ieee, so I placed them there.
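For reference, the shape of one such runtime check might be sketched as
follows (hypothetical function name and sample values, not the committed
test text):

```c
#include <assert.h>
#include <complex.h>
#include <float.h>
#include <math.h>

/* Sketch of a runtime complex-divide check in the style requested:
   the operands come from volatile variables so the compiler cannot
   fold the division at compile time, and the result must match the
   expected value to within a few ulp.  */
static int
check_divide_sc (float ar, float ai, float br, float bi,
                 float er, float ei)
{
  volatile float var = ar, vai = ai, vbr = br, vbi = bi;
  _Complex float num = CMPLXF (var, vai);
  _Complex float den = CMPLXF (vbr, vbi);
  _Complex float q = num / den;   /* performed at run time via __divsc3 */
  return fabsf (crealf (q) - er) <= 8 * FLT_EPSILON * fabsf (er)
      && fabsf (cimagf (q) - ei) <= 8 * FLT_EPSILON * fabsf (ei);
}
```

Exactly representable inputs such as 0x1.8p126f and 0x1p125f keep the
mathematically exact quotient simple (here 1.4 - 0.2i), so the tolerance
only has to absorb the rounding of the divide algorithm itself.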
I look forward to your further thoughts.
The patch will be sent immediately after this message.
- Patrick