Re: [PATCH] libgcc: Thumb-1 Floating-Point Library for Cortex M0

Christophe Lyon via Gcc-patches Wed, 16 Dec 2020 09:16:07 -0800

On Wed, 2 Dec 2020 at 04:31, Daniel Engel <lib...@danielengel.com> wrote:
>
> Hi Christophe,
>
> On Thu, Nov 26, 2020, at 1:14 AM, Christophe Lyon wrote:
> > Hi,
> >
> > On Fri, 13 Nov 2020 at 00:03, Daniel Engel <lib...@danielengel.com> wrote:
> > >
> > > Hi,
> > >
> > > This patch adds an efficient assembly-language implementation of
> > > IEEE-754 compliant floating point routines for Cortex M0 EABI (v6m,
> > > thumb-1).  This is the libgcc portion of a larger library originally
> > > described in 2018:
> > >
> > >     https://gcc.gnu.org/legacy-ml/gcc/2018-11/msg00043.html
> > >
> > > Since that time, I've separated the libm functions for submission to
> > > newlib.  The remaining libgcc functions in the attached patch have
> > > the following characteristics:
> > >
> > >     Function(s)                     Size (bytes)        Cycles          Stack   Accuracy
> > >     __clzsi2                        42                  23              0       exact
> > >     __clzsi2 (OPTIMIZE_SIZE)        22                  55              0       exact
> > >     __clzdi2                        8+__clzsi2          4+__clzsi2      0       exact
> > >
> > >     __umulsidi3                     44                  24              0       exact
> > >     __mulsidi3                      30+__umulsidi3      24+__umulsidi3  8       exact
> > >     __muldi3 (__aeabi_lmul)         10+__umulsidi3      6+__umulsidi3   0       exact
> > >     __ashldi3 (__aeabi_llsl)        22                  13              0       exact
> > >     __lshrdi3 (__aeabi_llsr)        22                  13              0       exact
> > >     __ashrdi3 (__aeabi_lasr)        22                  13              0       exact
> > >
> > >     __aeabi_lcmp                    20                  13              0       exact
> > >     __aeabi_ulcmp                   16                  10              0       exact
> > >
> > >     __udivsi3 (__aeabi_uidiv)       56                  72 – 385        0       < 1 lsb
> > >     __divsi3 (__aeabi_idiv)         38+__udivsi3        26+__udivsi3    8       < 1 lsb
> > >     __udivdi3 (__aeabi_uldiv)       164                 103 – 1394      16      < 1 lsb
> > >     __udivdi3 (OPTIMIZE_SIZE)       142                 120 – 1392      16      < 1 lsb
> > >     __divdi3 (__aeabi_ldiv)         54+__udivdi3        36+__udivdi3    32      < 1 lsb
> > >
> > >     __shared_float                  178
> > >     __shared_float (OPTIMIZE_SIZE)  154
> > >
> > >     __addsf3 (__aeabi_fadd)         116+__shared_float  31 – 76         8       <= 0.5 ulp
> > >     __addsf3 (OPTIMIZE_SIZE)        112+__shared_float  74              8       <= 0.5 ulp
> > >     __subsf3 (__aeabi_fsub)         8+__addsf3          6+__addsf3      8       <= 0.5 ulp
> > >     __aeabi_frsub                   8+__addsf3          6+__addsf3      8       <= 0.5 ulp
> > >     __mulsf3 (__aeabi_fmul)         112+__shared_float  73 – 97         8       <= 0.5 ulp
> > >     __mulsf3 (OPTIMIZE_SIZE)        96+__shared_float   93              8       <= 0.5 ulp
> > >     __divsf3 (__aeabi_fdiv)         132+__shared_float  83 – 361        8       <= 0.5 ulp
> > >     __divsf3 (OPTIMIZE_SIZE)        120+__shared_float  263 – 359       8       <= 0.5 ulp
> > >
> > >     __cmpsf2/__lesf2/__ltsf2        72                  33              0       exact
> > >     __eqsf2/__nesf2                 4+__cmpsf2          3+__cmpsf2      0       exact
> > >     __gesf2/__gtsf2                 4+__cmpsf2          3+__cmpsf2      0       exact
> > >     __unordsf2 (__aeabi_fcmpun)     4+__cmpsf2          3+__cmpsf2      0       exact
> > >     __aeabi_fcmpeq                  4+__cmpsf2          3+__cmpsf2      0       exact
> > >     __aeabi_fcmpne                  4+__cmpsf2          3+__cmpsf2      0       exact
> > >     __aeabi_fcmplt                  4+__cmpsf2          3+__cmpsf2      0       exact
> > >     __aeabi_fcmple                  4+__cmpsf2          3+__cmpsf2      0       exact
> > >     __aeabi_fcmpge                  4+__cmpsf2          3+__cmpsf2      0       exact
> > >
> > >     __floatundisf (__aeabi_ul2f)    14+__shared_float   40 – 81         8       <= 0.5 ulp
> > >     __floatundisf (OPTIMIZE_SIZE)   14+__shared_float   40 – 237        8       <= 0.5 ulp
> > >     __floatunsisf (__aeabi_ui2f)    0+__floatundisf     1+__floatundisf 8       <= 0.5 ulp
> > >     __floatdisf (__aeabi_l2f)       14+__floatundisf    7+__floatundisf 8       <= 0.5 ulp
> > >     __floatsisf (__aeabi_i2f)       0+__floatdisf       1+__floatdisf   8       <= 0.5 ulp
> > >
> > >     __fixsfdi (__aeabi_f2lz)        74                  27 – 33         0       exact
> > >     __fixunssfdi (__aeabi_f2ulz)    4+__fixsfdi         3+__fixsfdi     0       exact
> > >     __fixsfsi (__aeabi_f2iz)        52                  19              0       exact
> > >     __fixsfsi (OPTIMIZE_SIZE)       4+__fixsfdi         3+__fixsfdi     0       exact
> > >     __fixunssfsi (__aeabi_f2uiz)    4+__fixsfsi         3+__fixsfsi     0       exact
> > >
> > >     __extendsfdf2 (__aeabi_f2d)     42+__shared_float   38              8       exact
> > >     __aeabi_d2f                     56+__shared_float   54 – 58         8       <= 0.5 ulp
> > >     __aeabi_h2f                     34+__shared_float   34              8       exact
> > >     __aeabi_f2h                     84                  23 – 34         0       <= 0.5 ulp
> > >
> > > Copyright assignment is on file with the FSF.
> > >
> > > I've built the gcc-arm-none-eabi cross-compiler using the 20201108
> > > snapshot of GCC plus this patch, and successfully compiled a test
> > > program:
> > >
> > >     extern int main (void)
> > >     {
> > >         volatile int x = 1;
> > >         volatile unsigned long long int y = 10;
> > >         volatile long long int z = x / y; // 64-bit division
> > >
> > >         volatile float a = x; // 32-bit casting
> > >         volatile float b = y; // 64 bit casting
> > >         volatile float c = z / b; // float division
> > >         volatile float d = a + c; // float addition
> > >         volatile float e = c * b; // float multiplication
> > >         volatile float f = d - e - c; // float subtraction
> > >
> > >         if (f != c) // float comparison
> > >             y -= (long long int)d; // float casting
> > >     }
> > >
> > > As one point of comparison, the test program links to 876 bytes of
> > > libgcc code from the patched toolchain, vs 10276 bytes from the
> > > latest released gcc-arm-none-eabi-9-2020-q2 toolchain.  That's a
> > > 90% size reduction.
> >
> > This looks awesome!
> >
> > >
> > > I have extensive test vectors, and have passed these tests on an
> > > STM32F051.  These vectors were derived from UCB [1], Testfloat [2],
> > > and IEEECC754 [3] sources, plus some of my own creation.
> > > Unfortunately, I'm not sure how "make check" should work for a cross
> > > compiler run time library.
> > >
> > > Although I believe this patch can be incorporated as-is, there are
> > > at least two points that might bear discussion:
> > >
> > > * I'm not sure where or how they would be integrated, but I would be
> > >   happy to provide sources for my test vectors.
> > >
> > > * The library is currently built for the ARM v6m architecture only.
> > >   It is likely that some of the other Cortex variants would benefit
> > >   from these routines.  However, I would need some guidance on this
> > >   to proceed without introducing regressions.  I do not currently
> > >   have a test strategy for architectures beyond Cortex M0, and I
> > >   have NOT profiled the existing thumb-2 implementations
> > >   (ieee754-sf.S) for comparison.
> >
> > I tried your patch, and I see many regressions in the GCC testsuite
> > because many tests fail to link with errors like:
> > ld: /gcc/thumb/v6-m/nofp/libgcc.a(_arm_cmpdf2.o): in function `__clzdi2':
> > /libgcc/config/arm/cm0/clz2.S:39: multiple definition of `__clzdi2';
> > /gcc/thumb/v6-m/nofp/libgcc.a(_thumb1_case_sqi.o):/libgcc/config/arm/cm0/clz2.S:39:
> > first defined here
> >
> > This happens with a toolchain configured with --target arm-none-eabi,
> > default cpu/fpu/mode,
> > --enable-multilib --with-multilib-list=rmprofile and running the tests with
> > -mthumb/-mcpu=cortex-m0/-mfloat-abi=soft/-march=armv6s-m
> >
> > Does it work for you?
>
> Thanks for the feedback.
>
> I'm afraid I'm quite ignorant as to the gcc test suite infrastructure,
> so I don't know how to use the options you've shared above.  I'm
> cross-compiling the Windows toolchain on Ubuntu.  Would you mind sharing
> a full command line you would use for testing?  The toolchain is built
> with the default options, which include "--target arm-none-eabi".
>


Why put Windows in the picture? This seems unnecessarily complicated...
I suggest you build your cross-toolchain on x86_64 Ubuntu and run it
on x86_64 Ubuntu (targeting Arm, of course).

The above options were GCC configure options, except for the last
one, which I used when running the tests.

There is some documentation on how to run the GCC testsuite here:
https://gcc.gnu.org/install/test.html

Basically 'make check' should mostly work except for execution tests
for which you'll need to teach DejaGnu how to run the generated programs
on a real board or on a simulator.

I didn't analyze your patch; I just submitted it to my validation system:
https://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/r11-5993-g159b0bd9ce263dfb791eff5133b0ca0207201c84-cortex-m0-fplib-20201130.patch2/report-build-info.html
- the red "regressed" items indicate regressions in the testsuite. You
can click on "log" to download the corresponding gcc.log
- the dark-red "build broken" items indicate that the toolchain build failed
- the orange "interrupted" items indicate an infrastructure problem,
so you can ignore such cases
- similarly, the dark-red "ref build failed" items indicate that the
reference build failed for some infrastructure reason

For the arm-none-eabi target, several toolchain builds fail, while
others succeed.
This is because I use different multilib configuration flags; it looks like
the ones involving --with-multilib-list=rmprofile are broken with your patch.

These ones should be reasonably easy to fix: no 'make check' involved.

For instance if you configure GCC with:
--target arm-none-eabi --enable-multilib --with-multilib-list=rmprofile
you should see the build failure.

HTH

Christophe

> I did see similar errors once before.  It turned out then that I had
> omitted one of the ".S" files from the build.  My interpretation at that
> point was that gcc had been searching multiple versions of "libgcc.a"
> and was unable to merge the symbols.  In hindsight, that was a really
> bad interpretation.  I was able to reproduce the error above by simply
> adding a line like "volatile double m = 1.0; m += 2;".
>
> After reviewing the existing asm implementations more closely, I
> believe that I have not been using the function guard macros (L_arm_*)
> as intended.  The make script appears to compile "lib1funcs.S" dozens of
> times -- once for each function guard macro listed in LIB1ASMFUNCS --
> with the intent of generating a separate ".o" file for each function.
> Because they were unguarded, my new library functions were duplicated
> into every ".o" file, which caused the link errors you saw.
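
For readers unfamiliar with the pattern described above, here is a minimal C sketch of per-function guard macros.  The file name "lib1demo.c", the macros L_demo_add / L_demo_sub, and both functions are hypothetical; the real "lib1funcs.S" is assembly and is guarded by the names listed in LIB1ASMFUNCS.  The fallback that enables every guard is only there so the sketch builds standalone.

```c
/* lib1demo.c: hypothetical stand-in for lib1funcs.S.
   The build compiles this one file once per guard macro, e.g.
     cc -DL_demo_add -c lib1demo.c -o _demo_add.o
     cc -DL_demo_sub -c lib1demo.c -o _demo_sub.o
   so that each object file defines exactly one function. */

#if !defined(L_demo_add) && !defined(L_demo_sub)
/* No guard given on the command line: enable everything so the file
   also builds standalone (assumption made for this sketch only). */
#define L_demo_add
#define L_demo_sub
#endif

#ifdef L_demo_add
int demo_add(int a, int b) { return a + b; }
#endif

#ifdef L_demo_sub
int demo_sub(int a, int b) { return a - b; }
#endif

/* A definition placed outside any guard would be emitted into every
   object built from this file.  Any link that pulls in two of those
   objects then sees the same symbol defined twice. */
```

With guards in place, the archive holds one definition per symbol, and the linker extracts only the members a program actually references.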
>
> I have attached an updated patch that implements the macros.
>
> However, I'm not sure whether my usage is really consistent with the
> spirit of the make script.  If there's a README or HOWTO, I haven't
> found it yet.  The following points summarize my concerns as I was
> making these updates:
>
> 1.  While some of the new functions (e.g. __cmpsf2) are standalone,
>     there is a common core in the new library shared by several related
>     functions.  That keeps the library small.  For now, I've elected to
>     group all of these related functions together in a single object
>     file "_arm_addsubsf3.o" to protect the short branches (+/-2KB)
>     within this unit.  Notice that I manually assigned section names in
>     the code, so there still shouldn't be any unnecessary code linked in
>     the final build.  Does the multiple-".o" files strategy predate
>     "-gc-sections", or should I be trying harder to break these related
>     functions into separate compilation units?
>
> 2.  I introduced a few new macro keywords for functions/groups (e.g.
>     "_arm_f2h" and "_arm_f2h").  My assumption is that some empty ".o"
>     files compiled for the non-v6m architectures will be benign.
>
> 3.  The "t-elf" make script implies that __mulsf3() should not be
>     compiled in thumb mode (it's inside a conditional), but this is one
>     of the new functions.  Moot for now, since my __mulsf3() is grouped
>     with the common core functions (see point 1) and is thus currently
>     guarded by the "_arm_addsubsf3.o" macro.
>
> 4.  The advice (in "ieee754-sf.S") regarding WEAK symbols does not seem
>     to be working.  I have defined __clzsi2() as a weak symbol to be
>     overridden by the combined function __clzdi2().  I can also see
>     (with "nm") that "clzsi2.o" is compiled before "clzdi2.o" in
>     "libgcc.a".  Yet, the full __clzdi2() function (8 bytes larger) is
>     always linked, even in programs that only call __clzsi2().  A minor
>     annoyance at this point.
>
> 5.  Is there a permutation of the makefile that compiles libgcc with
>     __OPTIMIZE_SIZE__?  There are a few sections in the patch that can
>     optimize either way, yet the final product only seems to have the
>     "fast" code.  At this optimization level, the sample program above
>     pulls in 1012 bytes of library code instead of 836. Perhaps this is
>     meant to be controlled by the toolchain configuration step, but it
>     doesn't follow that the optimization for the cross-compiler would
>     automatically translate to the target runtime libraries.
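
As background for point 5: GCC predefines __OPTIMIZE_SIZE__ when invoked with -Os, so a source file can choose between variants at compile time.  The following is a minimal C sketch of such a dual-path routine.  The name demo_clz32 and both bodies are illustrative assumptions, not the code from the patch.

```c
#include <stdint.h>

/* Count leading zeros of a 32-bit value; returns 32 for x == 0.
   Hypothetical illustration only, not the patch's __clzsi2. */
unsigned demo_clz32(uint32_t x)
{
    unsigned n = 0;

    if (x == 0)
        return 32;

#ifdef __OPTIMIZE_SIZE__
    /* Small variant: shift one bit at a time until the top bit is set. */
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
#else
    /* Fast variant: binary search over successively smaller fields. */
    if ((x & 0xFFFF0000u) == 0) { n += 16; x <<= 16; }
    if ((x & 0xFF000000u) == 0) { n += 8;  x <<= 8;  }
    if ((x & 0xF0000000u) == 0) { n += 4;  x <<= 4;  }
    if ((x & 0xC0000000u) == 0) { n += 2;  x <<= 2;  }
    if ((x & 0x80000000u) == 0) { n += 1; }
#endif

    return n;
}
```

Compiling with -Os selects the small variant; any other optimization level selects the fast one.  Whether the libgcc makefiles pass -Os for a given multilib is a separate matter, which is what point 5 asks.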
>
> Thanks again,
> Daniel
>
> >
> > Thanks,
> >
> > Christophe
> >
> > >
> > > I'm naturally hoping for some action on this patch before the Nov 16th 
> > > deadline for GCC-11 stage 3.  Please review and advise.
> > >
> > > Thanks,
> > > Daniel Engel
> > >
> > > [1] http://www.netlib.org/fp/ucbtest.tgz
> > > [2] http://www.jhauser.us/arithmetic/TestFloat.html
> > > [3] http://win-www.uia.ac.be/u/cant/ieeecc754.html
> >
