Re: [sqlite] Question about floating point

Dennis Clarke Tue, 18 Dec 2018 14:48:00 -0800

On 12/18/18 6:01 AM, R Smith wrote:> On 2018/12/17 11:53 PM, DennisClarke wrote:

>>
>> This thread is getting out of hand. Firstly there is no such binary
>> representation ( in this universe ) for a trivial decimal number such as
>> one tenth ( 0.10 ) and really folks should refer to the text book
>> recently published ( 2nd Edition actually ) where all this is covered
>> : //....
>
>
> My good man, did the discussion really irritate you ...
[WARNING : written with a smile ]


Well, I guess the real issue is that I see fairly baseline stuff going
in circles over and over and over and over and very little clarity. I
merely posted that textbook because it really was written by the experts
and I have done a fair amount of emails in life back and forth to a few
of the authors on various topics who always cleared the air. They really
are the experts and the only name missing from that list is the great
and dreaded William Kahan himself.  I say that with a smile as Professor
Emeritus of Mathematics Kahan is well known to write very bluntly about
people who have not a clue about trivial things. Trivial to him. The
rest of us merely try to catch up and get a solid understanding of the
basics which, as I was saying, have been covered over and over and over
and over in circles over and over and it gets ... annoying to walk into
a store and hear Beethoven's Ninth Symphony played over dirty cheap
speakers.  Certainly when I have heard live performances a few times in
my life. Very irritating is the word.

I see this sort of thing happen from time to time and I have to take the
approach of 'care' or 'do not care'.  In the case of the store with bad
speakers the option is 'do not care' and simply accept the noise. In the
case were good people can be led down a wrong path and then fall into a
pitfall or trap I feel 'care' happens.  I am not a sociopath nor some
old greybeard UNIX geek that merely enjoys retirement too much to 'care'
anymore.

So let's play a little game based on an early lesson from William Kahan
which will demonstrate how poorly floating point works when used with
nothing but blind trust in bit games.  Also we will assume that we are
going to play by the rules of his IEEE 754 and I may make a passing
reference to these three documents :

    1 ) IEEE 754-2008 - IEEE Standard for Floating-Point Arithmetic
        https://standards.ieee.org/standard/754-2008.html

    2 ) Formal Verification of Floating-Point Hardware Design
        https://www.springer.com/us/book/9783319955124

    3 ) Handbook of Floating-Point Arithmetic
        https://www.springer.com/us/book/9783319765259

    4 ) The Art of Computer Programming
        https://cs.stanford.edu/~knuth/taocp.html

    5 ) Oral history interview with Donald E. Knuth
        Charles Babbage Institute, 2001
        https://conservancy.umn.edu/handle/11299/107413

    6 ) How Futile are Mindless Assessments of Roundoff
        in Floating-Point Computation ?
        Prof William Kahan
        https://people.eecs.berkeley.edu/~wkahan/Mindless.pdf

Also perhaps ISO/IEC/IEEE 60559:2011 and working group ISO/IECJTC1/SC22/WG11 publications.


Let's begin with a quick and incomplete definition of "floating point"
data representation thus :

    Given a radix B with precision p we express a 'floting point'
    number in the format

        [  (+/-)m_0 . m_1 m_2 m_3 ... m_(p-1)  ]  *  B^e

    where we call e the exponent which is always an integer and the
    expression m_0 . m_1  m_2 m_3 ... m_(p-1) will be called the not
    so friendy word "significand" which is expressed in radix B.

A complete and formal definition will be found in chapter 3 of reference
(3) above. This is not a new idea and in fact is quite old. One may find
a fairly nice history of "floating point" in computing machines such as
the Babbage difference engine ( see Charles Babbage et. al. ) and other
machines that performed numerical computation in (4) and (5) above.
Suffice it to say that radix 60 mathematics was common in the Babylonian
history and the Yale Babylonian Collection provides a tablet with an
approximation of the positive square root of two with four sexagesimal
digits 1, 24, 51, 10.  Numerical data representation is not new at all
however I personally fell into the problem in while working on long term
trajectory computations with early Apollo systems. We had not yet been
able to establish a formal method to detect and handle numerical error
conditions and the much loved William Kahan provides us this rather
trivial example to illustrate:

    Given   u_0 = 2  and   u_1 = -4   we then compute

        u_[n] = 111 - 1130/(u[n-1]) + 3000/(u[n-1]*u[n-2])

    where n > 1 clearly.


Trivial :


    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define LOOPCNT 30

    int main (int argc, char *argv[])
    {
        long double u[LOOPCNT];
        int j;

        u[0] = (long double)2.0L;
        u[1] = (long double)-4.0L;

        printf("u[00] = %+20.18Lf\n", u[0]);
        printf("u[01] = %+20.18Lf\n", u[1]);

        for (j = 2; j < LOOPCNT; j++){
            u[j] =   (long double)111.0L
                   - (long double)1130.0L/u[j-1]
                   + (long double)3000.0L/(u[j-1]*u[j-2]);
            printf("u[%02i] = %+20.18Lf\n", j, u[j]);
        }
        return EXIT_SUCCESS;
    }

$ $CC $CFLAGS -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -o foo foo.c
$ ./foo
u[00] = +2.000000000000000000
u[01] = -4.000000000000000000
u[02] = +18.500000000000000000
u[03] = +9.378378378378378379
u[04] = +7.801152737752161387
u[05] = +7.154414480975249402
u[06] = +6.806784736923633658
u[07] = +6.592632768704448309
u[08] = +6.449465933790438615
u[09] = +6.348452056656695458
u[10] = +6.274438598253189794
u[11] = +6.218695740390240090
u[12] = +6.175837314378373112
u[13] = +6.142359234424345102
u[14] = +6.115885561182135076
u[15] = +6.094780237887624276
u[16] = +6.078391827712722059
u[17] = +6.074956741464125884
u[18] = +6.234084939357549628
u[19] = +8.953054978972347186
u[20] = +38.535952308620171281
u[21] = +90.372020211576035841
u[22] = +99.357562247636820436
u[23] = +99.961042723675680582
u[24] = +99.997653560600116965
u[25] = +99.999858805729773391
u[26] = +99.999991508101661861
u[27] = +99.999999489474548205
u[28] = +99.999999969317897322
u[29] = +99.999999998156545070
$

So there we have it.  The sequence merely settles in at 100 over time.
However one may sit down with pen and paper and easily arrive at the
correct answer of 6 and then everyone should question how a piece of
computing machiney can be so vast horribly wrong with such a trivial
calculation.  Please note that most x86_64 hardware does a terrible mess
with the above and there are far better floating point implementations
in IBM Power and Fujitsu SPARC and other risc architectures.  We are
working on RISC-V with native 128 bit precision today and perhaps we may
have better hardware than x86_64 to the world shortly.  However that is
a side comment.

Let's look at a bit of code from our friendly Vincent Lefèvre as well as
Jean-Michel Muller ( http://perso.ens-lyon.fr/jean-michel.muller/ ) :


    /*
     * $Id: pb-muller1.c 47490 2011-11-10 17:07:08Z vinc17/ypig $
     *
     * Link with -lmpfr -lgmp *in that order*
     *
     * Example by Jean-Michel Muller:
     *   u_0 = 2
     *   u_1 = -4
     *   u_{n+1} = 111 - 1130 / u_n + 3000 / (u_n u_{n-1})
     * --> 6 (not 100)
     *
     * Arguments: precision, n
     */

    #include <stdio.h>
    #include <stdlib.h>
    #include <gmp.h>
    #include <mpfr.h>

    static const mp_rnd_t rnd[3] = { MPFR_RNDD, MPFR_RNDN, MPFR_RNDU };
    static int nep = 0;

    static void check_sign (mpfr_t x[3])
    {
      if (!nep && MPFR_SIGN (x[0]) != MPFR_SIGN (x[2]))
        nep = 1;
    }

    static void mdiv (mpfr_t tmp, unsigned long c, mpfr_t x[3], int r)
    {
      int neg;

      neg = MPFR_SIGN (x[0]) < 0;
      mpfr_set_ui (tmp, c, rnd[neg ? 2-r : r]);
      mpfr_div (tmp, tmp, x[2-r], rnd[r]);
    }

    static void display (mpfr_t x[3], long k)
    {
      int i;

      printf ("%6ld   ", k);
      for (i = 0; i < 3; i++)
        {
          size_t len;

len = (i | nep) == i ? mpfr_out_str (stdout, 10, 17, x[i],rnd[i]) : 0;

          if (i < 2)
            while (len++ < 24)
              putchar (' ');
        }
      putchar ('\n');
    }

    int main (int argc, char *argv[])
    {
      int prec, n, i, k;
      mpfr_t u[3], v[3], p[3], tmp;

      if (argc != 3)
        {
          fprintf (stderr, "Usage: pb-muller1 <prec> <n>\n");
          exit (EXIT_FAILURE);
        }

      prec = atoi (argv[1]);
      if (prec < MPFR_PREC_MIN || prec > MPFR_PREC_MAX)
        {
          fprintf (stderr, "pb-muller1: incorrect precision %d\n", prec);
          exit (EXIT_FAILURE);
        }
      mpfr_set_default_prec (prec);

      n = atoi (argv[2]);
      if (n <= 0)
        {
          fprintf (stderr, "pb-muller1: n must be positive\n");
          exit (EXIT_FAILURE);
        }

      for (i = 0; i < 3; i++)
        {
          mpfr_init (u[i]);
          mpfr_init (v[i]);
          mpfr_init (p[i]);
          mpfr_set_si (u[i], 2, rnd[i]);
          mpfr_set_si (v[i], -4, rnd[i]);
        }
      mpfr_init (tmp);

      printf("     N   ");
      printf("MPFR_RNDD");
      printf("               MPFR_RNDN");
      printf("               MPFR_RNDU\n");

      display (u, 0);
      display (v, 1);

      for (k = 2; k <= n; k++)
        {
          int s;

          check_sign (v);
          s = MPFR_SIGN(u[1]) * MPFR_SIGN(v[1]);
          for (i = 0; i < 3; i++)
            if ((i | nep) == i)
              {
                mpfr_mul (p[s < 0 ? 2-i : i], u[i], v[i], rnd[i]);
                mpfr_set (u[i], v[i], rnd[i]);
              }
          check_sign (p);
          for (i = 0; i < 3; i++)
            if ((i | nep) == i)
              {
                mpfr_set_si (v[i], 111, rnd[i]);
                mdiv (tmp, 1130, u, 2-i);
                mpfr_sub (v[i], v[i], tmp, rnd[2-i]);
                mdiv (tmp, 3000, p, i);
                mpfr_add (v[i], v[i], tmp, rnd[i]);
              }
          display (v, k);
        }

      for (i = 0; i < 3; i++)
        {
          mpfr_clear (u[i]);
          mpfr_clear (v[i]);
          mpfr_clear (p[i]);
        }
      mpfr_clear (tmp);

      return 0;
    }

    /*
    First 11 decimals of u_{100}: 6.0000000160, with prec >= 437,
    and >= 577 to guarantee the result in interval arithmetic.
    */

$

Here we use extended precision floating point libs to try to reproduce
whatever has happened in our terrible code above :

$ $CC $CFLAGS -I/usr/local/include -L/usr/local/lib-D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -o pb-muller1 pb-muller1.c-lmpfr -lgmp

$ ./pb-muller1
Usage: pb-muller1 <prec> <n>
$ ./pb-muller1 113 30
     N   MPFR_RNDD               MPFR_RNDN               MPFR_RNDU
     0   2.0000000000000000      2.0000000000000000      2.0000000000000000

1 -4.0000000000000000 -4.0000000000000000-4.00000000000000002 1.8500000000000000e1 1.8500000000000000e11.8500000000000000e1

     3   9.3783783783783783      9.3783783783783784      9.3783783783783784
     4   7.8011527377521613      7.8011527377521614      7.8011527377521614
     5   7.1544144809752493      7.1544144809752494      7.1544144809752494
     6   6.8067847369236329      6.8067847369236330      6.8067847369236330
     7   6.5926327687044383      6.5926327687044384      6.5926327687044384
     8   6.4494659337902879      6.4494659337902880      6.4494659337902880
     9   6.3484520566543571      6.3484520566543571      6.3484520566543572
    10   6.2744385982163279      6.2744385982163279      6.2744385982163280
    11   6.2186957398023977      6.2186957398023978      6.2186957398023978
    12   6.1758373049212301      6.1758373049212301      6.1758373049212302
    13   6.1423590812383559      6.1423590812383559      6.1423590812383560
    14   6.1158830665510803      6.1158830665510808      6.1158830665510812
    15   6.0947394393336623      6.0947394393336811      6.0947394393337000
    16   6.0777223048464167      6.0777223048472427      6.0777223048480687
    17   6.0639403224632851      6.0639403224998087      6.0639403225363323
    18   6.0527217593924258      6.0527217610161519      6.0527217626398778
    19   6.0435520376871057      6.0435521101892642      6.0435521826914195
    20   6.0360286321323002      6.0360318810817798      6.0360351300311756
    21   6.0297013055414662      6.0298473250226262      6.0299933444426353
    22   6.0181710657610797      6.0247496523456908      6.0313281169442034
    23   5.7234455935329964      6.0205399837103304      6.3173863825532198

24 -7.6979694052129188 6.01705825149564511.9224751938450067e1

    25                           6.0141748176010937
    26                           6.0117829758099306
    27                           6.0097744237027211
    28                           6.0077081713888591
    29                           5.9993587346059263
    30                           5.8818446518449744
$

I use 113 bits for obvious reasons here.

Please seehttps://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format.


The above results illustrate the wildly different results we can get
when we use different methods of "rounding" on the ULP which is the unit
of least precision. We seem to circle the correct result with "rounding
to even" but we don't quite get there.  In fact we can not.  Using 113
bits of binary precision in the significand will not be sufficient. My
own experiments show that we need a minimal 172 bits and I reveal the
state of various floating point flags :


    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <gmp.h>
    #include <mpfr.h>

    #define PREC 172    /* based on experimental results */
    #define LOOPCNT 32

    int mpfr_check_flags( int status );

    int main (int argc, char *argv[]) {
        mpfr_t u, v, w, t0, t1, t2, c111, c1130, c3000;
        mpfr_flags_t mpfr_flags;

        int mpfr_underflow_flag, mpfr_overflow_flag, mpfr_divby0_flag,
            mpfr_nanflag_flag, mpfr_inexflag_flag, mpfr_erangeflag_flag;
        int i, max = LOOPCNT;
        int inex; /* ret from mpfr calls */

        printf(" GMP vers : %d.%d.%d\n", __GNU_MP_VERSION,
                  __GNU_MP_VERSION_MINOR, __GNU_MP_VERSION_PATCHLEVEL );

        printf("MPFR vers : %-12s\nMPFR header: %s (based on %d.%d.%d)\n",
              mpfr_get_version (), MPFR_VERSION_STRING, MPFR_VERSION_MAJOR,
              MPFR_VERSION_MINOR, MPFR_VERSION_PATCHLEVEL);

        printf("MPFR_VERSION_MAJOR = %d\n", MPFR_VERSION_MAJOR);
        printf("MPFR_VERSION_MINOR = %d\n", MPFR_VERSION_MINOR);
        printf("MPFR_VERSION_PATCHLEVEL = %d\n", MPFR_VERSION_PATCHLEVEL);

        if (mpfr_buildopt_tls_p()!=0)
            printf("          : compiled as thread safe using TLS\n");

        if (mpfr_buildopt_float128_p()!=0)
            printf("          : __float128 support enabled\n");

        if (mpfr_buildopt_decimal_p()!=0)
            printf("          : decimal float support enabled\n");

        if (mpfr_buildopt_gmpinternals_p()!=0)
            printf("          : compiled with GMP internals\n");

        if (mpfr_buildopt_sharedcache_p()!=0)
            printf("          : threads share cache per MPFR const\n");

        printf("MPFR thresholds file used at compile time : %s\n",
                                          mpfr_buildopt_tune_case () );

        printf("\n\n");

        mpfr_set_default_prec ((mpfr_prec_t) PREC);

mpfr_inits(u, v, w, t0, t1, t2, c111, c1130, c3000, (mpfr_ptr)NULL);


        inex = mpfr_set_si(u, (long) 2, MPFR_RNDN);
        if ( inex != 0 ){
            fprintf(stderr,"FAIL : mpfr_set_si(u, (long) 2, MPFR_RNDN)\n");
            goto fail_out;
        }

        inex = mpfr_set_si(v, (long) -4, MPFR_RNDN);
        if ( inex != 0 ){

fprintf(stderr,"FAIL : mpfr_set_si(v, (long) -4,MPFR_RNDN)\n");

            goto fail_out;
        }

        inex = mpfr_set_si(c111, (long) 111, MPFR_RNDN);
        if ( inex != 0 ){

fprintf(stderr,"FAIL : mpfr_set_si(c111, (long) 111,MPFR_RNDN)\n");

            goto fail_out;
        }

        inex = mpfr_set_si(c1130, (long) 1130, MPFR_RNDN);
        if ( inex != 0 ){

fprintf(stderr,"FAIL : mpfr_set_si(c1130, (long) 1130,MPFR_RNDN)\n");

            goto fail_out;
        }

        inex = mpfr_set_si(c3000, (long) 3000, MPFR_RNDN);
        if ( inex != 0 ){

fprintf(stderr,"FAIL : mpfr_set_si(c3000, (long) 3000,MPFR_RNDN)\n");

            goto fail_out;
        }

        for (i = 3; i <= max; i++){
            mpfr_clear_flags();
            inex = mpfr_div(t0, c1130, v, MPFR_RNDN);
            inex = mpfr_check_flags(inex);
            if ( inex != 0 ){

fprintf(stderr,"FAIL : mpfr_div(t0, c1130, v,MPFR_RNDN)\n");

                fprintf(stderr,"     : returned %i\n", inex );
                mpfr_printf("t0 = %.30Rg\n", t0);
                goto fail_out;
            }
            mpfr_printf("t0 = 1130 / v = %.30Rg\n", t0);

            mpfr_clear_flags();
            inex = mpfr_sub(w, c111, t0, MPFR_RNDN);
            inex = mpfr_check_flags(inex);
            if ( inex != 0 ){

fprintf(stderr,"FAIL : mpfr_sub(w, c111, t0,MPFR_RNDN)\n");

                goto fail_out;
            }
            mpfr_printf("w = 111 - t0 = %.30Rg\n", w);

            mpfr_clear_flags();
            inex = mpfr_mul(t1, v, u, MPFR_RNDN);
            inex = mpfr_check_flags(inex);
            if ( inex != 0 ){
                fprintf(stderr,"FAIL : mpfr_mul(t1, v, u, MPFR_RNDN)\n");
                goto fail_out;
            }
            mpfr_printf("t1 = v * u = %.30Rg\n", t1);

            mpfr_clear_flags();
            inex = mpfr_div(t2, c3000, t1, MPFR_RNDN);
            inex = mpfr_check_flags(inex);
            if ( inex != 0 ){

fprintf(stderr,"FAIL : mpfr_div(t2, c3000, t1,MPFR_RNDN)\n");

                goto fail_out;
            }
            mpfr_printf("t2 = 3000 / t1 = %.30Rg\n", t2);

            mpfr_clear_flags();
            inex = mpfr_add(w, w, t2, MPFR_RNDN);
            inex = mpfr_check_flags(inex);
            if ( inex != 0 ){
                fprintf(stderr,"FAIL : mpfr_add(w, w, t2, MPFR_RNDN)\n");
                goto fail_out;
            }
            mpfr_printf("w = w + t2 = %.30Rg\n", w);

            mpfr_clear_flags();
            inex = mpfr_set(u, v, MPFR_RNDN);
            inex = mpfr_check_flags(inex);
            if ( inex != 0 ){
                fprintf(stderr,"FAIL : mpfr_set(u, v, MPFR_RNDN)\n");
                goto fail_out;
            }
            mpfr_printf ("u = v = %.30Rg\n", u);

            mpfr_clear_flags();
            inex = mpfr_set(v, w, MPFR_RNDN);
            inex = mpfr_check_flags(inex);
            if ( inex != 0 ){
                fprintf(stderr,"FAIL : mpfr_set(v, w, MPFR_RNDN)\n");
                goto fail_out;
            }
            mpfr_printf ("v = w = %.30Rg\n\n", v);
            printf("u%02i = ", i);
            mpfr_printf ("%.30Rg\n", v);
        }

mpfr_clears( u, v, w, t0, t1, t2, c111, c1130, c3000,(mpfr_ptr) NULL);

        return EXIT_SUCCESS;
    fail_out:

mpfr_clears( u, v, w, t0, t1, t2, c111, c1130, c3000,(mpfr_ptr) NULL);

        return EXIT_FAILURE;
    }

$ $CC $CFLAGS -I/usr/local/include -L/usr/local/lib-D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -o hfpa_page_11hfpa_page_11.c -lmpfr -lgmp


$ ./hfpa_page_11
 GMP vers : 6.1.2
MPFR vers : 4.0.1-p13
MPFR header: 4.0.1-p13 (based on 4.0.1)
MPFR_VERSION_MAJOR = 4
MPFR_VERSION_MINOR = 0
MPFR_VERSION_PATCHLEVEL = 1
          : compiled as thread safe using TLS
MPFR thresholds file used at compile time : src/amd/k8/mparam.h


t0 = 1130 / v = -282.5
w = 111 - t0 = 393.5
t1 = v * u = -8
t2 = 3000 / t1 = -375
w = w + t2 = 18.5
u = v = -4
v = w = 18.5

u03 = 18.5
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 61.0810810810810810810810810811
w = 111 - t0 = 49.9189189189189189189189189189
t1 = v * u = -74
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = -40.5405405405405405405405405405
w = w + t2 = 9.37837837837837837837837837838
u = v = 18.5
v = w = 9.37837837837837837837837837838

u04 = 9.37837837837837837837837837838
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 120.489913544668587896253602305
w = 111 - t0 = -9.48991354466858789625360230548
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 173.5
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 17.2910662824207492795389048991
w = w + t2 = 7.80115273775216138328530259366
u = v = 9.37837837837837837837837837838
v = w = 7.80115273775216138328530259366

u05 = 7.80115273775216138328530259366
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 144.850387883265607683782785371
w = 111 - t0 = -33.8503878832656076837827853713
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 73.1621621621621621621621621622
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 41.0048023642408570373106760251
w = w + t2 = 7.15441448097524935352789065386
u = v = 7.80115273775216138328530259366
v = w = 7.15441448097524935352789065386

u06 = 7.15441448097524935352789065386
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 157.944441575876490938193834874
w = 111 - t0 = -46.9444415758764909381938348738
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 55.8126801152737752161383285303
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 53.75122631280012392213559147
w = w + t2 = 6.80678473692363298394175659627
u = v = 7.15441448097524935352789065386
v = w = 6.80678473692363298394175659627

u07 = 6.80678473692363298394175659627
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 166.010832378799487206717895424
w = 111 - t0 = -55.0108323787994872067178954235
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 48.6985592907277428888067971925
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 61.6034651475039255994598981999
w = w + t2 = 6.59263276870443839274200277637
u = v = 6.80678473692363298394175659627
v = w = 6.59263276870443839274200277637

u08 = 6.59263276870443839274200277637
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 171.403449827232486505953949374
w = 111 - t0 = -60.4034498272324865059539493745
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 44.874632106159962823359322559
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 66.8529157610227744748224285946
w = w + t2 = 6.44946593379028796886847922015
u = v = 6.59263276870443839274200277637
v = w = 6.44946593379028796886847922015

u09 = 6.44946593379028796886847922015
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 175.208305866019214125873951209
w = 111 - t0 = -64.2083058660192141258739512095
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 42.51896045574882232016203054
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 70.5567579226735712729746427703
w = w + t2 = 6.34845205665435714710069156081
u = v = 6.44946593379028796886847922015
v = w = 6.34845205665435714710069156081

u10 = 6.34845205665435714710069156081
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 177.996146133851648579093411262
w = 111 - t0 = -66.996146133851648579093411262
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 40.9441252716931676575532714216
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 73.270584732067976492922789724
w = w + t2 = 6.27443859821632791382937846207
u = v = 6.34845205665435714710069156081
v = w = 6.27443859821632791382937846207

u11 = 6.27443859821632791382937846207
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 180.095793800776349973236624212
w = 111 - t0 = -69.0957938007763499732366242119
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 39.8329726231979286181076071689
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 75.3144895405787477615577757815
w = w + t2 = 6.21869573980239778832115156959
u = v = 6.27443859821632791382937846207
v = w = 6.21869573980239778832115156959

u12 = 6.21869573980239778832115156959
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 181.71012818130033215181419855
w = 111 - t0 = -70.7101281813003321518141985497
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 39.0188245803796070521231630828
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 76.8859654862215622716775384113
w = w + t2 = 6.17583730492123011986333986151
u = v = 6.21869573980239778832115156959
v = w = 6.17583730492123011986333986151

u13 = 6.17583730492123011986333986151
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 182.971141273355259617962054216
w = 111 - t0 = -71.9711412733552596179620542163
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 38.4056531378263756715326672655
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 78.1135003545936155573081943698
w = w + t2 = 6.14235908123835593934614015355
u = v = 6.17583730492123011986333986151
v = w = 6.14235908123835593934614015355

u14 = 6.14235908123835593934614015355
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 183.968404493242624565635107533
w = 111 - t0 = -72.9684044932426245656351075329
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 37.9342103541335313184967384766
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 79.0842875597937053293793082179
w = w + t2 = 6.11588306655108076374420068497
u = v = 6.14235908123835593934614015355
v = w = 6.11588306655108076374420068497

u15 = 6.11588306655108076374420068497
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 184.764814451764677499945188349
w = 111 - t0 = -73.7648144517646774999451883494
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 37.5659498936219153328075416891
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 79.8595538910983586282652275967
w = w + t2 = 6.09473943933368112832003924736
u = v = 6.11588306655108076374420068497
v = w = 6.09473943933368112832003924736

u16 = 6.09473943933368112832003924736
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 185.405793184087190266717806473
w = 111 - t0 = -74.405793184087190266717806473
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 37.2747137320618884011862075346
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 80.4835154889344330029996346197
w = w + t2 = 6.07772230484724273628182814674
u = v = 6.09473943933368112832003924736
v = w = 6.07772230484724273628182814674

u17 = 6.07772230484724273628182814674
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 185.924914519173870218307712212
w = 111 - t0 = -74.9249145191738702183077122119
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 37.042133832670492411520431721
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 80.9888548416736789735738791443
w = w + t2 = 6.06394032249980875526616693243
u = v = 6.07772230484724273628182814674
v = w = 6.06394032249980875526616693243

u18 = 6.06394032249980875526616693243
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 186.347480335058267383579967778
w = 111 - t0 = -75.347480335058267383579967778
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 36.8549453533196700991001096141
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 81.4002020960744195769362518193
w = w + t2 = 6.05272176101615219335628404129
u = v = 6.06394032249980875526616693243
v = w = 6.05272176101615219335628404129

u19 = 6.05272176101615219335628404129
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 186.692870516204205980381685952
w = 111 - t0 = -75.6928705162042059803816859523
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 36.7033435474978963079278362567
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 81.7364226263934748481591633164
w = w + t2 = 6.0435521101892688677774773641
u = v = 6.05272176101615219335628404129
v = w = 6.0435521101892688677774773641

u20 = 6.0435521101892688677774773641
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 186.976132479250061286265756921
w = 111 - t0 = -75.9761324792500612862657569211
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 36.5799393711776741269191244542
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 82.0121643603319180662764005427
w = w + t2 = 6.03603188108185678001064362156
u = v = 6.0435521101892688677774773641
v = w = 6.03603188108185678001064362156

u21 = 6.03603188108185678001064362156
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 187.209084090766363397143239031
w = 111 - t0 = -76.2090840907663633971432390312
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 36.4790732120819575455522510051
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 82.2389314157902652538562503843
w = w + t2 = 6.02984732502390185671301135315
u = v = 6.03603188108185678001064362156
v = w = 6.02984732502390185671301135315

u22 = 6.02984732502390185671301135315
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 187.401096427515395813994063809
w = 111 - t0 = -76.4010964275153958139940638091
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 36.3963506919004245801170798372
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 82.4258460798822437127375842389
w = w + t2 = 6.02474965236684789874352042985
u = v = 6.02984732502390185671301135315
v = w = 6.02474965236684789874352042985

u23 = 6.02474965236684789874352042985
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 187.559660600349561648427131321
w = 111 - t0 = -76.5596606003495616484271313206
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 36.3283205752629204238431248847
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 82.5802005844110777108582694273
w = w + t2 = 6.02053998406151606243113810664
u = v = 6.02474965236684789874352042985
v = w = 6.02053998406151606243113810664

u24 = 6.02053998406151606243113810664
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 187.690805640608133312326116058
w = 111 - t0 = -76.6908056406081333123261160582
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 36.272246176035326886178724729
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 82.7078638979371209235032988195
w = w + t2 = 6.0170582573289876111771827613
u = v = 6.02053998406151606243113810664
v = w = 6.0170582573289876111771827613

u25 = 6.0170582573289876111771827613
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 187.799411551919152411333134492
w = 111 - t0 = -76.7994115519191524113331344923
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 36.2259398246766766867425191838
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 82.8135864664699713738641132522
w = w + t2 = 6.01417491455081896253097875987
u = v = 6.0170582573289876111771827613
v = w = 6.01417491455081896253097875987

u26 = 6.01417491455081896253097875987
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 187.889447190179764969201236571
w = 111 - t0 = -76.8894471901797649692012365712
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 36.1876408306188637229490105522
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 82.9012317780510986425852750301
w = w + t2 = 6.01178458787133367338403845895
u = v = 6.01417491455081896253097875987
v = w = 6.01178458787133367338403845895

u27 = 6.01178458787133367338403845895
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 187.964153319757081280027483314
w = 111 - t0 = -76.9641533197570812800274833137
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 36.1559240600590085878407693164
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 82.9739545590555658478143724685
w = w + t2 = 6.00980123929848456778688915479
u = v = 6.01178458787133367338403845895
v = w = 6.00980123929848456778688915479

u28 = 6.00980123929848456778688915479
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 188.026185060972710385415447864
w = 111 - t0 = -77.0261850609727103854154478637
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 36.1296304665846704072244722298
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 83.0343394398849393132364128061
w = w + t2 = 6.00815437891222892782096494236
u = v = 6.00980123929848456778688915479
v = w = 6.00815437891222892782096494236

u29 = 6.00815437891222892782096494236
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 188.0777238291579164286779791
w = 111 - t0 = -77.0777238291579164286779790997
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 36.1078136322833302456565987844
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 83.0845099221891221872109178848
w = w + t2 = 6.00678609303120575853293878513
u = v = 6.00815437891222892782096494236
v = w = 6.00678609303120575853293878513

u30 = 6.00678609303120575853293878513
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 188.120566056276503242641451445
w = 111 - t0 = -77.1205660562765032426414514453
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 36.0896981680345182060442268243
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 83.1262147450479235105736445903
w = w + t2 = 6.00564868877142026793219314497
u = v = 6.00678609303120575853293878513
v = w = 6.00564868877142026793219314497

u31 = 6.00564868877142026793219314497
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t0 = 1130 / v = 188.156194036578734087817277753
w = 111 - t0 = -77.1561940365787340878172777526
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t1 = v * u = 36.0746470233432633440888930245
INFO : mpfr_inexflag_flag is set.
     : clear flags and carry on.
t2 = 3000 / t1 = 83.1608968497669092896537174097
w = w + t2 = 6.0047028131881752018364396571
u = v = 6.00564868877142026793219314497
v = w = 6.0047028131881752018364396571

u32 = 6.0047028131881752018364396571
$
$ echo $?
0

We get the exact same result on IBM Power hardware and Oracle SPARC
given that the implementation uses cross platform floating point libs
libgmp and libmpfr.

However why are we seeing such horrific results from trivial code ?

Let me quote one of the authors of (3) Vincent Lefèvre :

    ... for any computation system on approximated numbers with a format
    of bounded length, at least one operation will be inexact (because
    if the sequence is computed exactly, you will get an infinity of
    different numbers), and as soon as you get an error, the limit will
    change from 6 to 100.  Thus MPFR cannot help you to find the limit
    (and obviously unums won't help you either). Well, with arbitrary
    precision, you can increase the precision and observe the various
    values, so that you can conjecture that the limit is 6, but this is
    not a proof. Interval arithmetic in various precisions can be useful
    to see what's going on with some additional details.
    On this ( trivial ) example, MPFR can be used too to get bounds,
    without needing an interval arithmetic package ...

However that is not in any book. Let's try to sum up here and merely say
that some serious reading and experiments are needed to get a good
handle on why numerical computation is as much art as it is science. If
we wander into the problem without sufficient study and VERY careful
consideration then we are doomed to repeat the errors of the past. I want
to point out that this has been quite long enough and I did not bother
to drag in output data from the IBM Power or RISC-V systems but all one
needs to know is that the results are poor if rounding errors are not
trapped and handled.  Also, there will ALWAYS be inexact flags set and
computation rounding errors thrown in every language we have. People are
often "mindless" about how modern computing machinary works and have no

clue about roundoff errors. However I didn't say that. Kahan did. See(6) above.


Dennis Clarke
ye old UNIX/Linux greybeard

ps: https://en.wikipedia.org/wiki/Minimax_approximation_algorithm


lastly just for fun :

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <fenv.h>
#pragma STDC FENV_ACCESS ON

size_t bin_printf ( uint8_t* f, size_t n );

int
main (int argc, char* argv[])
{

    float foo = 1.100f;
    float epsilon = 1.0E-7f; /* this is _about_ 0.000001 */
    int fpe;
    int j;

    feclearexcept(FE_ALL_EXCEPT);
    if (fesetround(FE_TONEAREST)!=0){
        fprintf(stderr,"ERROR : can not set floating point rounding!\n");
        return EXIT_FAILURE;
    }

    printf("starting value of foo = %8.7e\n", (double)foo );
    bin_printf((uint8_t*)&foo, (size_t)sizeof(foo));
    for ( j=0; fabsf(foo)>epsilon; j++ ){
        /* do a trivial subtraction */
        foo = foo - 0.10f;
        fpe = fetestexcept(FE_ALL_EXCEPT);
        if (fpe!=0){
            printf("INFO : Exception raised was");
            if(fpe & FE_INEXACT) printf(" FE_INEXACT");
            if(fpe & FE_DIVBYZERO) printf(" FE_DIVBYZERO");
            if(fpe & FE_UNDERFLOW) printf(" FE_UNDERFLOW");
            if(fpe & FE_OVERFLOW) printf(" FE_OVERFLOW");
            if(fpe & FE_INVALID) printf(" FE_INVALID");
            printf("\n");
        }
        printf("foo = %8.7e\n", (double)foo );
        bin_printf((uint8_t*)&foo, (size_t)sizeof(foo));
        feclearexcept(FE_ALL_EXCEPT);
    }
    printf("\nfinal value of foo = %8.7e\n", (double)foo );
    bin_printf((uint8_t*)&foo, (size_t)sizeof(foo));
    return EXIT_SUCCESS;
}

$ ./subtraction
starting value of foo = 1.1000000e+00

 00111111 10001100 11001100 11001101
INFO : Exception raised was FE_INEXACT
foo = 1.0000000e+00

 00111111 10000000 00000000 00000000
INFO : Exception raised was FE_INEXACT
foo = 8.9999998e-01

 00111111 01100110 01100110 01100110
INFO : Exception raised was FE_INEXACT
foo = 7.9999995e-01

 00111111 01001100 11001100 11001100
INFO : Exception raised was FE_INEXACT
foo = 6.9999993e-01

 00111111 00110011 00110011 00110010
INFO : Exception raised was FE_INEXACT
foo = 5.9999990e-01

 00111111 00011001 10011001 10011000
INFO : Exception raised was FE_INEXACT
foo = 4.9999991e-01

 00111110 11111111 11111111 11111101
INFO : Exception raised was FE_INEXACT
foo = 3.9999992e-01

 00111110 11001100 11001100 11001010
INFO : Exception raised was FE_INEXACT
foo = 2.9999992e-01

 00111110 10011001 10011001 10010111
INFO : Exception raised was FE_INEXACT
foo = 1.9999993e-01

 00111110 01001100 11001100 11001000
foo = 9.9999927e-02

 00111101 11001100 11001100 11000011
foo = -7.4505806e-08

 10110011 10100000 00000000 00000000

final value of foo = -7.4505806e-08

 10110011 10100000 00000000 00000000
$


_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Question about floating point

Reply via email to