Re: Floating Point + Threads?

2011-04-20 Thread Don

Sean Kelly wrote:

On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:


On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright newshou...@digitalmars.com 
wrote:


The dmd startup code (actually the C startup code) does an fninit. I never 
thought about new thread starts. So, yeah, druntime should do an fninit on 
thread creation.

The documentation I've found on fninit seems to indicate it defaults to 64-bit 
precision, which means that by default we aren't seeing the benefit of D's 
reals. I'd much prefer 80-bit precision by default.


There is no option to set 80-bit precision via the FPU control word.  


??? Yes there is.

enum PrecisionControl : short {
PRECISION80 = 0x300,
PRECISION64 = 0x200,
PRECISION32 = 0x000
};

/** Set the number of bits of precision used by 'real'.
 *
 * Returns: the old precision.
 * This is not supported on all platforms.
 */
PrecisionControl reduceRealPrecision(PrecisionControl prec) {
    version(D_InlineAsm_X86) {
        short cont;
        asm {
            fstcw cont;          // store the current FPU control word
            mov CX, cont;
            mov AX, cont;
            and EAX, 0x0300;     // isolate the PC field to form the return value
            and CX,  0xFCFF;     // clear the PC field...
            or  CX,  prec;       // ...then set the requested precision
            mov cont, CX;
            fldcw cont;          // load the modified control word
        }
    } else {
        assert(0, "Not yet supported");
    }
}


Re: Floating Point + Threads?

2011-04-20 Thread Sean Kelly
On Apr 20, 2011, at 5:06 AM, Don wrote:

 Sean Kelly wrote:
 [snip]
 There is no option to set 80-bit precision via the FPU control word.  
 
 ??? Yes there is.
 
 enum PrecisionControl : short {
PRECISION80 = 0x300,
PRECISION64 = 0x200,
PRECISION32 = 0x000
 };

So has Intel deprecated 80-bit FPU support?  Why do the docs for this say that 
64-bit is the highest precision?  And more importantly, does this mean that we 
should be setting the PC field explicitly instead of relying on fninit?  The 
docs say that fninit initializes to 64-bit precision.  Or is that inaccurate as 
well?

Re: Floating Point + Threads?

2011-04-20 Thread JimBob

Sean Kelly s...@invisibleduck.org wrote in message 
news:mailman.3597.1303316625.4748.digitalmar...@puremagic.com...
[snip]

So has Intel deprecated 80-bit FPU support? Why do the docs for this say that
64-bit is the highest precision? And more importantly, does this mean that we
should be setting the PC field explicitly instead of relying on fninit? The
docs say that fninit initializes to 64-bit precision. Or is that inaccurate
as well?

You misread the docs; "precision" there refers to the size of the mantissa
(significand), not the full width of the floating point format, i.e.:

80-bit float = 64-bit precision
64-bit float = 53-bit precision
32-bit float = 24-bit precision





Re: Floating Point + Threads?

2011-04-20 Thread Sean Kelly
On Apr 20, 2011, at 10:46 AM, JimBob wrote:
 
 [snip]
 You misread the docs; "precision" is the size of the mantissa, not the
 actual full size of the floating point data, i.e.:
 
 80 float = 64 bit precision
 64 float = 53 bit precision
 32 float = 24 bit precision

Oops, you're right.  So to summarize: fninit does what we want because it sets 
64-bit precision, which is effectively 80-bit mode.  Is this correct?

Re: Floating Point + Threads?

2011-04-20 Thread Don

Sean Kelly wrote:

[snip]

Oops, you're right.  So to summarize: fninit does what we want because it sets 
64-bit precision, which is effectively 80-bit mode.  Is this correct?


Yes.


Re: Floating Point + Threads?

2011-04-19 Thread Sean Kelly
On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:

 On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright 
 newshou...@digitalmars.com wrote:
 
 
 The dmd startup code (actually the C startup code) does an fninit. I never 
 thought about new thread starts. So, yeah, druntime should do an fninit on 
 thread creation.
 
 The documentation I've found on fninit seems to indicate it defaults to 
 64-bit precision, which means that by default we aren't seeing the benefit of 
 D's reals. I'd much prefer 80-bit precision by default.

There is no option to set 80-bit precision via the FPU control word.  Section 
8.1.5.2 of the Intel 64 SDM says the following:

The precision-control (PC) field (bits 8 and 9 of the x87 FPU control word) 
determines the precision (64, 53, or 24 bits) of floating-point calculations 
made by the x87 FPU (see Table 8-2). The default precision is double extended 
precision, which uses the full 64-bit significand available with the double 
extended-precision floating-point format of the x87 FPU data registers. This 
setting is best suited for most applications, because it allows applications to 
take full advantage of the maximum precision available with the x87 FPU data 
registers.

So it sounds like finit/fninit does what we want.

Re: Floating Point + Threads?

2011-04-19 Thread Robert Jacques
On Tue, 19 Apr 2011 14:18:46 -0400, Sean Kelly s...@invisibleduck.org  
wrote:



[snip]

So it sounds like finit/fninit does what we want.


Yes, that sounds right. Thanks for clarifying.


Re: Floating Point + Threads?

2011-04-16 Thread Robert Jacques

On Fri, 15 Apr 2011 23:22:04 -0400, dsimcha dsim...@yahoo.com wrote:

I'm trying to debug an extremely strange bug whose symptoms appear in a  
std.parallelism example, though I'm not at all sure the root cause is in  
std.parallelism.  The bug report is at  
https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717  
.


Basically, the example in question sums up all the elements of a lazy  
range (actually, std.algorithm.map) in parallel.  It uses  
taskPool.reduce, which divides the summation into work units to be  
executed in parallel.  When executed in parallel, the results of the  
summation are non-deterministic after about the 12th decimal place, even  
though all of the following properties are true:


1.  The work is divided into work units in a deterministic fashion.

2.  Within each work unit, the summation happens in a deterministic  
order.


3.  The final summation of the results of all the work units is done in  
a deterministic order.


4.  The smallest term in the summation is about 5e-10.  This means the  
difference across runs is about two orders of magnitude smaller than the  
smallest term.  It can't be a concurrency bug where some terms sometimes  
get skipped.


5.  The results for the individual tasks, not just the final summation,  
differ in the low-order bits.  Each task is executed in a single thread.


6.  The rounding mode is apparently the same in all of the threads.

7.  The bug appears even on machines with only one core, as long as the  
number of task pool threads is manually set to 0.  Since it's a single  
core machine, it can't be a low level memory model issue.


What could possibly cause such small, non-deterministic differences in  
floating point results, given everything above?  I'm just looking for  
suggestions here, as I don't even know where to start hunting for a bug  
like this.


Well, on one hand floating point math is not associative, and running sums
have many known issues (I'd recommend looking up Kahan summation). On the
other hand, it should be repeatably different.
As for suggestions? First and foremost, you should always add small to
large, so try using iota(n-1,-1,-1) instead of iota(n). Not only should the
answer be better, but if your error rate goes down, you have a good idea of
where the problem is. I'd also try isolating your implementation's numerics
from the underlying concurrency, i.e. use a task pool of 1 and don't let the
host thread join it, so the entire job is done by one worker. The other
thing to try is isolating/removing map and iota from the equation.


Re: Floating Point + Threads?

2011-04-16 Thread Walter Bright

On 4/15/2011 8:40 PM, Andrei Alexandrescu wrote:

On 4/15/11 10:22 PM, dsimcha wrote:

I'm trying to debug an extremely strange bug whose symptoms appear in a
std.parallelism example, though I'm not at all sure the root cause is in
std.parallelism. The bug report is at
https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .


Does the scheduling affect the summation order?


That's a good thought. FP addition results can differ dramatically depending on 
associativity.


Re: Floating Point + Threads?

2011-04-16 Thread Fawzi Mohamed


On 16-apr-11, at 09:41, Walter Bright wrote:


On 4/15/2011 8:40 PM, Andrei Alexandrescu wrote:

On 4/15/11 10:22 PM, dsimcha wrote:
I'm trying to debug an extremely strange bug whose symptoms appear  
in a
std.parallelism example, though I'm not at all sure the root cause  
is in

std.parallelism. The bug report is at
https://github.com/dsimcha/std.parallelism/issues/ 
1#issuecomment-1011717 .


Does the scheduling affect the summation order?


That's a good thought. FP addition results can differ dramatically  
depending on associativity.


Yes, one can avoid this by using a tree algorithm with a fixed block size;
then the results will be the same in both the single-threaded and the
parallel case.

Normally one uses atomic summation though.
In blip I spent quite a bit of thought on tree-like algorithms and their
parallelization, exactly because they parallelize well and the results are
independent of the parallelization.


Fawzi


Re: Floating Point + Threads?

2011-04-16 Thread Fawzi Mohamed


On 16-apr-11, at 05:22, dsimcha wrote:

I'm trying to debug an extremely strange bug whose symptoms appear  
in a std.parallelism example [snip]


It might be due to a thread context switch pushing a value out of a
higher-precision 80-bit FPU register and losing the extra precision.
SSE, or float, should not have this problem. gcc has an option
(-ffloat-store) to always store results to memory and avoid the extra
precision; maybe having such an option in dmd to debug these issues would
be a nice thing.


Fawzi


Re: Floating Point + Threads?

2011-04-16 Thread Iain Buclaw
== Quote from Walter Bright (newshou...@digitalmars.com)'s article
 On 4/15/2011 8:40 PM, Andrei Alexandrescu wrote:
  On 4/15/11 10:22 PM, dsimcha wrote:
  I'm trying to debug an extremely strange bug whose symptoms appear in a
  std.parallelism example, though I'm not at all sure the root cause is in
  std.parallelism. The bug report is at
  https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .
 
  Does the scheduling affect the summation order?
 That's a good thought. FP addition results can differ dramatically depending 
 on
 associativity.

And not to forget optimisations too. ;)


Re: Floating Point + Threads?

2011-04-16 Thread dsimcha

On 4/15/2011 11:40 PM, Andrei Alexandrescu wrote:

On 4/15/11 10:22 PM, dsimcha wrote:

I'm trying to debug an extremely strange bug whose symptoms appear in a
std.parallelism example, though I'm not at all sure the root cause is in
std.parallelism. The bug report is at
https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717
.


Does the scheduling affect the summation order?

Andrei


No.  I realize floating point addition isn't associative, but unless 
there's some detail I'm forgetting about, the ordering is deterministic 
within each work unit and the ordering of the final summation is 
deterministic.


Re: Floating Point + Threads?

2011-04-16 Thread dsimcha

On 4/16/2011 2:09 AM, Robert Jacques wrote:

[snip]


Right.  For this example, though, assuming floating point math behaves 
like regular math is a good enough approximation.  The issue isn't that 
the results aren't reasonably accurate.  Furthermore, the results will 
always change slightly depending on how many work units you have.  (I 
even warn in the documentation that floating point addition is not 
associative, though it is approximately associative in the well-behaved 
cases.)


My only concern is whether this non-determinism represents some deep 
underlying bug.  For a given work unit allocation (work unit allocations 
are deterministic and only change when the number of threads changes or 
they're changed explicitly), I can't figure out how scheduling could 
change the results at all.  If I could be sure that it wasn't a symptom 
of an underlying bug in std.parallelism, I wouldn't care about this 
small amount of numerical fuzz.  Floating point math is always inexact 
and parallel summation by its nature can't be made to give the exact 
same results as serial summation.


Re: Floating Point + Threads?

2011-04-16 Thread dsimcha

On 4/16/2011 10:11 AM, dsimcha wrote:

My only concern is whether this non-determinism represents some deep
underlying bug. For a given work unit allocation (work unit allocations
are deterministic and only change when the number of threads changes or
they're changed explicitly), I can't figure out how scheduling could
change the results at all. If I could be sure that it wasn't a symptom
of an underlying bug in std.parallelism, I wouldn't care about this
small amount of numerical fuzz. Floating point math is always inexact
and parallel summation by its nature can't be made to give the exact
same results as serial summation.


Ok, it's definitely **NOT** a bug in std.parallelism.  Here's a reduced 
test case that only uses core.thread, not std.parallelism.  All it does 
is sum an array using std.algorithm.reduce from the main thread, then 
start a new thread to do the same thing and compare answers.  At the 
beginning of the summation function the rounding mode is printed to 
verify that it's the same for both threads.  The two threads get 
slightly different answers.


Just to thoroughly rule out a concurrency bug, I didn't even let the two 
threads execute concurrently.  The main thread produces its result, then 
starts and immediately joins the second thread.


import std.algorithm, core.thread, std.stdio, core.stdc.fenv;

real sumRange(const(real)[] range) {
    writeln("Rounding mode:  ", fegetround);  // 0 from both threads.
    return reduce!"a + b"(range);
}

void main() {
    immutable n = 1_000_000;
    immutable delta = 1.0 / n;

    auto terms = new real[1_000_000];
    foreach(i, ref term; terms) {
        immutable x = ( i - 0.5 ) * delta;
        term = delta / ( 1.0 + x * x ) * 1;
    }

    immutable res1 = sumRange(terms);
    writefln("%.19f", res1);

    real res2;
    auto t = new Thread( { res2 = sumRange(terms); } );
    t.start();
    t.join();
    writefln("%.19f", res2);
}


Output:
Rounding mode:  0
0.7853986633972191094
Rounding mode:  0
0.7853986633972437348


Re: Floating Point + Threads?

2011-04-16 Thread dsimcha

On 4/16/2011 10:55 AM, Andrei Alexandrescu wrote:

On 4/16/11 9:52 AM, dsimcha wrote:

Output:
Rounding mode: 0
0.7853986633972191094
Rounding mode: 0
0.7853986633972437348


I think at this precision the difference may be in random bits. Probably
you don't need to worry about it.

Andrei


random bits?  I am fully aware that these low order bits are numerical 
fuzz and are meaningless from a practical perspective.  I am only 
concerned because I thought these bits are supposed to be deterministic 
even if they're meaningless.  Now that I've ruled out a bug in 
std.parallelism, I'm wondering if it's a bug in druntime or DMD.


Re: Floating Point + Threads?

2011-04-16 Thread Andrei Alexandrescu

On 4/16/11 9:52 AM, dsimcha wrote:

Output:
Rounding mode: 0
0.7853986633972191094
Rounding mode: 0
0.7853986633972437348


I think at this precision the difference may be in random bits. Probably 
you don't need to worry about it.


Andrei


Re: Floating Point + Threads?

2011-04-16 Thread Iain Buclaw
== Quote from dsimcha (dsim...@yahoo.com)'s article
 On 4/16/2011 10:11 AM, dsimcha wrote:
 Output:
 Rounding mode:  0
 0.7853986633972191094
 Rounding mode:  0
 0.7853986633972437348

This is not something I can replicate on my workstation.


Re: Floating Point + Threads?

2011-04-16 Thread dsimcha

On 4/16/2011 11:16 AM, Iain Buclaw wrote:

== Quote from dsimcha (dsim...@yahoo.com)'s article

On 4/16/2011 10:11 AM, dsimcha wrote:
Output:
Rounding mode:  0
0.7853986633972191094
Rounding mode:  0
0.7853986633972437348


This is not something I can replicate on my workstation.


Interesting.  Since I know you're a Linux user, I fired up my Ubuntu VM 
and tried out my test case.  I can't reproduce it either on Linux, only 
on Windows.


Re: Floating Point + Threads?

2011-04-16 Thread JimBob

dsimcha dsim...@yahoo.com wrote in message 
news:iocalv$2h58$1...@digitalmars.com...
 Output:
 Rounding mode:  0
 0.7853986633972191094
 Rounding mode:  0
 0.7853986633972437348

Could be something somewhere is getting truncated from real to double, which 
would mean 12 fewer bits of mantissa. Maybe the FPU is set to lower precision 
in one of the threads? 




Re: Floating Point + Threads?

2011-04-16 Thread Andrei Alexandrescu

On 4/16/11 9:59 AM, dsimcha wrote:

[snip]

random bits? I am fully aware that these low order bits are numerical
fuzz and are meaningless from a practical perspective. I am only
concerned because I thought these bits are supposed to be deterministic
even if they're meaningless. Now that I've ruled out a bug in
std.parallelism, I'm wondering if it's a bug in druntime or DMD.


I seem to remember that essentially some of the last bits printed in 
such a result are essentially arbitrary. I forgot what could cause this.


Andrei


Re: Floating Point + Threads?

2011-04-16 Thread Walter Bright

On 4/16/2011 6:46 AM, Iain Buclaw wrote:

== Quote from Walter Bright (newshou...@digitalmars.com)'s article

That's a good thought. FP addition results can differ dramatically depending on
associativity.


And not to forget optimisations too. ;)


The dmd optimizer is careful not to reorder evaluation in such a way as to 
change the results.


Re: Floating Point + Threads?

2011-04-16 Thread Timon Gehr

 Could be something somewhere is getting truncated from real to double, which
 would mean 12 fewer bits of mantissa. Maybe the FPU is set to lower precision
 in one of the threads?

Yes indeed, this is a _Windows_ bug.
I have experienced this on Windows before: the main thread's FPU control word
is initialized by the OS to lower precision (64-bit doubles) by default,
presumably to make FP calculations faster. However, when you start a new
thread, the FPU will use the whole 80 bits for computation because, curiously,
the FPU is not reconfigured for those.
Suggested fix: add

asm { fninit; }

to the beginning of your main function, and the difference between the two
will be gone.

This would be a compatibility issue between DMD and Windows which effectively
disables the real data type. You might want to file a bug report for druntime
if my suggested fix works.
(This would imply that the real type was basically identical to the double
type on Windows all along!)


Re: Floating Point + Threads?

2011-04-16 Thread dsimcha

On 4/16/2011 2:15 PM, Timon Gehr wrote:



[snip]


Close:  If I add this instruction to the function for the new thread, 
the difference goes away.  The relevant statement is:


auto t = new Thread( {
    asm { fninit; }
    res2 = sumRange(terms);
} );

At any rate, this is a **huge** WTF that should probably be fixed in 
druntime.  Once I understand it a little better, I'll file a bug report.


Re: Floating Point + Threads?

2011-04-16 Thread Iain Buclaw
== Quote from Walter Bright (newshou...@digitalmars.com)'s article
 On 4/16/2011 6:46 AM, Iain Buclaw wrote:
  == Quote from Walter Bright (newshou...@digitalmars.com)'s article
  That's a good thought. FP addition results can differ dramatically 
  depending on
  associativity.
 
  And not to forget optimisations too. ;)
 The dmd optimizer is careful not to reorder evaluation in such a way as to
 change the results.

And so it rightly shouldn't!

I was thinking more of a case of FPU precision rather than ordering: as in you 
get
a different result computing on SSE in double precision mode on the one hand, 
and
by computing on x87 in double precision then writing to a double variable in 
memory.


Classic example (which could either be a bug or non-bug depending on your POV):

void test(double x, double y)
{
    double y2 = x + 1.0;
    // With -O, y2 may be computed and compared in an 80-bit x87 register,
    // while y was already rounded to a 64-bit double in memory by the caller.
    assert(y == y2);   // triggers with -O
}

void main()
{
    double x = .012;
    double y = x + 1.0;
    test(x, y);
}



Re: Floating Point + Threads?

2011-04-16 Thread dsimcha

On 4/16/2011 2:24 PM, dsimcha wrote:

[snip]


Read up a little on what fninit does, etc.  This is IMHO a druntime bug. 
 Filed as http://d.puremagic.com/issues/show_bug.cgi?id=5847 .


Re: Floating Point + Threads?

2011-04-16 Thread Walter Bright

On 4/16/2011 9:52 AM, Andrei Alexandrescu wrote:

I seem to remember that some of the last bits printed in such a result are
essentially arbitrary. I forgot what could cause this.


To see what the exact bits are, print using %A format.

In any case, floating point bits are not random. They are completely 
deterministic.


Re: Floating Point + Threads?

2011-04-16 Thread Walter Bright

On 4/16/2011 11:43 AM, Iain Buclaw wrote:

I was thinking more of a case of FPU precision rather than ordering: you
get a different result computing on SSE in double-precision mode on the
one hand, and by computing on x87 in double precision and then writing to
a double variable in memory on the other.


You're right on that one.


Re: Floating Point + Threads?

2011-04-16 Thread Sean Kelly

On Apr 16, 2011, at 11:43 AM, dsimcha wrote:
 
 Close: If I add this instruction to the function for the new thread, the
 difference goes away. The relevant statement is:
 
 auto t = new Thread( {
 asm { fninit; }
 res2 = sumRange(terms);
 } );
 
 At any rate, this is a **huge** WTF that should probably be fixed in
 druntime. Once I understand it a little better, I'll file a bug report.
 
 Read up a little on what fninit does, etc.  This is IMHO a druntime bug.  
 Filed as http://d.puremagic.com/issues/show_bug.cgi?id=5847 .

Really a Windows bug that should be fixed in druntime :-)  I know I'm splitting 
hairs.  This will be fixed for the next release.



Re: Floating Point + Threads?

2011-04-16 Thread Walter Bright

On 4/16/2011 11:51 AM, Sean Kelly wrote:


On Apr 16, 2011, at 11:43 AM, dsimcha wrote:


Close: If I add this instruction to the function for the new thread, the
difference goes away. The relevant statement is:

auto t = new Thread( { asm { fninit; } res2 = sumRange(terms); } );

At any rate, this is a **huge** WTF that should probably be fixed in
druntime. Once I understand it a little better, I'll file a bug report.


Read up a little on what fninit does, etc.  This is IMHO a druntime bug.
Filed as http://d.puremagic.com/issues/show_bug.cgi?id=5847 .


Really a Windows bug that should be fixed in druntime :-)  I know I'm
splitting hairs.  This will be fixed for the next release.



The dmd startup code (actually the C startup code) does an fninit. I never 
thought about new thread starts. So, yeah, druntime should do an fninit on 
thread creation.


Re: Floating Point + Threads?

2011-04-16 Thread bearophile
Walter:

 The dmd startup code (actually the C startup code) does an fninit. I never 
 thought about new thread starts. So, yeah, druntime should do an fninit on 
 thread creation.

My congratulations to all the (mostly two) people involved in finding this bug 
and its causes :-)
I'd like to see this module in Phobos.

Bye,
bearophile


Re: Floating Point + Threads?

2011-04-16 Thread Robert Jacques
On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright  
newshou...@digitalmars.com wrote:



On 4/16/2011 11:51 AM, Sean Kelly wrote:


On Apr 16, 2011, at 11:43 AM, dsimcha wrote:


Close: If I add this instruction to the function for the new thread, the
difference goes away. The relevant statement is:

auto t = new Thread( { asm { fninit; } res2 = sumRange(terms); } );

At any rate, this is a **huge** WTF that should probably be fixed in
druntime. Once I understand it a little better, I'll file a bug report.


Read up a little on what fninit does, etc.  This is IMHO a druntime bug.

Filed as http://d.puremagic.com/issues/show_bug.cgi?id=5847 .


Really a Windows bug that should be fixed in druntime :-)  I know I'm
splitting hairs.  This will be fixed for the next release.



The dmd startup code (actually the C startup code) does an fninit. I never
thought about new thread starts. So, yeah, druntime should do an fninit on
thread creation.


The documentation I've found on fninit seems to indicate it defaults to  
64-bit precision, which means that by default we aren't seeing the benefit  
of D's reals. I'd much prefer 80-bit precision by default.


Floating Point + Threads?

2011-04-15 Thread dsimcha
I'm trying to debug an extremely strange bug whose symptoms appear in a 
std.parallelism example, though I'm not at all sure the root cause is in 
std.parallelism.  The bug report is at 
https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .


Basically, the example in question sums up all the elements of a lazy 
range (actually, std.algorithm.map) in parallel.  It uses 
taskPool.reduce, which divides the summation into work units to be 
executed in parallel.  When executed in parallel, the results of the 
summation are non-deterministic after about the 12th decimal place, even 
though all of the following properties are true:


1.  The work is divided into work units in a deterministic fashion.

2.  Within each work unit, the summation happens in a deterministic order.

3.  The final summation of the results of all the work units is done in 
a deterministic order.


4.  The smallest term in the summation is about 5e-10.  This means the 
difference across runs is about two orders of magnitude smaller than the 
smallest term.  It can't be a concurrency bug where some terms sometimes 
get skipped.


5.  The results for the individual tasks, not just the final summation, 
differ in the low-order bits.  Each task is executed in a single thread.


6.  The rounding mode is apparently the same in all of the threads.

7.  The bug appears even on machines with only one core, as long as the 
number of task pool threads is manually set to a nonzero value. Since it's 
a single-core machine, it can't be a low-level memory model issue.


What could possibly cause such small, non-deterministic differences in 
floating point results, given everything above?  I'm just looking for 
suggestions here, as I don't even know where to start hunting for a bug 
like this.


Re: Floating Point + Threads?

2011-04-15 Thread Andrei Alexandrescu

On 4/15/11 10:22 PM, dsimcha wrote:

I'm trying to debug an extremely strange bug whose symptoms appear in a
std.parallelism example, though I'm not at all sure the root cause is in
std.parallelism. The bug report is at
https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .


Does the scheduling affect the summation order?

Andrei