Re: Floating Point + Threads?
Sean Kelly wrote: On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote: On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright newshou...@digitalmars.com wrote: The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation. The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default. There is no option to set 80-bit precision via the FPU control word. ??? Yes there is.

enum PrecisionControl : short {
    PRECISION80 = 0x300,
    PRECISION64 = 0x200,
    PRECISION32 = 0x000
};

/** Set the number of bits of precision used by 'real'.
 *
 * Returns: the old precision.
 * This is not supported on all platforms.
 */
PrecisionControl reduceRealPrecision(PrecisionControl prec) {
    version(D_InlineAsm_X86) {
        short cont;
        asm {
            fstcw cont;
            mov CX, cont;
            mov AX, cont;
            and EAX, 0x0300; // Form the return value
            and CX, 0xFCFF;
            or CX, prec;
            mov cont, CX;
            fldcw cont;
        }
    } else {
        assert(0, "Not yet supported");
    }
}
Re: Floating Point + Threads?
On Apr 20, 2011, at 5:06 AM, Don wrote: Sean Kelly wrote: On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote: On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright newshou...@digitalmars.com wrote: The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation. The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default. There is no option to set 80-bit precision via the FPU control word. ??? Yes there is. enum PrecisionControl : short { PRECISION80 = 0x300, PRECISION64 = 0x200, PRECISION32 = 0x000 }; So has Intel deprecated 80-bit FPU support? Why do the docs for this say that 64-bit is the highest precision? And more importantly, does this mean that we should be setting the PC field explicitly instead of relying on fninit? The docs say that fninit initializes to 64-bit precision. Or is that inaccurate as well?
Re: Floating Point + Threads?
Sean Kelly s...@invisibleduck.org wrote in message news:mailman.3597.1303316625.4748.digitalmar...@puremagic.com... On Apr 20, 2011, at 5:06 AM, Don wrote: Sean Kelly wrote: On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote: On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright newshou...@digitalmars.com wrote: The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation. The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default. There is no option to set 80-bit precision via the FPU control word. ??? Yes there is. enum PrecisionControl : short { PRECISION80 = 0x300, PRECISION64 = 0x200, PRECISION32 = 0x000 }; So has Intel deprecated 80-bit FPU support? Why do the docs for this say that 64-bit is the highest precision? And more importantly, does this mean that we should be setting the PC field explicitly instead of relying on fninit? The docs say that fninit initializes to 64-bit precision. Or is that inaccurate as well? You misread the docs, it's talking about precision, which is just the size of the mantissa, not the actual full size of the floating point data. I.e.:

80-bit float = 64-bit precision
64-bit float = 53-bit precision
32-bit float = 24-bit precision
Re: Floating Point + Threads?
On Apr 20, 2011, at 10:46 AM, JimBob wrote: Sean Kelly s...@invisibleduck.org wrote in message news:mailman.3597.1303316625.4748.digitalmar...@puremagic.com... On Apr 20, 2011, at 5:06 AM, Don wrote: Sean Kelly wrote: On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote: On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright newshou...@digitalmars.com wrote: The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation. The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default. There is no option to set 80-bit precision via the FPU control word. ??? Yes there is. enum PrecisionControl : short { PRECISION80 = 0x300, PRECISION64 = 0x200, PRECISION32 = 0x000 }; So has Intel deprecated 80-bit FPU support? Why do the docs for this say that 64-bit is the highest precision? And more importantly, does this mean that we should be setting the PC field explicitly instead of relying on fninit? The docs say that fninit initializes to 64-bit precision. Or is that inaccurate as well? You misread the docs, it's talking about precision, which is just the size of the mantissa, not the actual full size of the floating point data. I.e. 80-bit float = 64-bit precision, 64-bit float = 53-bit precision, 32-bit float = 24-bit precision. Oops, you're right. So to summarize: fninit does what we want because it sets 64-bit precision, which is effectively 80-bit mode. Is this correct?
Re: Floating Point + Threads?
Sean Kelly wrote: On Apr 20, 2011, at 10:46 AM, JimBob wrote: Sean Kelly s...@invisibleduck.org wrote in message news:mailman.3597.1303316625.4748.digitalmar...@puremagic.com... On Apr 20, 2011, at 5:06 AM, Don wrote: Sean Kelly wrote: On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote: On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright newshou...@digitalmars.com wrote: The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation. The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default. There is no option to set 80-bit precision via the FPU control word. ??? Yes there is. enum PrecisionControl : short { PRECISION80 = 0x300, PRECISION64 = 0x200, PRECISION32 = 0x000 }; So has Intel deprecated 80-bit FPU support? Why do the docs for this say that 64-bit is the highest precision? And more importantly, does this mean that we should be setting the PC field explicitly instead of relying on fninit? The docs say that fninit initializes to 64-bit precision. Or is that inaccurate as well? You misread the docs, it's talking about precision, which is just the size of the mantissa, not the actual full size of the floating point data. I.e. 80-bit float = 64-bit precision, 64-bit float = 53-bit precision, 32-bit float = 24-bit precision. Oops, you're right. So to summarize: fninit does what we want because it sets 64-bit precision, which is effectively 80-bit mode. Is this correct? Yes.
Re: Floating Point + Threads?
On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote: On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright newshou...@digitalmars.com wrote: The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation. The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default. There is no option to set 80-bit precision via the FPU control word. Section 8.1.5.2 of the Intel 64 SDM says the following: The precision-control (PC) field (bits 8 and 9 of the x87 FPU control word) determines the precision (64, 53, or 24 bits) of floating-point calculations made by the x87 FPU (see Table 8-2). The default precision is double extended precision, which uses the full 64-bit significand available with the double extended-precision floating-point format of the x87 FPU data registers. This setting is best suited for most applications, because it allows applications to take full advantage of the maximum precision available with the x87 FPU data registers. So it sounds like finit/fninit does what we want.
Re: Floating Point + Threads?
On Tue, 19 Apr 2011 14:18:46 -0400, Sean Kelly s...@invisibleduck.org wrote: On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote: On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright newshou...@digitalmars.com wrote: The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation. The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default. There is no option to set 80-bit precision via the FPU control word. Section 8.1.5.2 of the Intel 64 SDM says the following: The precision-control (PC) field (bits 8 and 9 of the x87 FPU control word) determines the precision (64, 53, or 24 bits) of floating-point calculations made by the x87 FPU (see Table 8-2). The default precision is double extended precision, which uses the full 64-bit significand available with the double extended-precision floating-point format of the x87 FPU data registers. This setting is best suited for most applications, because it allows applications to take full advantage of the maximum precision available with the x87 FPU data registers. So it sounds like finit/fninit does what we want. Yes, that sounds right. Thanks for clarifying.
Re: Floating Point + Threads?
On Fri, 15 Apr 2011 23:22:04 -0400, dsimcha dsim...@yahoo.com wrote: I'm trying to debug an extremely strange bug whose symptoms appear in a std.parallelism example, though I'm not at all sure the root cause is in std.parallelism. The bug report is at https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 . Basically, the example in question sums up all the elements of a lazy range (actually, std.algorithm.map) in parallel. It uses taskPool.reduce, which divides the summation into work units to be executed in parallel. When executed in parallel, the results of the summation are non-deterministic after about the 12th decimal place, even though all of the following properties are true: 1. The work is divided into work units in a deterministic fashion. 2. Within each work unit, the summation happens in a deterministic order. 3. The final summation of the results of all the work units is done in a deterministic order. 4. The smallest term in the summation is about 5e-10. This means the difference across runs is about two orders of magnitude smaller than the smallest term. It can't be a concurrency bug where some terms sometimes get skipped. 5. The results for the individual tasks, not just the final summation, differ in the low-order bits. Each task is executed in a single thread. 6. The rounding mode is apparently the same in all of the threads. 7. The bug appears even on machines with only one core, as long as the number of task pool threads is manually set to 0. Since it's a single core machine, it can't be a low level memory model issue. What could possibly cause such small, non-deterministic differences in floating point results, given everything above? I'm just looking for suggestions here, as I don't even know where to start hunting for a bug like this. Well, on one hand floating point math is not associative, and running sums have many known issues (I'd recommend looking up Kahan summation). On the other hand, it should be repeatably different.
As for suggestions? First and foremost, you should always add small to large, so try using iota(n-1,-1,-1) instead of iota(n). Not only should the answer be better, but if your error rate goes down, you have a good idea of where the problem is. I'd also try isolating your implementation's numerics from the underlying concurrency, i.e. use a task pool of 1 and don't let the host thread join it, so the entire job is done by one worker. The other thing to try is isolating/removing map and iota from the equation.
Re: Floating Point + Threads?
On 4/15/2011 8:40 PM, Andrei Alexandrescu wrote: On 4/15/11 10:22 PM, dsimcha wrote: I'm trying to debug an extremely strange bug whose symptoms appear in a std.parallelism example, though I'm not at all sure the root cause is in std.parallelism. The bug report is at https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 . Does the scheduling affect the summation order? That's a good thought. FP addition results can differ dramatically depending on associativity.
Re: Floating Point + Threads?
On 16-apr-11, at 09:41, Walter Bright wrote: On 4/15/2011 8:40 PM, Andrei Alexandrescu wrote: On 4/15/11 10:22 PM, dsimcha wrote: I'm trying to debug an extremely strange bug whose symptoms appear in a std.parallelism example, though I'm not at all sure the root cause is in std.parallelism. The bug report is at https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 . Does the scheduling affect the summation order? That's a good thought. FP addition results can differ dramatically depending on associativity. Yes, one can avoid this by using a tree algorithm with a fixed blocksize; then the results will be the same both in the single-threaded and parallel cases. Normally one uses atomic summation though. In blip I spent quite a bit of thought on tree-like algorithms and their parallelization, exactly because they parallelize well and the result is independent of the parallelization. Fawzi
Re: Floating Point + Threads?
On 16-apr-11, at 05:22, dsimcha wrote: I'm trying to debug an extremely strange bug whose symptoms appear in a std.parallelism example, though I'm not at all sure the root cause is in std.parallelism. The bug report is at https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 . Basically, the example in question sums up all the elements of a lazy range (actually, std.algorithm.map) in parallel. It uses taskPool.reduce, which divides the summation into work units to be executed in parallel. When executed in parallel, the results of the summation are non-deterministic after about the 12th decimal place, even though all of the following properties are true: 1. The work is divided into work units in a deterministic fashion. 2. Within each work unit, the summation happens in a deterministic order. 3. The final summation of the results of all the work units is done in a deterministic order. 4. The smallest term in the summation is about 5e-10. This means the difference across runs is about two orders of magnitude smaller than the smallest term. It can't be a concurrency bug where some terms sometimes get skipped. 5. The results for the individual tasks, not just the final summation, differ in the low-order bits. Each task is executed in a single thread. 6. The rounding mode is apparently the same in all of the threads. 7. The bug appears even on machines with only one core, as long as the number of task pool threads is manually set to 0. Since it's a single core machine, it can't be a low level memory model issue. What could possibly cause such small, non-deterministic differences in floating point results, given everything above? I'm just looking for suggestions here, as I don't even know where to start hunting for a bug like this. It might be due to a context switch between threads, which might push a double out of the higher-precision 80-bit FPU registers and lose the extra precision. SSE, or float, should not have these problems.
gcc has an option to always store the result in memory and avoid the extra precision (-ffloat-store). Maybe having such an option in dmd to debug such issues would be a nice thing. Fawzi
Re: Floating Point + Threads?
== Quote from Walter Bright (newshou...@digitalmars.com)'s article On 4/15/2011 8:40 PM, Andrei Alexandrescu wrote: On 4/15/11 10:22 PM, dsimcha wrote: I'm trying to debug an extremely strange bug whose symptoms appear in a std.parallelism example, though I'm not at all sure the root cause is in std.parallelism. The bug report is at https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 . Does the scheduling affect the summation order? That's a good thought. FP addition results can differ dramatically depending on associativity. And not to forget optimisations too. ;)
Re: Floating Point + Threads?
On 4/15/2011 11:40 PM, Andrei Alexandrescu wrote: On 4/15/11 10:22 PM, dsimcha wrote: I'm trying to debug an extremely strange bug whose symptoms appear in a std.parallelism example, though I'm not at all sure the root cause is in std.parallelism. The bug report is at https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 . Does the scheduling affect the summation order? Andrei No. I realize floating point addition isn't associative, but unless there's some detail I'm forgetting about, the ordering is deterministic within each work unit and the ordering of the final summation is deterministic.
Re: Floating Point + Threads?
On 4/16/2011 2:09 AM, Robert Jacques wrote: On Fri, 15 Apr 2011 23:22:04 -0400, dsimcha dsim...@yahoo.com wrote: I'm trying to debug an extremely strange bug whose symptoms appear in a std.parallelism example, though I'm not at all sure the root cause is in std.parallelism. The bug report is at https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 . Basically, the example in question sums up all the elements of a lazy range (actually, std.algorithm.map) in parallel. It uses taskPool.reduce, which divides the summation into work units to be executed in parallel. When executed in parallel, the results of the summation are non-deterministic after about the 12th decimal place, even though all of the following properties are true: 1. The work is divided into work units in a deterministic fashion. 2. Within each work unit, the summation happens in a deterministic order. 3. The final summation of the results of all the work units is done in a deterministic order. 4. The smallest term in the summation is about 5e-10. This means the difference across runs is about two orders of magnitude smaller than the smallest term. It can't be a concurrency bug where some terms sometimes get skipped. 5. The results for the individual tasks, not just the final summation, differ in the low-order bits. Each task is executed in a single thread. 6. The rounding mode is apparently the same in all of the threads. 7. The bug appears even on machines with only one core, as long as the number of task pool threads is manually set to 0. Since it's a single core machine, it can't be a low level memory model issue. What could possibly cause such small, non-deterministic differences in floating point results, given everything above? I'm just looking for suggestions here, as I don't even know where to start hunting for a bug like this. Well, on one hand floating point math is not associative, and running sums have many known issues (I'd recommend looking up Kahan summation).
On the other hand, it should be repeatably different. As for suggestions? First and foremost, you should always add small to large, so try using iota(n-1,-1,-1) instead of iota(n). Not only should the answer be better, but if your error rate goes down, you have a good idea of where the problem is. I'd also try isolating your implementation's numerics from the underlying concurrency, i.e. use a task pool of 1 and don't let the host thread join it, so the entire job is done by one worker. The other thing to try is isolating/removing map and iota from the equation. Right. For this example, though, assuming floating point math behaves like regular math is a good enough approximation. The issue isn't that the results aren't reasonably accurate. Furthermore, the results will always change slightly depending on how many work units you have. (I even warn in the documentation that floating point addition is not associative, though it is approximately associative in the well-behaved cases.) My only concern is whether this non-determinism represents some deep underlying bug. For a given work unit allocation (work unit allocations are deterministic and only change when the number of threads changes or they're changed explicitly), I can't figure out how scheduling could change the results at all. If I could be sure that it wasn't a symptom of an underlying bug in std.parallelism, I wouldn't care about this small amount of numerical fuzz. Floating point math is always inexact and parallel summation by its nature can't be made to give the exact same results as serial summation.
Re: Floating Point + Threads?
On 4/16/2011 10:11 AM, dsimcha wrote: My only concern is whether this non-determinism represents some deep underlying bug. For a given work unit allocation (work unit allocations are deterministic and only change when the number of threads changes or they're changed explicitly), I can't figure out how scheduling could change the results at all. If I could be sure that it wasn't a symptom of an underlying bug in std.parallelism, I wouldn't care about this small amount of numerical fuzz. Floating point math is always inexact and parallel summation by its nature can't be made to give the exact same results as serial summation. Ok, it's definitely **NOT** a bug in std.parallelism. Here's a reduced test case that only uses core.thread, not std.parallelism. All it does is sum an array using std.algorithm.reduce from the main thread, then start a new thread to do the same thing and compare answers. At the beginning of the summation function the rounding mode is printed to verify that it's the same for both threads. The two threads get slightly different answers. Just to thoroughly rule out a concurrency bug, I didn't even let the two threads execute concurrently. The main thread produces its result, then starts and immediately joins the second thread.

import std.algorithm, core.thread, std.stdio, core.stdc.fenv;

real sumRange(const(real)[] range) {
    writeln("Rounding mode: ", fegetround); // 0 from both threads.
    return reduce!"a + b"(range);
}

void main() {
    immutable n = 1_000_000;
    immutable delta = 1.0 / n;
    auto terms = new real[1_000_000];
    foreach(i, ref term; terms) {
        immutable x = ( i - 0.5 ) * delta;
        term = delta / ( 1.0 + x * x ) * 1;
    }

    immutable res1 = sumRange(terms);
    writefln("%.19f", res1);

    real res2;
    auto t = new Thread( { res2 = sumRange(terms); } );
    t.start();
    t.join();
    writefln("%.19f", res2);
}

Output:

Rounding mode: 0
0.7853986633972191094
Rounding mode: 0
0.7853986633972437348
Re: Floating Point + Threads?
On 4/16/2011 10:55 AM, Andrei Alexandrescu wrote: On 4/16/11 9:52 AM, dsimcha wrote: Output: Rounding mode: 0 0.7853986633972191094 Rounding mode: 0 0.7853986633972437348 I think at this precision the difference may be in random bits. Probably you don't need to worry about it. Andrei random bits? I am fully aware that these low order bits are numerical fuzz and are meaningless from a practical perspective. I am only concerned because I thought these bits are supposed to be deterministic even if they're meaningless. Now that I've ruled out a bug in std.parallelism, I'm wondering if it's a bug in druntime or DMD.
Re: Floating Point + Threads?
On 4/16/11 9:52 AM, dsimcha wrote: Output: Rounding mode: 0 0.7853986633972191094 Rounding mode: 0 0.7853986633972437348 I think at this precision the difference may be in random bits. Probably you don't need to worry about it. Andrei
Re: Floating Point + Threads?
== Quote from dsimcha (dsim...@yahoo.com)'s article On 4/16/2011 10:11 AM, dsimcha wrote: Output: Rounding mode: 0 0.7853986633972191094 Rounding mode: 0 0.7853986633972437348 This is not something I can replicate on my workstation.
Re: Floating Point + Threads?
On 4/16/2011 11:16 AM, Iain Buclaw wrote: == Quote from dsimcha (dsim...@yahoo.com)'s article On 4/16/2011 10:11 AM, dsimcha wrote: Output: Rounding mode: 0 0.7853986633972191094 Rounding mode: 0 0.7853986633972437348 This is not something I can replicate on my workstation. Interesting. Since I know you're a Linux user, I fired up my Ubuntu VM and tried out my test case. I can't reproduce it either on Linux, only on Windows.
Re: Floating Point + Threads?
dsimcha dsim...@yahoo.com wrote in message news:iocalv$2h58$1...@digitalmars.com... Output: Rounding mode: 0 0.7853986633972191094 Rounding mode: 0 0.7853986633972437348 Could be something somewhere is getting truncated from real to double, which would mean 12 fewer bits of mantissa. Maybe the FPU is set to lower precision in one of the threads?
Re: Floating Point + Threads?
On 4/16/11 9:59 AM, dsimcha wrote: On 4/16/2011 10:55 AM, Andrei Alexandrescu wrote: On 4/16/11 9:52 AM, dsimcha wrote: Output: Rounding mode: 0 0.7853986633972191094 Rounding mode: 0 0.7853986633972437348 I think at this precision the difference may be in random bits. Probably you don't need to worry about it. Andrei random bits? I am fully aware that these low order bits are numerical fuzz and are meaningless from a practical perspective. I am only concerned because I thought these bits are supposed to be deterministic even if they're meaningless. Now that I've ruled out a bug in std.parallelism, I'm wondering if it's a bug in druntime or DMD. I seem to remember that some of the last bits printed in such a result are essentially arbitrary. I forgot what could cause this. Andrei
Re: Floating Point + Threads?
On 4/16/2011 6:46 AM, Iain Buclaw wrote: == Quote from Walter Bright (newshou...@digitalmars.com)'s article That's a good thought. FP addition results can differ dramatically depending on associativity. And not to forget optimisations too. ;) The dmd optimizer is careful not to reorder evaluation in such a way as to change the results.
Re: Floating Point + Threads?
Could be something somewhere is getting truncated from real to double, which would mean 12 fewer bits of mantissa. Maybe the FPU is set to lower precision in one of the threads? Yes indeed, this is a _Windows_ bug. I have experienced this in Windows before: the main thread's FPU control register is initialized to lower FPU precision (64-bit doubles) by default by the OS, presumably to make FP calculations faster. However, when you start a new thread, the FPU will use the whole 80 bits for computation because, curiously, the FPU is not reconfigured for those. Suggested fix: Add asm { fninit; } to the beginning of your main function, and the difference between the two will be gone. This would be a compatibility issue between DMD and Windows which effectively disables the real data type. You might want to file a bug report for druntime if my suggested fix works. (This would imply that the real type was basically identical to the double type in Windows all along!)
Re: Floating Point + Threads?
On 4/16/2011 2:15 PM, Timon Gehr wrote: Could be something somewhere is getting truncated from real to double, which would mean 12 fewer bits of mantissa. Maybe the FPU is set to lower precision in one of the threads? Yes indeed, this is a _Windows_ bug. I have experienced this in Windows before: the main thread's FPU control register is initialized to lower FPU precision (64-bit doubles) by default by the OS, presumably to make FP calculations faster. However, when you start a new thread, the FPU will use the whole 80 bits for computation because, curiously, the FPU is not reconfigured for those. Suggested fix: Add asm { fninit; } to the beginning of your main function, and the difference between the two will be gone. This would be a compatibility issue between DMD and Windows which effectively disables the real data type. You might want to file a bug report for druntime if my suggested fix works. (This would imply that the real type was basically identical to the double type in Windows all along!) Close: If I add this instruction to the function for the new thread, the difference goes away. The relevant statement is:

auto t = new Thread( { asm { fninit; } res2 = sumRange(terms); } );

At any rate, this is a **huge** WTF that should probably be fixed in druntime. Once I understand it a little better, I'll file a bug report.
Re: Floating Point + Threads?
== Quote from Walter Bright (newshou...@digitalmars.com)'s article On 4/16/2011 6:46 AM, Iain Buclaw wrote: == Quote from Walter Bright (newshou...@digitalmars.com)'s article That's a good thought. FP addition results can differ dramatically depending on associativity. And not to forget optimisations too. ;) The dmd optimizer is careful not to reorder evaluation in such a way as to change the results. And so it rightly shouldn't! I was thinking more of a case of FPU precision rather than ordering: as in you get a different result computing on SSE in double precision mode on the one hand, and by computing on x87 in double precision then writing to a double variable in memory. Classic example (which could either be a bug or non-bug depending on your POV):

void test(double x, double y) {
    double y2 = x + 1.0;
    assert(y == y2); // triggers with -O
}

void main() {
    double x = .012;
    double y = x + 1.0;
    test(x, y);
}
Re: Floating Point + Threads?
On 4/16/2011 2:24 PM, dsimcha wrote: On 4/16/2011 2:15 PM, Timon Gehr wrote: Could be something somewhere is getting truncated from real to double, which would mean 12 fewer bits of mantissa. Maybe the FPU is set to lower precision in one of the threads? Yes indeed, this is a _Windows_ bug. I have experienced this in Windows before: the main thread's FPU control register is initialized to lower FPU precision (64-bit doubles) by default by the OS, presumably to make FP calculations faster. However, when you start a new thread, the FPU will use the whole 80 bits for computation because, curiously, the FPU is not reconfigured for those. Suggested fix: Add asm { fninit; } to the beginning of your main function, and the difference between the two will be gone. This would be a compatibility issue between DMD and Windows which effectively disables the real data type. You might want to file a bug report for druntime if my suggested fix works. (This would imply that the real type was basically identical to the double type in Windows all along!) Close: If I add this instruction to the function for the new thread, the difference goes away. The relevant statement is: auto t = new Thread( { asm { fninit; } res2 = sumRange(terms); } ); At any rate, this is a **huge** WTF that should probably be fixed in druntime. Once I understand it a little better, I'll file a bug report. Read up a little on what fninit does, etc. This is IMHO a druntime bug. Filed as http://d.puremagic.com/issues/show_bug.cgi?id=5847 .
Re: Floating Point + Threads?
On 4/16/2011 9:52 AM, Andrei Alexandrescu wrote: I seem to remember that essentially some of the last bits printed in such a result are essentially arbitrary. I forgot what could cause this. To see what the exact bits are, print using %A format. In any case, floating point bits are not random. They are completely deterministic.
Re: Floating Point + Threads?
On 4/16/2011 11:43 AM, Iain Buclaw wrote: I was thinking more of a case of FPU precision rather than ordering: as in you get a different result computing on SSE in double precision mode on the one hand, and by computing on x87 in double precision then writing to a double variable in memory. You're right on that one.
Re: Floating Point + Threads?
On Apr 16, 2011, at 11:43 AM, dsimcha wrote: Close: If I add this instruction to the function for the new thread, the difference goes away. The relevant statement is: auto t = new Thread( { asm { fninit; } res2 = sumRange(terms); } ); At any rate, this is a **huge** WTF that should probably be fixed in druntime. Once I understand it a little better, I'll file a bug report. Read up a little on what fninit does, etc. This is IMHO a druntime bug. Filed as http://d.puremagic.com/issues/show_bug.cgi?id=5847 . Really a Windows bug that should be fixed in druntime :-) I know I'm splitting hairs. This will be fixed for the next release.
Re: Floating Point + Threads?
On 4/16/2011 11:51 AM, Sean Kelly wrote: On Apr 16, 2011, at 11:43 AM, dsimcha wrote: Close: If I add this instruction to the function for the new thread, the difference goes away. The relevant statement is: auto t = new Thread( { asm { fninit; } res2 = sumRange(terms); } ); At any rate, this is a **huge** WTF that should probably be fixed in druntime. Once I understand it a little better, I'll file a bug report. Read up a little on what fninit does, etc. This is IMHO a druntime bug. Filed as http://d.puremagic.com/issues/show_bug.cgi?id=5847 . Really a Windows bug that should be fixed in druntime :-) I know I'm splitting hairs. This will be fixed for the next release.

The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.
Re: Floating Point + Threads?
Walter: The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.

My congratulations to all the (mostly two) people involved in finding this bug and its causes :-) I'd like to see this module in Phobos. Bye, bearophile
Re: Floating Point + Threads?
On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright newshou...@digitalmars.com wrote: On 4/16/2011 11:51 AM, Sean Kelly wrote: On Apr 16, 2011, at 11:43 AM, dsimcha wrote: Close: If I add this instruction to the function for the new thread, the difference goes away. The relevant statement is: auto t = new Thread( { asm { fninit; } res2 = sumRange(terms); } ); At any rate, this is a **huge** WTF that should probably be fixed in druntime. Once I understand it a little better, I'll file a bug report. Read up a little on what fninit does, etc. This is IMHO a druntime bug. Filed as http://d.puremagic.com/issues/show_bug.cgi?id=5847 . Really a Windows bug that should be fixed in druntime :-) I know I'm splitting hairs. This will be fixed for the next release. The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.

The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.
Floating Point + Threads?
I'm trying to debug an extremely strange bug whose symptoms appear in a std.parallelism example, though I'm not at all sure the root cause is in std.parallelism. The bug report is at https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .

Basically, the example in question sums up all the elements of a lazy range (actually, std.algorithm.map) in parallel. It uses taskPool.reduce, which divides the summation into work units to be executed in parallel. When executed in parallel, the results of the summation are non-deterministic after about the 12th decimal place, even though all of the following properties are true:

1. The work is divided into work units in a deterministic fashion.

2. Within each work unit, the summation happens in a deterministic order.

3. The final summation of the results of all the work units is done in a deterministic order.

4. The smallest term in the summation is about 5e-10. This means the difference across runs is about two orders of magnitude smaller than the smallest term. It can't be a concurrency bug where some terms sometimes get skipped.

5. The results for the individual tasks, not just the final summation, differ in the low-order bits. Each task is executed in a single thread.

6. The rounding mode is apparently the same in all of the threads.

7. The bug appears even on machines with only one core, as long as the number of task pool threads is manually set to a value greater than 0. Since it's a single-core machine, it can't be a low-level memory model issue.

What could possibly cause such small, non-deterministic differences in floating point results, given everything above? I'm just looking for suggestions here, as I don't even know where to start hunting for a bug like this.
Re: Floating Point + Threads?
On 4/15/11 10:22 PM, dsimcha wrote: I'm trying to debug an extremely strange bug whose symptoms appear in a std.parallelism example, though I'm not at all sure the root cause is in std.parallelism. The bug report is at https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 . Does the scheduling affect the summation order? Andrei