On 4/16/2011 2:09 AM, Robert Jacques wrote:
On Fri, 15 Apr 2011 23:22:04 -0400, dsimcha <dsim...@yahoo.com> wrote:

I'm trying to debug an extremely strange bug whose symptoms appear in
a std.parallelism example, though I'm not at all sure the root cause
is in std.parallelism. The bug report is at
https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717
.

Basically, the example in question sums up all the elements of a lazy
range (actually, std.algorithm.map) in parallel. It uses
taskPool.reduce, which divides the summation into work units to be
executed in parallel. When executed in parallel, the results of the
summation are non-deterministic after about the 12th decimal place,
even though all of the following properties are true:

1. The work is divided into work units in a deterministic fashion.

2. Within each work unit, the summation happens in a deterministic order.

3. The final summation of the results of all the work units is done in
a deterministic order.

4. The smallest term in the summation is about 5e-10. This means the
difference across runs is about two orders of magnitude smaller than
the smallest term. It can't be a concurrency bug where some terms
sometimes get skipped.

5. The results for the individual tasks, not just the final summation,
differ in the low-order bits. Each task is executed in a single thread.

6. The rounding mode is apparently the same in all of the threads.

7. The bug appears even on machines with only one core, as long as the
number of task pool threads is manually set to >0. Since it's a
single-core machine, it can't be a low-level memory model issue.

What could possibly cause such small, non-deterministic differences in
floating point results, given everything above? I'm just looking for
suggestions here, as I don't even know where to start hunting for a
bug like this.
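
For concreteness, here's a minimal stand-in for the example (the real
code is in the linked issue; the series and bounds here are
illustrative):

    import std.algorithm : map;
    import std.parallelism : taskPool;
    import std.range : iota;
    import std.stdio : writefln;

    void main()
    {
        // Illustrative stand-in for the example in the issue: sum the
        // terms of a convergent series in parallel. taskPool.reduce
        // splits the range into work units and sums each in a task.
        auto terms = map!(i => 1.0 / (i * i))(iota(1.0, 50_000.0));
        immutable total = taskPool.reduce!"a + b"(terms);
        writefln("%.17g", total); // low-order bits differ across runs
    }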

Well, on one hand floating point math is not associative, and running
sums have many known issues (I'd recommend looking up Kahan summation).
On the other hand, the results should then be different in a repeatable
way, not different from run to run.
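
For reference, Kahan summation is just a compensated running sum; a
quick sketch, not tied to your example:

    // Compensated (Kahan) summation: carry the low-order bits lost by
    // each addition and feed them back into the next one.
    double kahanSum(const(double)[] xs)
    {
        double sum = 0.0;
        double c = 0.0;               // running compensation
        foreach (x; xs)
        {
            immutable y = x - c;      // apply last step's correction
            immutable t = sum + y;    // bits of y may be lost here...
            c = (t - sum) - y;        // ...recover them algebraically
            sum = t;
        }
        return sum;
    }

(This only helps if the compiler doesn't reassociate the arithmetic out
from under you.)
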
As for suggestions? First and foremost, you should always add small to
large, so try using iota(n-1, -1, -1) instead of iota(n). Not only
should the answer be more accurate, but if the run-to-run variation
shrinks, you have a good idea of where the problem is. I'd also try
isolating your implementation's numerics from the underlying
concurrency, i.e. use a task pool of 1 and don't let the host thread
join it, so the entire job is done by one worker. The other thing to
try is isolating/removing map and iota from the equation.
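
Something along these lines, assuming a dedicated pool so the host
thread stays out of it (names and sizes are illustrative):

    import std.parallelism : TaskPool;
    import std.stdio : writefln;

    void main()
    {
        // A dedicated single-worker pool: the host thread only submits
        // and blocks, so the whole job runs in one known worker thread.
        auto pool = new TaskPool(1);
        scope(exit) pool.finish();

        // A plain array takes map and iota out of the equation; summing
        // it in reverse would add small to large instead.
        auto terms = new double[50_000];
        foreach (i, ref t; terms)
            t = 1.0 / ((i + 1.0) * (i + 1.0));

        immutable total = pool.reduce!"a + b"(terms);
        writefln("%.17g", total);
    }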

Right. For this example, though, treating floating point math as if it behaved like regular math is a good enough approximation. The issue isn't that the results aren't reasonably accurate. Furthermore, the results will always change slightly depending on how many work units you have. (I even warn in the documentation that floating point addition is not associative, though it is approximately associative in the well-behaved cases.)
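
For anyone following along, the non-associativity is easy to see with a
tiny illustration (caveat: D may evaluate intermediates at real 80-bit
precision, which can mask or shift the difference):

    import std.stdio : writefln;

    void main()
    {
        double a = 0.1, b = 0.2, c = 0.3;
        // With strict IEEE double arithmetic these differ in the last
        // bit or two; with extended-precision intermediates they may not.
        writefln("%.17g", (a + b) + c);
        writefln("%.17g", a + (b + c));
    }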

My only concern is whether this non-determinism represents some deep underlying bug. For a given work unit allocation (work unit allocations are deterministic and only change when the number of threads changes or they're changed explicitly), I can't figure out how scheduling could change the results at all. If I could be sure that it wasn't a symptom of an underlying bug in std.parallelism, I wouldn't care about this small amount of numerical fuzz. Floating point math is always inexact and parallel summation by its nature can't be made to give the exact same results as serial summation.
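
The obvious next step for me is to recompute each work unit's partial sum serially, in a fixed order, and compare against the per-task results from point 5. If the serial recomputation is stable but the in-task results wobble, the fuzz is being introduced inside a task. A rough sketch (the work unit size is a placeholder; it has to match whatever the pool actually uses):

    import std.algorithm : map, min;
    import std.range : iota;
    import std.stdio : writefln;

    void main()
    {
        // Placeholder: must match the pool's actual work unit division.
        enum workUnitSize = 4_096;
        auto terms = map!(i => 1.0 / (i * i))(iota(1.0, 50_000.0));

        size_t unit = 0;
        for (size_t start = 0; start < terms.length; start += workUnitSize, ++unit)
        {
            double partial = 0.0;
            foreach (j; start .. min(start + workUnitSize, terms.length))
                partial += terms[j];
            writefln("work unit %s: %.17g", unit, partial);
        }
    }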
