On Sat, Apr 20, 2013 at 10:05 PM, Ted Dunning <[email protected]> wrote:

> On Sat, Apr 20, 2013 at 9:45 AM, Dan Filimon <[email protected]>
> wrote:
>
> > Here is a preliminary version:
> > https://reviews.apache.org/r/10669/diff/#index_header
> >
> > Regarding the parallelization, the results would be valid as long as the
> > aggregating function is both commutative and associative (which we can
> > now check), but adding the parallelization here might be too much work.
> >
>
> Actually, associativity and commutativity don't make threads safe.
>

I imagined the way it works is the most basic one: each thread is
responsible for some chunk of the vector. Then no two threads access the
same elements of the vector, but with n threads you need to combine the n
partial results at the end.

If the operation on the vector is an aggregate + combine, I think that
associativity and commutativity of the aggregating function are enough,
because they guarantee that we can process the chunks in any order and put
them together at the end, getting the same value as if we had just gone
through the elements sequentially.

> The problem arises from NUMA memory models.  The problem is that different
> threads can easily wind up accessing old versions of memory that other
> threads are working on.
>

Yeah, I agree that dealing with writes from other threads can cause
problems, which is why I think that, when possible, the threads should not
share the data they're working on, and definitely should not read from and
write to it concurrently.

> One of the most important things that synchronization boundaries do is to
> force a partial ordering on memory operations as seen by all threads.
>  Different threads may see different orderings, but all of the orderings
> will be consistent with the weak ordering specified by the synch
> boundaries.
>
> It is possible to do lockless thread-safe code, but it requires very clever
> design.  Our matrices and vectors definitely are not designed that way.
>

I think that for parallel operations (at least on Vectors), if we were to
do this in a multithreaded way, we'd chunk up the vector (maybe with a
special iterator?) and do whatever we need to by only reading from the
initial vector and putting intermediate results in something like an
OrderedIntDoubleMapping. The resulting mappings could then be merged to
form the final vector.
This would be pretty similar to what we do now, since our plus/minus/...
don't mutate the original vector (though we'd do one more copy when merging
the chunks).

Would this not work?
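
Roughly something like this (untested; I'm using a plain TreeMap where the
real thing would be an OrderedIntDoubleMapping, and a trivial x -> 2x
element-wise operation as a placeholder):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class ChunkedVectorOp {
  // Applies x -> 2 * x to every element without mutating the input.
  // Each worker only reads its own index range of the source vector and
  // writes its results into a private map (a TreeMap here, standing in for
  // an OrderedIntDoubleMapping); the maps are merged into the result at the end.
  public static Vector timesTwo(final Vector v, int numThreads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    int chunkSize = (v.size() + numThreads - 1) / numThreads;
    List<Future<Map<Integer, Double>>> chunks = new ArrayList<Future<Map<Integer, Double>>>();
    for (int t = 0; t < numThreads; t++) {
      final int start = t * chunkSize;
      final int end = Math.min(v.size(), start + chunkSize);
      chunks.add(pool.submit(new Callable<Map<Integer, Double>>() {
        @Override
        public Map<Integer, Double> call() {
          Map<Integer, Double> partial = new TreeMap<Integer, Double>();
          for (int i = start; i < end; i++) {
            partial.put(i, 2 * v.getQuick(i)); // read-only on the shared vector
          }
          return partial;
        }
      }));
    }
    // The one extra copy mentioned above: merge the per-chunk results into a new vector.
    Vector result = new DenseVector(v.size());
    for (Future<Map<Integer, Double>> chunk : chunks) {
      for (Map.Entry<Integer, Double> e : chunk.get().entrySet()) {
        result.setQuick(e.getKey(), e.getValue());
      }
    }
    pool.shutdown();
    return result;
  }
}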
