Response from Max follows (for some reason he was getting bounced by the
mailing list).


On Sun, Mar 16, 2014 at 8:55 PM, Max Hutchinson <maxhu...@gmail.com> wrote:

> tl;dr it depends on the DAG, but improved ILP is likely possible (if
> difficult) and there could be room for multi-core parallelism as well.
>
> As I understand it, we're talking about a long computation applied to
> short input vectors.  If the computation can be applied to many input
> vectors at once, independent of each other, then all levels of parallelism
> (multiple instructions, multiple cores, multiple sockets, multiple nodes)
> can be used.  This is data-parallelism, which is great! However, it doesn't
> sound like this is the case.
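The fully data-parallel case described above can be sketched as follows (the function and inputs are hypothetical stand-ins, not anything from the actual workload):

```python
# Hypothetical sketch of the data-parallel case: the same long
# computation applied to many short input vectors, each independent of
# the others, so they can be farmed out to workers with no coordination.
from concurrent.futures import ThreadPoolExecutor

def long_computation(v):
    # stand-in for the generated expression evaluation
    return sum(t * t for t in v)

# many independent short input vectors
vectors = [[float(i), float(i) + 1.0] for i in range(4)]

with ThreadPoolExecutor() as pool:
    out = list(pool.map(long_computation, vectors))
```

Because every call is independent, the same pattern extends to processes, sockets, or nodes — which is exactly the easy case that, per the above, apparently does not apply here.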
>
> It sounds like you're thinking of building a DAG of these CSEs and trying
> to use task-parallelism over independent parts of it (automatically using
> sympy or theano or what have you).  The tension here is going to be between
> locality and parallelism: how much compute hardware can you spread your
> data across without losing the nice cache performance that your small input
> vectors gain you.  I'd bet that going off-socket is way too wide.  Modern
> multi-core architectures have core-local L2 and L1 caches, so if your input
> data fits nicely into L2 and your DAG isn't really local, you probably
> won't get anything out of multiple cores.  Your last stand is single-core
> parallelism (instruction-level parallelism:
> http://en.wikipedia.org/wiki/Instruction-level_parallelism),
> which sympy et al may or may not be well equipped to influence.
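For concreteness, the kind of CSE DAG being discussed can be produced with sympy's own `cse`; the expressions below are made up for illustration:

```python
# Illustration (assumes sympy): sympy.cse factors shared subexpressions
# out of a list of expressions. The (symbol, subexpression) replacement
# list plus the reduced expressions form the DAG whose independent
# parts are the candidates for task parallelism.
import sympy as sp

x, y = sp.symbols("x y")
exprs = [sp.sin(x + y) + (x + y)**2, sp.cos(x + y) * (x + y)**2]
replacements, reduced = sp.cse(exprs)
# replacements: e.g. [(x0, x + y), (x1, x0**2)]
# reduced:      e.g. [sin(x0) + x1, cos(x0)*x1]
# sin(x0) and cos(x0) each depend only on x0, not on each other, so
# they form a (small) independent chunk.
```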
>
> To start, I'd recommend that you take a look at your DAGs and try to
> figure out how large the independent chunks are.  Then, estimate the amount
> of instruction level parallelism when you run in 'serial' (which you can do
> with flop-counting).  If your demonstrated ILP is less than your
> independent chunk size, then at least improved ILP should be possible.
>  Automatically splitting up these DAGs and expressing them in a low-level
> enough way to affect ILP is a considerable task, though.
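That estimate is simple arithmetic; a back-of-envelope version might look like the following, where every number is an invented placeholder:

```python
# Back-of-envelope ILP estimate with made-up numbers: demonstrated ILP
# is roughly useful flops retired per clock cycle during a serial run.
flop_count = 1.2e6    # from flop-counting the DAG (placeholder)
runtime_s = 1.0e-3    # measured serial runtime (placeholder)
clock_hz = 3.0e9      # core clock frequency (placeholder)

demonstrated_ilp = flop_count / (runtime_s * clock_hz)  # flops/cycle

# widest set of mutually independent CSEs in the DAG (placeholder)
independent_chunk_size = 4

# if the hardware retires fewer independent ops per cycle than the DAG
# offers, better scheduling of the emitted code could raise ILP
ilp_headroom = demonstrated_ilp < independent_chunk_size
```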
>
> To see if multi-core parallelism is worth it, you need to estimate how
> many extra L3 loads you'd incur by spreading your data over multiple L2s.  I
> don't have great advice for that, maybe someone else here does.  The good
> news is that if your problem has this level of locality, then you can
> probably get away with emitting C code with pthreads or even openmp.  Just
> bear in mind the thread creation/annihilation overhead (standing
> thread-pools are your friend) and pin them to cores.
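A minimal sketch of the standing-pool-plus-pinning idea, written in Python rather than the emitted C (it relies on `os.sched_setaffinity`, so the pinning part is Linux-only, and the workload is a stand-in):

```python
# Sketch: workers in a standing pool pin themselves to a core once at
# startup, so thread creation/annihilation overhead is avoided and the
# pinning cost is paid once. In emitted C the same idea would use
# pthreads or OpenMP plus an affinity call.
import os
import itertools
from concurrent.futures import ThreadPoolExecutor

_cores = itertools.cycle(range(os.cpu_count() or 1))

def _pin_self():
    try:
        # pid 0 means the calling thread on Linux
        os.sched_setaffinity(0, {next(_cores)})
    except (AttributeError, OSError):
        pass  # not Linux, or not permitted: run unpinned

# standing pool: created once, reused for every batch of DAG chunks
pool = ThreadPoolExecutor(max_workers=2, initializer=_pin_self)

def chunk(v):
    return sum(t * t for t in v)  # stand-in for one DAG chunk

results = list(pool.map(chunk, [[1.0, 2.0], [3.0, 4.0]]))
pool.shutdown()
```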
>
> Good luck,
> Max
>

-- 
You received this message because you are subscribed to the Google Groups 
"sympy" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/sympy/CAJ8oX-Hc2y9C7FO07kkeraDAv7NNRGPkMJ2DvjgF2Oq7PzeS6g%40mail.gmail.com.