Hi Andreas,

Thanks for the comments. I partially agree, but what I find most
interesting is the structure of your comments, as I believe it reveals a
difference in our thinking. I'll elaborate below. (Now that I'm done
writing, I'll apologize in advance for perhaps elaborating too much!)

On Wed, Jul 8, 2015 at 1:23 PM Andreas Hansson <andreas.hans...@arm.com>
wrote:

> Gents,
>
> I’ll let Gabor expound on the value of the non-synchronised checkpoints.
>
> When it comes to the parallelisation, I think it is pretty clear that:
>
> 1. The value of parallelising a single (cache-coherent) gem5 instance is
> questionable,


I think that depends a lot on the parameters.  If you are trying to model a
quad-core system on a quad-core system, then I agree with you. However, if
the number of simulated cores >> the number of host cores, it can make
sense, as then each host core will model multiple simulated cores, and the
relative overhead of synchronization will go down. So if you're trying to
model something like a Knights Landing chip with 60+ cores, I expect you
could get pretty decent speedup if you parallelized the simulation across
4-8 host cores.
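To put rough numbers on that intuition (purely illustrative; the
per-quantum work and barrier costs below are made-up units, not
measurements from gem5):

    # Back-of-envelope: relative cost of a barrier sync when each of
    # P host threads simulates N/P cores per quantum.
    def sync_overhead(n_sim_cores, n_host_threads,
                      work_per_core=1.0, barrier_cost=0.25):
        work = (n_sim_cores / n_host_threads) * work_per_core
        return barrier_cost / (work + barrier_cost)

    print(sync_overhead(4, 4))    # 4-on-4: sync is ~20% of each quantum
    print(sync_overhead(64, 4))   # 64-on-4: sync drops to ~1.5%

The barrier is paid once per quantum regardless of how much simulated
work each thread does, so piling more simulated cores onto each host
thread amortizes it.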

Things also look a little different if you're modeling heterogeneous
nodes; you might benefit from having one thread model all the CPUs while
another thread (or a few threads) models the GPU.
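To make that concrete, here's roughly what I have in mind (a sketch
only: 'system.gpu' is a stand-in for whatever GPU model you'd plug in,
and I'm relying on the per-SimObject eventq_index parameter and the
Root.sim_quantum knob from the existing multi-event-queue support):

    import m5
    from m5.objects import Root

    # 'system' is assumed to be a SimObject tree built elsewhere,
    # with a 'cpu' vector and a (hypothetical) 'gpu' child.
    root = Root(full_system=True, system=system)
    root.sim_quantum = 1000000        # ticks between inter-queue syncs

    for cpu in system.cpu:
        cpu.eventq_index = 0          # all CPUs on one host thread
    system.gpu.eventq_index = 1       # GPU model on a second thread

    # Children default to Parent.eventq_index, so entire subtrees
    # follow their parent's assignment unless overridden.
    m5.instantiate()
    m5.simulate()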

Note that, IIRC, the SST folks at Sandia are mostly using SST to model
large-scale multi-threaded systems, not distributed message-passing
systems, and they do so using MPI for parallelization, not shared memory.


> and the cost of making gem5 thread safe is high.


While this is indisputably true of the patch we have up on ReviewBoard,
I'm not convinced it's a fundamental truth. I think that with some
effort this cost can be driven down a lot.


> That said,
> if someone wants to do it, the multi-event-queue approach seems like a
> good start.
>

No argument there.


>
> 2. Parallelising gem5 on the node level and the inter-node level, using
> one mechanism seems like an odd goal.


When you say "node" here, do you mean host node or simulated node? If the
former, I agree; if the latter, I disagree.

In particular, if you mean the latter, then the extrapolation of what
you're saying is that we end up with one model of a multi-node system
when we run it on a single host, and a different model of the same
multi-node system when we run it across multiple hosts. That's what we
see now with multi-gem5, where the switch model for a distributed
simulation isn't even a gem5 model and couldn't be used if you wanted to
run the whole model inside a single gem5 process. Having a single
simulation model that doesn't change regardless of how we execute the
simulation seems a lot more elegant to me, and we actually achieve that
with the multi-event-queue feature. Obviously there will be practical
constraints on how a model can be partitioned across multiple host
nodes, and little things like instantiating a different flavor of
EtherLink depending on whether a connection is intra-host-node or
inter-host-node don't bother me much. But to the extent possible, I
believe we should keep those as merely practical constraints, not
fundamental limitations.
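Concretely, the kind of thing I'd hope a config could do (schematic
only: MultiEtherLink is a stand-in name for the multi-gem5 inter-process
link, and the '.ethernet.interface' paths are placeholders for whatever
the real NIC hierarchy looks like):

    from m5.objects import EtherLink

    def make_link(node_a, node_b, same_host, delay='10us'):
        # One model description either way; only the link flavor
        # changes based on how the model is partitioned across hosts.
        if same_host:
            link = EtherLink(delay=delay)
            link.int0 = node_a.ethernet.interface   # placeholder paths
            link.int1 = node_b.ethernet.interface
            return link
        # Inter-host connection: same topology, different transport.
        return MultiEtherLink(delay=delay)          # stand-in class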


> Just like OpenMP and OpenMPI are
> well suited for different communication mechanisms, I would argue that we
> need parallelisation techniques well suited for the systems the simulation
> will run on.


Yes, I agree: we need a message-based parallelization scheme for
multi-node hosts and a shared-memory-based scheme for intra-host-node
parallelization. Two different techniques for two different
environments. But that doesn't mean they can't coexist and complement
each other rather than being mutually exclusive options, much like many
programs are written in MPI+OpenMP.
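The analogy carries over almost literally. In miniature (mpi4py and
Python threads standing in for the two layers; run with something like
"mpiexec -n 2 python hybrid.py"):

    from concurrent.futures import ThreadPoolExecutor
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()              # one MPI rank per host node

    def simulate_slice(i):
        # Placeholder for simulating one slice of the model.
        return (rank, i)

    # Shared-memory parallelism within a host node...
    with ThreadPoolExecutor(max_workers=4) as pool:
        local = list(pool.map(simulate_slice, range(4)))

    # ...and message passing between host nodes.
    combined = comm.gather(local, root=0)
    if rank == 0:
        print(combined)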


> A very natural (and efficient) way of doing things is to map
> each gem5 instance (and thus simulated node), to a host machine, and have
> the host machines communicate over Ethernet.
>

That's certainly natural if the number of simulated nodes is equal to the
number of host nodes. It's not so obvious to me that you want every
simulated node in its own gem5 process, communicating over sockets, if the
number of simulated nodes >> the number of host nodes. Sure, given that the
code is written, that's a quick way to get things working while we polish
up the multi-event-queue fixes, but that doesn't mean it's the ideal
long-term strategy. In particular, if you go to the degenerate case where
we have multiple simulated nodes and a single host node, then using
multiple processes means we have two different parallelization strategies
for running on a multi-core shared-memory host. Not that we would (or
could) ban people from running multiple gem5 instances on a single
system. But the more relevant question is: given a finite amount of
effort, would we rather spend it on writing a shared-memory backend for
MultiIface, or on addressing the performance cost of making a single
gem5 process thread-safe? Obviously I favor the latter, because I think
it's the more general solution, and I believe it's the one that will
lead to higher performance on single-node hosts in the end.

Note that I'm definitely not saying that all of this needs to be
implemented before we commit anything from multi-gem5 or pd-gem5. I'm just
trying to establish a vision for where we think gem5 should go with respect
to parallelization, so that we can choose short-term steps that align best
with that destination, even if they are only initial steps down that path.

Steve


>
> Do you agree?
>
> Andreas
>