Sure, I am not saying we will get there soon, but I am glad we agree on
what is desirable.

Actually, my reference to SST was not about multi-threading within a host,
but about SST as a system that parallelizes the simulation of a single
cache-coherent system across multiple hosts. I am not advocating their
approach :). I was just pushing back on your statement that parallelizing
the simulation of a single cache-coherent system is "questionable" by
providing a counter-example. If you want to counter that by calling SST
itself questionable, go ahead; I don't know what their speedup numbers look
like, so I can neither criticize nor defend them on that point.

When I mentioned that you could probably get decent speedups parallelizing
a large KNL-like coherent system across 4-8 host cores, I was thinking of a
single-process parallel model like our multi-queue model, where the
synchronization overheads should be much lower. Also, I meant "decent
speedups" with respect to optimized single-threaded simulations, factoring
in the overheads of parallelization. We haven't shown this yet, but I don't
think there are fundamental reasons it couldn't be achieved.
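
To be concrete about what I mean by the multi-queue model: give each group
of simulated cores (plus their private caches) its own event queue, and let
the queues synchronize on a quantum, all inside one gem5 process. A rough
sketch of the configuration side is below. The partition_across_queues()
helper is made up, and I'm assuming a KNL-like 'system' has already been
built; eventq_index and sim_quantum are the existing knobs, if I remember
the names right:

    # Sketch only: assumes 'system' is an already-built many-core config
    # with CPUs in system.cpu; partition_across_queues() is a made-up helper.
    import m5
    from m5.objects import Root

    NUM_QUEUES = 8   # number of host threads we'd like to use

    def partition_across_queues(system, num_queues):
        """Assign each simulated CPU and its private caches to a queue."""
        for i, cpu in enumerate(system.cpu):
            q = i % num_queues
            cpu.eventq_index = q
            # Private caches belong on the same queue as their CPU; shared
            # resources (L2s, interconnect, memory) stay on queue 0.
            if hasattr(cpu, 'icache'):
                cpu.icache.eventq_index = q
            if hasattr(cpu, 'dcache'):
                cpu.dcache.eventq_index = q

    root = Root(full_system=True, system=system)
    root.sim_quantum = int(1e6)   # 1 us at the default 1 THz tick rate
    partition_across_queues(system, NUM_QUEUES)

    m5.instantiate()
    m5.simulate()

The point is that the model itself doesn't change; only the assignment of
objects to queues does, and the synchronization cost is a function of the
quantum rather than of socket round-trips between processes.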

Anyway, getting back to nearer-term issues, I'll say again that the one
thing I clearly prefer about pd-gem5 over multi-gem5 is that it uses a real
gem5 switch model, which indicates to me that it should be possible to
create a single-process, single-threaded gem5 simulation that gets the same
result as a parallel simulation. I don't think you can do that in
multi-gem5, since the switch model has to run in its own process because
it's not really a gem5 model. It's not a fatal flaw, and in the near term
there may not even be significant practical consequences, but to me it's
rather inelegant in that it ties the parallelization and the simulation
model very intimately together, rather than providing a general framework
for multi-host parallel simulation.

For example, let's say I decided I wanted to model a non-Ethernet network
(maybe InfiniBand?), and wanted to model it in more detail, with multiple
switches and links between the switches. Let's further suppose that I
wanted to build a single set of IB switch and link models (as SimObjects)
and use them in two modes: a trace-driven mode, where perhaps a
single-threaded, single-process simulation would be fast enough, and an
execution-driven mode, where I would want to parallelize the simulation
across multiple hosts. It seems like that would be a lot more
straightforward in pd-gem5.
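
To make that more concrete, here is roughly the structure I'm imagining.
All the names below are hypothetical (IBSwitch, IBLink, IBTraceGen, and
their ports don't exist anywhere); the point is just that the same
SimObjects get instantiated whether the traffic is trace-driven in a single
process or execution-driven and partitioned across hosts:

    # Hypothetical sketch; IBSwitch, IBLink, and IBTraceGen are made-up
    # SimObjects used only to illustrate the structure.
    import m5
    from m5.objects import *   # assumes the IB models are built as SimObjects

    def build_fabric(num_switches, link_delay='100ns'):
        """Build the same switch/link topology regardless of traffic mode."""
        switches = [IBSwitch() for _ in range(num_switches)]
        links = []
        for i in range(num_switches - 1):
            link = IBLink(delay=link_delay)     # one link per adjacent pair
            link.int0 = switches[i].fabric_port
            link.int1 = switches[i + 1].fabric_port
            links.append(link)
        return switches, links

    root = Root(full_system=False)
    root.switches, root.links = build_fabric(num_switches=4)

    # Trace-driven mode: attach trace-replay generators at the edges and run
    # the whole thing single-threaded in one process.
    root.tracegens = [IBTraceGen(trace='node%d.trc' % i) for i in range(4)]
    for i, tg in enumerate(root.tracegens):
        tg.port = root.switches[i].edge_port

    # Execution-driven mode: attach full simulated nodes at the edges
    # instead, and let the framework decide how to partition them across
    # event queues or host processes, without touching build_fabric() above.

    m5.instantiate()
    m5.simulate()

Whether that second mode ends up using MultiIface-style sockets or multiple
event queues is exactly the decision I'd like to keep out of the model
itself.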

So at a high level it seems to me that a solution that combines the
MultiIface work from multi-gem5 with the pd-gem5 switch model would be the
best of both. I haven't looked at the code closely enough to know why that
won't work, so I'll let you tell me.

Steve


On Mon, Jul 13, 2015 at 12:24 AM Andreas Hansson <[email protected]>
wrote:

> Hi Steve,
>
> Thanks for the elaborate comments. I agree with all the points about what
> is desirable, but I am also painfully aware of some real empirical data points
> suggesting it will be difficult, if not impossible, to get there.
>
> To take a concrete example, you mention SST for multi-threading within one
> host. It may well give you a 4-8X speedup by doing so, but comparing gem5
> classic to SST, we are looking at roughly a 4X speed difference. Hence,
> you only gain back what you lost in making it multi-threaded in the first
> place, and now you are using ~8X the resources. That is my worry about
> trying to find one mechanism, or about doing it all within the simulator.
> I hope you’re
> right (and I’m wrong), but I would like to see at least one data point
> hinting that it is possible to achieve what you are describing. So far I
> am not convinced.
>
> Andreas
>
>
> On 13/07/2015 05:36, "gem5-dev on behalf of Steve Reinhardt"
> <[email protected] on behalf of [email protected]> wrote:
>
> >Hi Andreas,
> >
> >Thanks for the comments---I partially agree, but I think the structure of
> >your comments is what I find most interesting, as I believe it reveals a
> >difference in our thinking. I'll elaborate below. (Now that I'm done, I'll
> >apologize in advance for perhaps elaborating too much!)
> >
> >On Wed, Jul 8, 2015 at 1:23 PM Andreas Hansson <[email protected]>
> >wrote:
> >
> >> Gents,
> >>
> >> I’ll let Gabor expound on the value of the non-synchronised checkpoints.
> >>
> >> When it comes to the parallelisation, I think it is pretty clear that:
> >>
> >> 1. The value of parallelising a single (cache-coherent) gem5 instance is
> >> questionable,
> >
> >
> >I think that depends a lot on the parameters. If you are trying to model a
> >quad-core system on a quad-core system, then I agree with you. However, if
> >the number of simulated cores >> the number of host cores, it can make
> >sense, as then each host core will model multiple simulated cores, and the
> >relative overhead of synchronization will go down. So if you're trying to
> >model something like a Knights Landing chip with 60+ cores, I expect you
> >could get pretty decent speedup if you parallelized the simulation across
> >4-8 host cores.
> >
> >Things also look a little different if you're doing heterogeneous nodes;
> >perhaps you might benefit from having one thread model all the CPUs while
> >another thread (or a few threads) models the GPU.
> >
> >Note that, IIRC, the SST folks at Sandia are mostly using SST to model
> >large-scale multi-threaded systems, not distributed message-passing
> >systems---and this is using MPI for parallelization, not shared memory.
> >
> >
> >> and the cost of making gem5 thread safe is high.
> >
> >
> >While this is indisputably true for the patch we have up on reviewboard,
> >I'm not convinced that's a fundamental truth.  I think that with some
> >effort this cost can be driven down a lot.
> >
> >
> >> That said,
> >> if someone wants to do it, the multi-event-queue approach seems like a
> >> good start.
> >>
> >
> >No argument there.
> >
> >
> >>
> >> 2. Parallelising gem5 on the node level and the inter-node level using
> >> one mechanism seems like an odd goal.
> >
> >
> >When you say "node" here, do you mean host node or simulated node? If the
> >former, I agree; if the latter, I disagree.
> >
> >In particular, if you mean the latter, then the extrapolation of what
> >you're saying is that we will end up with one model of a multi-node system
> >if we're going to run the model on a single host, and a different model of
> >the same multi-node system if we intend to run the model on multiple
> >hosts---like what we see now with multi-gem5 where the switch model for a
> >distributed simulation isn't even a gem5 model and couldn't be used if you
> >wanted to run the whole model inside a single gem5 process. Having a single
> >simulation model that doesn't change regardless of how we execute the
> >simulation seems a lot more elegant to me, and we actually achieve that
> >with the multi-event-queue feature. Obviously there will be practical
> >constraints on how a model can be partitioned across multiple host nodes,
> >and little things like instantiating a different flavor of EtherLink
> >depending on whether it's an intra-host-node or inter-host-node connection
> >don't bother me that much, but to the extent possible I believe we should
> >keep those as merely practical constraints and not fundamental limitations.
> >
> >
> >> Just like OpenMP and OpenMPI are well suited for different communication
> >> mechanisms, I would argue that we need parallelisation techniques well
> >> suited for the systems the simulation will run on.
> >
> >
> >Yes, I agree, we need a message-based parallelization scheme for multi-node
> >hosts, and a shared-memory-based scheme for intra-host-node parallelization.
> >Two different techniques for two different environments.
> >But that doesn't mean they can't co-exist & complement each other, rather
> >than being mutually exclusive options, much like many programs are written
> >in MPI+OpenMP.
> >
> >
> >> A very natural (and efficient) way of doing things is to map each gem5
> >> instance (and thus simulated node) to a host machine, and have the host
> >> machines communicate over Ethernet.
> >>
> >
> >That's certainly natural if the number of simulated nodes is equal to the
> >number of host nodes. It's not so obvious to me that you want every
> >simulated node in its own gem5 process, communicating over sockets, if the
> >number of simulated nodes >> the number of host nodes. Sure, given that the
> >code is written, that's a quick way to get things working while we polish
> >up the multi-event-queue fixes, but that doesn't mean it's the ideal
> >long-term strategy. In particular, if you go to the degenerate case where
> >we have multiple simulated nodes and a single host node, then using
> >multiple processes means we have two different parallelization strategies
> >for running on a multi-core shared-memory host. Not that we would (or
> >could) ban people from running multiple gem5 instances on a single system,
> >but a more relevant question is, given a finite amount of effort, would we
> >want to spend it on writing a shared-memory backend for MultiIface or on
> >addressing the performance issues in making a single gem5 process
> >thread-safe? Obviously I favor the latter because I think it's a more
> >general solution, and I believe one that will lead to higher performance on
> >single-node hosts in the end.
> >
> >Note that I'm definitely not saying that all of this needs to be
> >implemented before we commit anything from multi-gem5 or pd-gem5. I'm just
> >trying to establish a vision for where we think gem5 should go with respect
> >to parallelization, so that we can choose short-term steps that align best
> >with that destination, even if they are only initial steps down that path.
> >
> >Steve
> >
> >
> >>
> >> Do you agree?
> >>
> >> Andreas
> >>