Here I try to summarize this long discussion (covering only the points that
were actually discussed in this email thread).

1- Synchronization: using one socket for both communication and
synchronization is the superior design (as in multi-gem5). Both pd-gem5 and
multi-gem5 use barrier synchronization (see the sketch after this list).

2- Communication: communicating through sockets is sufficient; we don’t
need to extend this with other programming models.

3- Checkpointing: we should go for synchronized checkpointing.
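
To make point 1 concrete, here is a rough sketch (not actual pd-gem5 or
multi-gem5 code; the message framing and names are made up) of how each gem5
process could use a single TCP connection both to ship simulated Ethernet
frames and to block on the periodic barrier:

    import socket
    import struct

    # Hypothetical message types; the real pd-gem5/multi-gem5 framing differs.
    MSG_DATA = 0     # carries a simulated Ethernet frame
    MSG_BARRIER = 1  # "this gem5 process reached the end of its quantum"

    class PeerLink:
        """One TCP connection per gem5 process, used for both data and sync."""

        def __init__(self, server_host, server_port):
            self.sock = socket.create_connection((server_host, server_port))

        def _send(self, msg_type, payload=b""):
            # 1-byte type + 4-byte length prefix, then the payload.
            self.sock.sendall(struct.pack("!BI", msg_type, len(payload)) + payload)

        def _recv_exact(self, n):
            buf = b""
            while len(buf) < n:
                chunk = self.sock.recv(n - len(buf))
                if not chunk:
                    raise ConnectionError("server closed the connection")
                buf += chunk
            return buf

        def send_frame(self, frame):
            """Forward a simulated Ethernet frame over the shared connection."""
            self._send(MSG_DATA, frame)

        def barrier(self):
            """Block until every participating gem5 process reaches the barrier."""
            self._send(MSG_BARRIER)
            msg_type, length = struct.unpack("!BI", self._recv_exact(5))
            assert msg_type == MSG_BARRIER and length == 0

A central server (or the switch process) would count the barrier messages and
reply to every peer once all of them have checked in.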

Here are some major differences between pd-gem5 and multi-gem5, along with
their current status:

Accuracy:

multi-gem5: It sends a header packet before each data packet; the header
carries the information needed for precise packet delivery. Ensuring in-order
packet delivery is still under test.

pd-gem5: Each packet carries a timestamp to ensure precise packet delivery.
Accurate communication is well tested.
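
To illustrate the timestamp idea (the field names below are invented, not the
actual pd-gem5 or multi-gem5 header layout): the sender tags each frame with
its send tick, and the receiver releases frames at send tick plus the
configured link delay, which also gives a natural sort key for in-order
delivery.

    import heapq
    from dataclasses import dataclass, field

    @dataclass(order=True)
    class TimestampedFrame:
        send_tick: int                  # simulated time when the frame left the sender
        seq: int                        # per-sender sequence number breaks ties
        payload: bytes = field(compare=False)

    class ReceiveQueue:
        """Orders incoming frames and releases them at the right simulated tick."""

        def __init__(self, link_delay_ticks):
            self.link_delay = link_delay_ticks
            self.pending = []           # min-heap ordered by (send_tick, seq)

        def enqueue(self, frame):
            heapq.heappush(self.pending, frame)

        def deliverable(self, now_tick):
            """Pop every frame whose delivery time has been reached."""
            out = []
            while self.pending and \
                    self.pending[0].send_tick + self.link_delay <= now_tick:
                out.append(heapq.heappop(self.pending))
            return out

The usual requirement in such schemes is that the link delay is no smaller
than the synchronization quantum, so no frame can arrive in a peer’s
simulated past.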

Network topologies:

multi-gem5: It can model a star topology.

pd-gem5: It can model arbitrary network topologies.
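
To picture the difference (the Switch/Link helpers below are invented for
illustration only and are not gem5 SimObjects): a star is a single crossbar
with every node attached to it, while an arbitrary topology can, for example,
be a two-level tree of switches.

    class Switch:
        """Illustrative stand-in for a simulated switch model."""
        def __init__(self, name):
            self.name = name

    class Link:
        """Illustrative point-to-point link with a fixed delay."""
        def __init__(self, a, b, delay_ns=500):
            self.ends, self.delay_ns = (a, b), delay_ns

    def star(nodes):
        """Single crossbar with every node attached (what multi-gem5 models)."""
        xbar = Switch("xbar")
        return xbar, [Link(xbar, n) for n in nodes]

    def two_level_tree(node_groups):
        """Leaf switch per group plus a root switch (possible in pd-gem5)."""
        root = Switch("root")
        links = []
        for i, group in enumerate(node_groups):
            leaf = Switch("leaf%d" % i)
            links.append(Link(root, leaf))
            links.extend(Link(leaf, n) for n in group)
        return root, links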

Integration (gem5 direction for parallelization):

multi-gem5: “multi-gem5 does not support fine grain simulation of
hierarchical switches (or any other network topologies except a single
crossbar) or multiple synchronization domains currently”

pd-gem5: Its switch is a regular gem5 model, so the same setup can be
replicated in, and integrated with, a single-threaded gem5 simulation.
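
The integration point can be sketched as follows (placeholder classes only,
not the real configuration scripts): because the pd-gem5 switch is an
ordinary simulated component, the same system description can be instantiated
whole inside one single-threaded gem5 process, or sliced so that each host
process builds only its own node plus a socket-backed link to the rest.

    class SwitchModel:
        """Placeholder for a switch that is itself a simulated component."""

    class SocketLink:
        """Placeholder for a link whose far end lives in another gem5 process."""
        def __init__(self, rank):
            self.rank = rank

    class Node:
        """Placeholder for one simulated machine."""
        def __init__(self, index, attach_to):
            self.index, self.attached_to = index, attach_to

    def build_system(num_nodes, my_rank=None):
        if my_rank is None:
            # Single-process, single-threaded run: the switch and every node
            # live in one gem5 instance, so it should be possible to reproduce
            # the result of a parallel run.
            switch = SwitchModel()
            return [Node(i, attach_to=switch) for i in range(num_nodes)]
        # Distributed run: this process models only its own node and reaches
        # the rest of the simulated network over a socket-backed link.
        return [Node(my_rank, attach_to=SocketLink(my_rank))]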



-Mohammad

On Thu, Jul 16, 2015 at 12:12 AM, Steve Reinhardt <[email protected]> wrote:

> Sure, I am not saying we will get there soon, but I am glad we agree on
> what is desirable.
>
> Actually my reference to SST was not in regard to multi-threading within a
> host, but as a system that parallelizes the simulation of a single
> cache-coherent system across multiple hosts. I am not advocating their
> approach :). I was just pushing back on your statement that parallelizing
> the simulation of a single cache-coherent system is "questionable" by
> providing a counter-example. If you want to counter that by calling SST
> itself questionable, go ahead; I don't know what their speedup numbers look
> like, so I can neither criticize nor defend them on that point.
>
> When I mentioned that you could probably get decent speedups parallelizing
> a large KNL-like coherent system across 4-8 cores, I was thinking of a
> single-process parallel model like our multi-queue model, where the
> synchronization overheads should be much lower. Also I meant "decent
> speedups" with respect to optimized single-threaded simulations, factoring
> in the overheads of parallelization. We haven't shown this yet, but I don't
> think there are fundamental reasons it couldn't be achieved.
>
> Anyway, getting back to nearer-term issues, I'll say again that the one
> thing I clearly prefer about pd-gem5 over multi-gem5 is that it is using a
> real gem5 switch model, which indicates to me that it should be possible to
> create a single-process single-threaded gem5 simulation that gets the same
> result as a parallel simulation. I don't think you can do that in
> multi-gem5, since you have to have the switch model running in its own
> process since it's not really a gem5 model. It's not a fatal flaw, and in
> the near term there may not even be significant practical consequences, but
> to me it's rather inelegant in that it is tying the parallelization and the
> simulation model very intimately together, rather than trying to provide a
> general framework for multi-host parallel simulation.
>
> For example, let's say I decided I wanted to model a non-Ethernet network
> (maybe InfiniBand?), and wanted to model it in more detail with multiple
> switches and links between the switches. Let's further suppose that I
> wanted to build a single set of IB switch and link models (as SimObjects)
> and use them in two modes: one with trace-driven network traffic, where
> perhaps a single-threaded single-process simulation would be fast enough,
> and in an execution-driven model, where I would want to parallelize the
> simulation across multiple hosts. It seems like that would be a lot more
> straightforward in pd-gem5.
>
> So at a high level it seems to me that a solution that combines the
> MultiIface work from multi-gem5 with the pd-gem5 switch model would be the
> best of both. I haven't looked at the code closely enough to know why that
> won't work, so I'll let you tell me.
>
> Steve
>
>
> On Mon, Jul 13, 2015 at 12:24 AM Andreas Hansson <[email protected]>
> wrote:
>
> > Hi Steve,
> >
> > Thanks for the elaborate comments. I agree with all the points of what is
> > desired, but I am also painfully aware of some real empirical data points
> > suggesting it will be difficult, if not impossible, to get there.
> >
> > To take a concrete example, you mention SST for multi-threading within one
> > host. It may well give you 4-8X speedup doing so, but comparing gem5
> > classic to SST, we are looking at roughly a 4X speed difference. Hence,
> > you only gain back what you lost in making it multi-threaded in the first
> > place, and now you are using ~8X the resources. Hence my worry with trying
> > to find one mechanism or doing it all within the simulator. I hope you’re
> > right (and I’m wrong), but I would like to see at least one data point
> > hinting that it is possible to achieve what you are describing. So far I
> > am not convinced.
> >
> > Andreas
> >
> >
> > On 13/07/2015 05:36, "gem5-dev on behalf of Steve Reinhardt"
> > <[email protected] on behalf of [email protected]> wrote:
> >
> > >Hi Andreas,
> > >
> > >Thanks for the comments---I partially agree, but I think the structure of
> > >your comments is the most interesting to me, as I believe it reveals a
> > >difference in our thinking. I'll elaborate below. (Now that I'm done, I'll
> > >apologize in advance for perhaps elaborating too much!)
> > >
> > >On Wed, Jul 8, 2015 at 1:23 PM Andreas Hansson <[email protected]>
> > >wrote:
> > >
> > >> Gents,
> > >>
> > >> I’ll let Gabor expound on the value of the non-synchronised checkpoints.
> > >>
> > >> When it comes to the parallelisation, I think it is pretty clear that:
> > >>
> > >> 1. The value of parallelising a single (cache coherent) gem5 instance is
> > >> questionable,
> > >
> > >
> > >I think that depends a lot on the parameters. If you are trying to model
> > >a quad-core system on a quad-core system, then I agree with you. However,
> > >if the number of simulated cores >> the number of host cores, it can make
> > >sense, as then each host core will model multiple simulated cores, and
> > >the relative overhead of synchronization will go down. So if you're
> > >trying to model something like a Knights Landing chip with 60+ cores, I
> > >expect you could get pretty decent speedup if you parallelized the
> > >simulation across 4-8 host cores.
> > >
> > >Things also look a little different if you're doing heterogeneous nodes;
> > >perhaps you might benefit from having one thread model all the CPUs while
> > >another thread (or few threads) are used to model the GPU.
> > >
> > >Note that, IIRC, the SST folks at Sandia are mostly using SST to model
> > >large-scale multi-threaded systems, not distributed message-passing
> > >systems---and this is using MPI for parallelization, not shared memory.
> > >
> > >
> > >> and the cost of making gem5 thread safe is high.
> > >
> > >
> > >While this is indisputably true for the patch we have up on reviewboard,
> > >I'm not convinced that's a fundamental truth.  I think that with some
> > >effort this cost can be driven down a lot.
> > >
> > >
> > >> That said,
> > >> if someone wants to do it, the multi-event-queue approach seems like a
> > >> good start.
> > >>
> > >
> > >No argument there.
> > >
> > >
> > >>
> > >> 2. Parallelising gem5 on the node level and the inter-node level, using
> > >> one mechanism seems like an odd goal.
> > >
> > >
> > >When you say "node" here, do you mean host node or simulated node? If the
> > >former, I agree; if the latter, I disagree.
> > >
> > >In particular, if you mean the latter, then the extrapolation of what
> > >you're saying is that we will end up with one model of a multi-node
> > >system if we're going to run the model on a single host, and a different
> > >model of the same multi-node system if we intend to run the model on
> > >multiple hosts---like what we see now with multi-gem5 where the switch
> > >model for a distributed simulation isn't even a gem5 model and couldn't
> > >be used if you wanted to run the whole model inside a single gem5
> > >process. Having a single simulation model that doesn't change regardless
> > >of how we execute the simulation seems a lot more elegant to me, and we
> > >actually achieve that with the multi-event-queue feature. Obviously there
> > >will be practical constraints on how a model can be partitioned across
> > >multiple host nodes, and little things like instantiating a different
> > >flavor of EtherLink depending on whether it's an intra-host-node or
> > >inter-host-node connection don't bother me that much, but to the extent
> > >possible I believe we should keep those as merely practical constraints
> > >and not fundamental limitations.
> > >
> > >
> > >> Just like OpenMP and OpenMPI are
> > >> well suited for different communication mechanisms, I would argue that
> > >>we
> > >> need parallelisation techniques well suited for the systems the
> > >>simulation
> > >> will run on.
> > >
> > >
> > >Yes, I agree, we need a message-based parallelization scheme for
> > >multi-node
> > >hosts, and a shared-memory based scheme for intra-host-node
> > >parallelization. Two different techniques for two different environments.
> > >But that doesn't mean they can't co-exist & complement each other, rather
> > >than being mutually exclusive options, much like many programs are
> > >written in MPI+OpenMP.
> > >
> > >
> > >> A very natural (and efficient) way of doing things is to map
> > >> each gem5 instance (and thus simulated node), to a host machine, and
> > >>have
> > >> the host machines communicate over Ethernet.
> > >>
> > >
> > >That's certainly natural if the number of simulated nodes is equal to the
> > >number of host nodes. It's not so obvious to me that you want every
> > >simulated node in its own gem5 process, communicating over sockets, if
> > >the number of simulated nodes >> the number of host nodes. Sure, given
> > >that the code is written, that's a quick way to get things working while
> > >we polish up the multi-event-queue fixes, but that doesn't mean it's the
> > >ideal long-term strategy. In particular, if you go to the degenerate case
> > >where we have multiple simulated nodes and a single host node, then using
> > >multiple processes means we have two different parallelization strategies
> > >for running on a multi-core shared-memory host. Not that we would (or
> > >could) ban people from running multiple gem5 instances on a single
> > >system, but a more relevant question is, given a finite amount of effort,
> > >would we want to spend it on writing a shared-memory backend for
> > >MultiIface or on addressing the performance issues in making a single
> > >gem5 process thread-safe? Obviously I favor the latter because I think
> > >it's a more general solution, and I believe one that will lead to higher
> > >performance on single-node hosts in the end.
> > >
> > >Note that I'm definitely not saying that all of this needs to be
> > >implemented before we commit anything from multi-gem5 or pd-gem5. I'm
> > >just trying to establish a vision for where we think gem5 should go with
> > >respect to parallelization, so that we can choose short-term steps that
> > >align best with that destination, even if they are only initial steps
> > >down that path.
> > >
> > >Steve
> > >
> > >
> > >>
> > >> Do you agree?
> > >>
> > >> Andreas
> > >>
_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
