Hi all,

Sorry for taking so long to update the pd-gem5 patch.
Here is the new set of patches:

http://reviews.gem5.org/r/3025/
http://reviews.gem5.org/r/3024/
http://reviews.gem5.org/r/3023/

These three are new pd-gem5 patches. I discarded the old patches as I have changed them extensively.

http://reviews.gem5.org/r/3021/

This is a redistribution of http://reviews.gem5.org/r/2305/ and is independent of the pd-gem5 patch. But since it's a well-written switch model with some useful features, I think there is value in revisiting and committing it. I have made some slight modifications and fixed some bugs in it; however, it still has the issue raised by Steve.

I changed the pd-gem5 patch to address its problems based on the discussion in this email thread. I think this design pretty much has the best of both pd-gem5 and multi-gem5. Here is an overview of the new pd-gem5 patch:

synchronization: It uses one socket connection for delivering both data packets and sync messages, so data packets never bypass sync barriers. Synchronization is done inside the EtherTap interface by scheduling sendSync & recvSync events, without any need for a separate process governing the simulation (the "tcp-server" in multi-gem5 or the "barrier" process in the old pd-gem5). In each pd-simulation there is a central switch box that is responsible for forwarding data packets and managing the sendSync & recvSync messages from the other gem5 processes (it still allows the hierarchical simulation that we discussed earlier).

communication: Takes place via the EtherTap interface.

checkpointing: It implements synchronous checkpointing.

This new patch provides robustness while preserving all the features of the previous pd-gem5 design, with an even cleaner and easier-to-use implementation. Regarding hierarchical simulation, I was able to run a simulation with 8 gem5 processes simulating 32 nodes (each gem5 process simulating 4 full systems with a local switch and an up-link) and one gem5 process simulating a top-level switch box (32 nodes + 9 switches in total).
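As an aside on the single-socket design above: the "data packets never bypass sync barriers" property falls out of the FIFO ordering of a stream socket. A minimal, self-contained sketch (the frame format and names here are made up for illustration and are not taken from the patch):

```python
import socket
import struct

# Hypothetical frame types; the real patch defines its own message format.
MSG_DATA, MSG_SYNC = 0, 1

def send_msg(sock, kind, payload=b""):
    # Length-prefixed frame: 1-byte type + 4-byte payload length + payload.
    sock.sendall(struct.pack("!BI", kind, len(payload)) + payload)

def recv_msg(sock):
    # Read exactly one frame back off the stream, in order.
    hdr = sock.recv(5, socket.MSG_WAITALL)
    kind, length = struct.unpack("!BI", hdr)
    payload = sock.recv(length, socket.MSG_WAITALL) if length else b""
    return kind, payload

# Demonstrate the ordering guarantee with a local socket pair.
a, b = socket.socketpair()
send_msg(a, MSG_DATA, b"pkt-1")   # data packet sent before the barrier...
send_msg(a, MSG_SYNC)             # ...then the sync message
received = [recv_msg(b)[0] for _ in range(2)]
# Stream FIFO ordering: the data frame is always read first.
print(received)  # [0, 1]
```

Because the data frame and the sync frame share one stream, the receiver can never observe the barrier before a packet that was sent ahead of it, which is why no separate barrier process is needed to order them.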
However, this would be useless with single-threaded gem5; it would be handy and fast with multi-threaded gem5.

Thanks,
Mohammad

On Tue, Jul 21, 2015 at 3:30 PM, Mohammad Alian <[email protected]> wrote:

> Hi Steve,
>
> Sorry for the misinterpretation; my comment on communication is not
> correct, and there is certainly value in using other programming models.
> My argument is that if we could get multi-threaded gem5 working and model
> multiple nodes inside one multi-threaded gem5 process (which is definitely
> faster than using separate gem5 processes), then it's reasonable to assume
> that the number of nodes in pd-gem5 would not exceed a couple of gem5
> processes, and as we can synchronize pd-gem5 nodes at a coarser granularity,
> this will diminish the value of having higher-performance communication
> programming models. Nevertheless, although pd-gem5 doesn't provide abstract
> functions for communication, I think the effort for implementing it with
> another programming model would be on par with the effort needed for
> multi-gem5.
>
> Maybe I'm missing something, but I didn't understand your point here:
> "Note that this doesn't strictly require that the switch model is
> co-located with this central coordination process; that just happens to be
> convenient and efficient." I cannot see how we can have both a central
> server and a switch model (potentially distributed), because the central
> server does packet routing, which is supposed to be done in the switch
> model.
>
> Thank you for proposing the reduction method for ensuring on-time packet
> delivery. We can implement this in pd-gem5, and I'm working on that right
> now. Also, I modified the pd-gem5 patch to enable replication of the same
> simulation inside one process, as well as distributing the switch model
> across full-system gem5 processes (hierarchical network topology).
> This should also work with multi-threaded gem5, as the synchronization of
> pd-gem5 nodes is independent of the internal synchronization of
> multi-threaded gem5 processes. I'll update the pd-gem5 patch soon.
>
> Thank you,
> Mohammad
>
> On Sat, Jul 18, 2015 at 7:44 PM, Steve Reinhardt <[email protected]> wrote:
>
>> Hi Mohammad,
>>
>> Thanks for the summaries & responses.
>>
>> I agree with your summaries on synchronization and checkpointing. However,
>> as far as communication goes, I'd like to clarify, as I'm not sure exactly
>> what you mean by "communicating through socket is sufficient and we don't
>> need to expand this with other programming models". Socket communication is
>> sufficient for now, but I think there is potentially a lot of value in
>> being able to take advantage of higher-performance networking models such
>> as MPI and InfiniBand. What I like about MultiIface is that it provides an
>> abstraction that should enable the development of other messaging layers.
>> It's a little premature to know exactly how well that will work until you
>> try to develop a second implementation, but at least the concept is right.
>>
>> As far as synchronization goes, that's a harder problem. I do think we need
>> to have a model that always works, not almost always works. This is
>> particularly challenging with sockets, which weren't built for fine-grain
>> communication and synchronization, which is one reason why I think there
>> would be a lot of value in moving to more HPC-oriented communication models
>> like MPI on systems that support them. In fact, I know that the common MPI
>> platforms (MPICH and Open MPI) both support Ethernet transport, so it might
>> even be the case that coding to MPI would provide performance that's as
>> good as or better than going directly to sockets.
>>
>> The advantage of having all communication routed through a single central
>> server (as in multi-gem5) is that you can provide ordering guarantees
>> between the point-to-point messages and the barrier messages, so that you
>> can guarantee that a message sent before a barrier is initiated has been
>> received and processed before that barrier completes. Note that this
>> doesn't strictly require that the switch model is co-located with this
>> central coordination process; that just happens to be convenient and
>> efficient.
>>
>> There are other ways to guarantee all messages have been delivered besides
>> relying on socket ordering. For example, if you have each node track the
>> net number of messages it has sent (msgs sent - msgs rcvd) and do a
>> reduction on this value instead of a simple barrier, then you know all
>> messages have been delivered when this value reaches zero. (In fact, IIRC,
>> that's what we did in WWT, using the CM-5 hardware reduction network.) You
>> have to iterate over the reduction until the value is zero, which could
>> theoretically cost some performance, but depending on the timing of the
>> network the number of iterations would be small---though you'd have to hit
>> zero on the first try most of the time to get the same performance as a
>> barrier.
>>
>> So I admit I hadn't thought enough about this before, but we shouldn't
>> consider the multi-gem5 and pd-gem5 approaches as the only two possible
>> ways of doing communication & synchronization.
>>
>> I'm going to consult with our local MPI expert and see what he thinks
>> about using MPI here.
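The reduction scheme Steve describes above can be sketched in a few lines. This is a toy, single-process illustration: `Network`, `Node`, and `reduction_barrier` are hypothetical names, and a real distributed implementation would use something like MPI_Allreduce for the global sum instead of summing in-process.

```python
# Toy illustration of the "net messages" reduction barrier: each node
# contributes (sent - received); the barrier completes only when the
# global sum is zero, i.e. every in-flight message has been drained.

class Node:
    def __init__(self):
        self.sent = 0       # messages this node has sent
        self.received = 0   # messages this node has processed
        self.inbox = []

    def drain(self):
        # Process everything that has already arrived.
        self.received += len(self.inbox)
        self.inbox.clear()

class Network:
    def __init__(self):
        self.in_flight = []   # packets sent but not yet arrived

    def send(self, src, dest, pkt):
        src.sent += 1
        self.in_flight.append((dest, pkt))

    def deliver_one_round(self):
        # Model link latency: everything in flight arrives this round.
        for dest, pkt in self.in_flight:
            dest.inbox.append(pkt)
        self.in_flight.clear()

def reduction_barrier(nodes, net):
    """Repeat the reduction until global (sent - received) hits zero."""
    iterations = 0
    while True:
        iterations += 1
        for n in nodes:
            n.drain()
        # Global reduction of (sent - received); zero => all delivered.
        if sum(n.sent - n.received for n in nodes) == 0:
            return iterations
        net.deliver_one_round()  # wait for stragglers, then re-check

net = Network()
a, b = Node(), Node()
net.send(a, b, "pkt")      # still in flight when the barrier starts
iters = reduction_barrier([a, b], net)
print(iters)  # 2: first pass sees sent - received == 1, second sees 0
```

The extra iterations are exactly the performance cost mentioned above: when a message is still in flight as the barrier starts, the reduction must run again before it can complete.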
>>
>> Steve
>>
>> On Thu, Jul 16, 2015 at 1:07 PM Mohammad Alian <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > Regarding combining MultiIface and the pd-gem5 network model, my
>> > understanding is that the MultiIface design is tightly dependent on a
>> > centralized module that does both packet forwarding and synchronization
>> > in the same place (the tcpserver in multi-gem5). The first thing that
>> > comes to mind is to integrate the barrier process's capabilities into
>> > the switch box model in pd-gem5. But by doing this, we would have to
>> > give up some of pd-gem5's desirable features: e.g., it would prevent us
>> > from having hierarchical network topologies (having local TOR switches
>> > inside each gem5 process that is simulating a rack), and it would
>> > introduce subtle issues if we want to integrate it across multiple
>> > synchronization domains some day.
>> >
>> > Maybe I didn't fully understand MultiIface. Gabor, please correct me if
>> > I'm wrong ...
>> >
>> > I understand your concerns about the robustness of the implementation,
>> > but doing synchronization independently has some benefits that you
>> > cannot achieve without it. Nevertheless, as I mentioned before, consider
>> > that a packet arrival violation almost never happens, and in those rare
>> > cases we can detect it and terminate the simulation. Please consider
>> > that we are synchronizing gem5 processes, which are orders of magnitude
>> > slower than physical hardware. Theoretically, this violation happens
>> > when the wall-clock time of sending a data packet from the source
>> > EtherTap (socket) to the destination one takes more than the wall-clock
>> > time of completing one global synchronization (sending a sync message to
>> > the barrier process, receiving the sync message back from the barrier,
>> > and simulating a quantum), which itself involves two back-and-forth
>> > socket communications between the gem5 processes and the barrier.
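The detect-and-terminate fallback Mohammad mentions can be sketched at the receiver (a hypothetical sketch, not taken from the patch; `SYNC_QUANTUM` and the function names are made up): each data packet carries the simulated tick at which it should be delivered, and if that tick is already in the past when the packet is read, the receiver aborts rather than silently delivering a late packet.

```python
# Hypothetical late-packet detection, following the argument above: a packet
# is "on time" only if the receiver has not yet simulated past its scheduled
# delivery tick when the packet is read from the socket.

SYNC_QUANTUM = 1000  # ticks per global synchronization quantum (made-up value)

class LatePacketError(Exception):
    pass

def check_arrival(current_tick, send_tick, link_delay_ticks):
    """Return the delivery tick, or raise if the packet arrived too late."""
    delivery_tick = send_tick + link_delay_ticks
    if delivery_tick < current_tick:
        # The receiver has already simulated past the delivery time:
        # terminate rather than deliver a packet into the past.
        raise LatePacketError(
            f"packet due at tick {delivery_tick}, already at {current_tick}")
    return delivery_tick

# On time: receiver is at tick 1500, packet due at tick 2000.
print(check_arrival(current_tick=1500, send_tick=1000, link_delay_ticks=1000))

# Too late: receiver has already reached tick 2500.
try:
    check_arrival(current_tick=2500, send_tick=1000, link_delay_ticks=1000)
except LatePacketError as e:
    print("violation:", e)
```

As argued above, the violating branch should essentially never fire in practice, because one global synchronization involves two socket round trips while a data packet needs only one.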
>> >
>> > Thanks,
>> > Mohammad
>> >
>> > On Thu, Jul 16, 2015 at 12:12 AM, Steve Reinhardt <[email protected]> wrote:
>> >
>> > > Sure, I am not saying we will get there soon, but I am glad we agree
>> > > on what is desirable.
>> > >
>> > > Actually, my reference to SST was not in regard to multi-threading
>> > > within a host, but as a system that parallelizes the simulation of a
>> > > single cache-coherent system across multiple hosts. I am not
>> > > advocating their approach :). I was just pushing back on your
>> > > statement that parallelizing the simulation of a single cache-coherent
>> > > system is "questionable" by providing a counter-example. If you want
>> > > to counter that by calling SST itself questionable, go ahead; I don't
>> > > know what their speedup numbers look like, so I can neither criticize
>> > > nor defend them on that point.
>> > >
>> > > When I mentioned that you could probably get decent speedups
>> > > parallelizing a large KNL-like coherent system across 4-8 cores, I was
>> > > thinking of a single-process parallel model like our multi-queue
>> > > model, where the synchronization overheads should be much lower. Also,
>> > > I meant "decent speedups" with respect to optimized single-threaded
>> > > simulations, factoring in the overheads of parallelization. We haven't
>> > > shown this yet, but I don't think there are fundamental reasons it
>> > > couldn't be achieved.
>> > >
>> > > Anyway, getting back to nearer-term issues, I'll say again that the
>> > > one thing I clearly prefer about pd-gem5 over multi-gem5 is that it is
>> > > using a real gem5 switch model, which indicates to me that it should
>> > > be possible to create a single-process single-threaded gem5 simulation
>> > > that gets the same result as a parallel simulation.
>> > > I don't think you can do that in multi-gem5, since the switch model
>> > > has to run in its own process, as it's not really a gem5 model. It's
>> > > not a fatal flaw, and in the near term there may not even be
>> > > significant practical consequences, but to me it's rather inelegant in
>> > > that it ties the parallelization and the simulation model very
>> > > intimately together, rather than trying to provide a general framework
>> > > for multi-host parallel simulation.
>> > >
>> > > For example, let's say I decided I wanted to model a non-Ethernet
>> > > network (maybe InfiniBand?), and wanted to model it in more detail
>> > > with multiple switches and links between the switches. Let's further
>> > > suppose that I wanted to build a single set of IB switch and link
>> > > models (as SimObjects) and use them in two modes: one with
>> > > trace-driven network traffic, where perhaps a single-threaded
>> > > single-process simulation would be fast enough, and one that is
>> > > execution-driven, where I would want to parallelize the simulation
>> > > across multiple hosts. It seems like that would be a lot more
>> > > straightforward in pd-gem5.
>> > >
>> > > So at a high level it seems to me that a solution that combines the
>> > > MultiIface work from multi-gem5 with the pd-gem5 switch model would be
>> > > the best of both. I haven't looked at the code closely enough to know
>> > > why that won't work, so I'll let you tell me.
>> > >
>> > > Steve
>> > >
>> > > On Mon, Jul 13, 2015 at 12:24 AM Andreas Hansson <[email protected]> wrote:
>> > >
>> > > > Hi Steve,
>> > > >
>> > > > Thanks for the elaborate comments. I agree with all the points of
>> > > > what is desired, but I am also painfully aware of some real
>> > > > empirical data points suggesting it will be difficult, if not
>> > > > impossible, to get there.
>> > > >
>> > > > To take a concrete example, you mention SST for multi-threading
>> > > > within one host. It may well give you 4-8X speedup doing so, but
>> > > > comparing gem5 classic to SST, we are looking at roughly a 4X speed
>> > > > difference. Hence, you only gain back what you lost in making it
>> > > > multi-threaded in the first place, and now you are using ~8X the
>> > > > resources. Hence my worry with trying to find one mechanism for
>> > > > doing it all within the simulator. I hope you're right (and I'm
>> > > > wrong), but I would like to see at least one data point hinting that
>> > > > it is possible to achieve what you are describing. So far I am not
>> > > > convinced.
>> > > >
>> > > > Andreas
>> > > >
>> > > > On 13/07/2015 05:36, "gem5-dev on behalf of Steve Reinhardt"
>> > > > <[email protected] on behalf of [email protected]> wrote:
>> > > >
>> > > > >Hi Andreas,
>> > > > >
>> > > > >Thanks for the comments---I partially agree, but I think the
>> > > > >structure of your comments is the most interesting part to me, as I
>> > > > >believe it reveals a difference in our thinking. I'll elaborate
>> > > > >below. (Now that I'm done, I'll apologize in advance for perhaps
>> > > > >elaborating too much!)
>> > > > >
>> > > > >On Wed, Jul 8, 2015 at 1:23 PM Andreas Hansson <[email protected]> wrote:
>> > > > >
>> > > > >> Gents,
>> > > > >>
>> > > > >> I'll let Gabor expound on the value of the non-synchronised
>> > > > >> checkpoints.
>> > > > >>
>> > > > >> When it comes to the parallelisation, I think it is pretty clear
>> > > > >> that:
>> > > > >>
>> > > > >> 1. The value of parallelising a single (cache-coherent) gem5
>> > > > >> instance is questionable,
>> > > > >
>> > > > >I think that depends a lot on the parameters. If you are trying to
>> > > > >model a quad-core system on a quad-core system, then I agree with
>> > > > >you.
>> > > > >However, if the number of simulated cores >> the number of host
>> > > > >cores, it can make sense, as then each host core will model
>> > > > >multiple simulated cores, and the relative overhead of
>> > > > >synchronization will go down. So if you're trying to model
>> > > > >something like a Knights Landing chip with 60+ cores, I expect you
>> > > > >could get pretty decent speedup if you parallelized the simulation
>> > > > >across 4-8 host cores.
>> > > > >
>> > > > >Things also look a little different if you're doing heterogeneous
>> > > > >nodes; perhaps you might benefit from having one thread model all
>> > > > >the CPUs while another thread (or a few threads) is used to model
>> > > > >the GPU.
>> > > > >
>> > > > >Note that, IIRC, the SST folks at Sandia are mostly using SST to
>> > > > >model large-scale multi-threaded systems, not distributed
>> > > > >message-passing systems---and this is using MPI for
>> > > > >parallelization, not shared memory.
>> > > > >
>> > > > >> and the cost of making gem5 thread safe is high.
>> > > > >
>> > > > >While this is indisputably true for the patch we have up on
>> > > > >reviewboard, I'm not convinced that's a fundamental truth. I think
>> > > > >that with some effort this cost can be driven down a lot.
>> > > > >
>> > > > >> That said, if someone wants to do it, the multi-event-queue
>> > > > >> approach seems like a good start.
>> > > > >
>> > > > >No argument there.
>> > > > >
>> > > > >> 2. Parallelising gem5 on the node level and the inter-node
>> > > > >> level, using one mechanism, seems like an odd goal.
>> > > > >
>> > > > >When you say "node" here, do you mean host node or simulated node?
>> > > > >If the former, I agree; if the latter, I disagree.
>> > > > >
>> > > > >In particular, if you mean the latter, then the extrapolation of
>> > > > >what you're saying is that we will end up with one model of a
>> > > > >multi-node system if we're going to run the model on a single host,
>> > > > >and a different model of the same multi-node system if we intend to
>> > > > >run the model on multiple hosts---like what we see now with
>> > > > >multi-gem5, where the switch model for a distributed simulation
>> > > > >isn't even a gem5 model and couldn't be used if you wanted to run
>> > > > >the whole model inside a single gem5 process. Having a single
>> > > > >simulation model that doesn't change regardless of how we execute
>> > > > >the simulation seems a lot more elegant to me, and we actually
>> > > > >achieve that with the multi-event-queue feature. Obviously there
>> > > > >will be practical constraints on how a model can be partitioned
>> > > > >across multiple host nodes, and little things like instantiating a
>> > > > >different flavor of EtherLink depending on whether it's an
>> > > > >intra-host-node or inter-host-node connection don't bother me that
>> > > > >much, but to the extent possible I believe we should keep those as
>> > > > >merely practical constraints and not fundamental limitations.
>> > > > >
>> > > > >> Just like OpenMP and OpenMPI are well suited for different
>> > > > >> communication mechanisms, I would argue that we need
>> > > > >> parallelisation techniques well suited for the systems the
>> > > > >> simulation will run on.
>> > > > >
>> > > > >Yes, I agree, we need a message-based parallelization scheme for
>> > > > >multi-node hosts, and a shared-memory-based scheme for
>> > > > >intra-host-node parallelization. Two different techniques for two
>> > > > >different environments.
>> > > > >But that doesn't mean they can't co-exist & complement each other,
>> > > > >rather than being mutually exclusive options, much like many
>> > > > >programs are written in MPI+OpenMP.
>> > > > >
>> > > > >> A very natural (and efficient) way of doing things is to map each
>> > > > >> gem5 instance (and thus simulated node) to a host machine, and
>> > > > >> have the host machines communicate over Ethernet.
>> > > > >
>> > > > >That's certainly natural if the number of simulated nodes is equal
>> > > > >to the number of host nodes. It's not so obvious to me that you
>> > > > >want every simulated node in its own gem5 process, communicating
>> > > > >over sockets, if the number of simulated nodes >> the number of
>> > > > >host nodes. Sure, given that the code is written, that's a quick
>> > > > >way to get things working while we polish up the multi-event-queue
>> > > > >fixes, but that doesn't mean it's the ideal long-term strategy. In
>> > > > >particular, if you go to the degenerate case where we have multiple
>> > > > >simulated nodes and a single host node, then using multiple
>> > > > >processes means we have two different parallelization strategies
>> > > > >for running on a multi-core shared-memory host. Not that we would
>> > > > >(or could) ban people from running multiple gem5 instances on a
>> > > > >single system, but a more relevant question is: given a finite
>> > > > >amount of effort, would we want to spend it on writing a
>> > > > >shared-memory backend for MultiIface or on addressing the
>> > > > >performance issues in making a single gem5 process thread-safe?
>> > > > >Obviously I favor the latter, because I think it's a more general
>> > > > >solution, and I believe one that will lead to higher performance on
>> > > > >single-node hosts in the end.
>> > > > >
>> > > > >Note that I'm definitely not saying that all of this needs to be
>> > > > >implemented before we commit anything from multi-gem5 or pd-gem5.
>> > > > >I'm just trying to establish a vision for where we think gem5
>> > > > >should go with respect to parallelization, so that we can choose
>> > > > >short-term steps that align best with that destination, even if
>> > > > >they are only initial steps down that path.
>> > > > >
>> > > > >Steve
>> > > > >
>> > > > >> Do you agree?
>> > > > >>
>> > > > >> Andreas
>> > > > >
>> > > > >_______________________________________________
>> > > > >gem5-dev mailing list
>> > > > >[email protected]
>> > > > >http://m5sim.org/mailman/listinfo/gem5-dev
