Hi Steve,

Sorry for the misinterpretation; my comment on communication was not
correct, and there is certainly value in using other programming models. My
argument is that if we could get multi-threaded gem5 working and model
multiple nodes inside one multi-threaded gem5 process (which is definitely
faster than using separate gem5 processes), then it's reasonable to assume
that a pd-gem5 simulation would need no more than a couple of gem5
processes, and since we could then synchronize pd-gem5 nodes at a coarser
granularity, the value of higher-performance communication programming
models would fade. Nevertheless, although pd-gem5 doesn't provide abstract
functions for communication, I think the effort to implement it with
another programming model would be on par with the effort needed for
multi-gem5.

Maybe I'm missing something, but I didn't understand your point here: "Note
that this doesn't strictly require that the switch model is co-located with
this central coordination process; that just happens to be convenient and
efficient." I cannot see how we can have both a central server and a
(potentially distributed) switch model, because the central server does the
packet routing that is supposed to be done in the switch model.

Thank you for proposing the reduction method for ensuring on-time packet
delivery. We can implement this in pd-gem5, and I'm working on that right
now. Also, I modified the pd-gem5 patch to enable replicating the same
simulation inside one process, as well as distributing the switch model
across full-system gem5 processes (hierarchical network topologies). This
should also work with multi-threaded gem5, as the synchronization of
pd-gem5 nodes is independent of the internal synchronization of
multi-threaded gem5 processes. I'll update the pd-gem5 patch soon.
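
For concreteness, here is the shape of the check I'm implementing. This is
just a sketch that uses MPI_Allreduce as the reduction primitive; pd-gem5
actually talks to the barrier process over sockets, and net_msgs,
poll_incoming(), and the other names below are placeholders rather than
real pd-gem5 code:

    #include <mpi.h>

    // Net count of in-flight packets from this node's point of view:
    // bumped when our EtherTap sends a packet, decremented on receive.
    static long net_msgs = 0;

    void on_packet_sent()     { ++net_msgs; }
    void on_packet_received() { --net_msgs; }

    // Placeholder: drain pending receives so new arrivals are counted
    // before the next reduction iteration.
    void poll_incoming() { /* service the EtherTap receive path */ }

    // Quantum boundary: instead of a plain barrier, sum the per-node
    // counters and repeat until every sent packet has been received.
    void global_sync()
    {
        for (;;) {
            long global = 0;
            MPI_Allreduce(&net_msgs, &global, 1, MPI_LONG, MPI_SUM,
                          MPI_COMM_WORLD);
            if (global == 0)
                break;        // all messages delivered; safe to proceed
            poll_incoming();  // otherwise drain and reduce again
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        // ... simulate one quantum, updating net_msgs via the hooks ...
        global_sync();        // replaces the barrier at the boundary
        MPI_Finalize();
        return 0;
    }

As you noted, the count should hit zero on the first iteration almost all
the time, so the extra cost over a plain barrier should be negligible.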

Thank you,
Mohammad

On Sat, Jul 18, 2015 at 7:44 PM, Steve Reinhardt <[email protected]> wrote:

> Hi Mohammad,
>
> Thanks for the summaries & responses.
>
> I agree with your summaries on synchronization and checkpointing. However,
> as far as communication goes, I'd like to clarify, as I'm not sure exactly
> what you mean by "communicating through socket is sufficient and we don’t
> need to expand this with other programming models". Socket communication
> is sufficient for now, but I think there is potentially a lot of value in
> being able to take advantage of higher-performance networking models such
> as MPI and InfiniBand. That's what I like about MultiIface: it
> provides an abstraction that should enable the development of other
> messaging layers. It's a little premature to know exactly how well that
> will work until you try and develop a second implementation, but at least
> the concept is right.
>
> As far as synchronization goes, that's a harder problem. I do think we
> need to have a model that always works, not almost always works. This is
> particularly challenging with sockets, which weren't built for fine-grain
> communication and synchronization, which is one reason why I think there
> would be a lot of value in moving to more HPC-oriented communication models
> like MPI on systems that support it. In fact, I know that the common MPI
> platforms (MPICH and Open MPI) both support Ethernet transport, so it might
> even be the case that coding to MPI would provide performance that's as
> good as or better than going directly to sockets.
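>
> For what it's worth, with Open MPI you can select the TCP transport
> explicitly (this is just Open MPI's standard MCA knob, nothing
> gem5-specific), e.g.:
>
>     mpirun -np 8 --mca btl tcp,self ./gem5.opt ...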
>
> The advantage of having all communication routed through a single central
> server (as in multi-gem5) is that you can provide ordering guarantees
> between the point-to-point messages and the barrier messages, so that you
> can guarantee that a message that's sent before a barrier is initiated has
> been received and processed before that barrier completes. Note that this
> doesn't strictly require that the switch model is co-located with this
> central coordination process; that just happens to be convenient and
> efficient.
>
> There are ways other than relying on socket ordering to guarantee that all
> messages have been delivered. For example, if you have each node track
> the net number of messages it has sent (msgs sent - msgs rcvd) and do a
> reduction on this value instead of a simpler barrier, then you know all
> messages have been delivered when this value reaches zero. (In fact, IIRC,
> that's what we did in WWT, using the CM-5 hardware reduction network.) You
> have to iterate over the reduction until the value is zero, which could
> theoretically cost some performance, but depending on the timing of the
> network the number of iterations would be small---though you'd have to hit
> zero on the first try most of the time to give the same performance as a
> barrier.
>
> So I admit I hadn't thought enough about this before, but we shouldn't
> consider the multi-gem5 and pd-gem5 approaches as the only two possible
> ways of doing communication & synchronization.
>
> I'm going to consult with our local MPI expert and see what he thinks about
> using MPI here.
>
> Steve
>
> On Thu, Jul 16, 2015 at 1:07 PM Mohammad Alian <[email protected]> wrote:
>
> > Hi,
> >
> > Regarding combining MultiIface and the pd-gem5 network model, my
> > understanding is that the MultiIface design is tightly dependent on a
> > centralized module that does both packet forwarding and synchronization
> > in the same place (tcpserver in multi-gem5). The first thing that comes
> > to mind is to integrate the barrier-process capabilities into the
> > switch box model in pd-gem5. But by doing this, we would give up some
> > desirable pd-gem5 features: e.g., it would prevent us from having
> > hierarchical network topologies (local TOR switches inside each gem5
> > process that simulates a rack), and it would introduce subtle issues if
> > we want to integrate it with multiple synchronization domains some day.
> >
> > Maybe I didn't fully understand MultiIface. Gabor, please correct me if
> > I'm wrong ...
> >
> >
> > I understand your concerns about the robustness of the implementation,
> > but doing synchronization independently has some benefits that you
> > cannot achieve without it. Nevertheless, as I mentioned before, a
> > packet arrival violation almost never happens, and in those rare cases
> > we can detect it and terminate the simulation. Please consider that we
> > are synchronizing gem5 processes, which are orders of magnitude slower
> > than physical hardware. Theoretically, a violation happens only when
> > the wall clock time of sending a data packet from the source EtherTap
> > (socket) to the destination one exceeds the wall clock time of
> > completing one global synchronization (sending a sync message to the
> > barrier process, receiving the sync message back from the barrier, and
> > simulating a quantum), which itself involves two back-and-forth socket
> > communications between the gem5 processes and the barrier.
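> >
> > (Stated as an inequality, in my own shorthand rather than anything from
> > the patch, a violation requires
> >
> >     T_packet(src tap -> dst tap) >
> >         2 * T_roundtrip(gem5 <-> barrier) + T_simulate(quantum)
> >
> > and the right-hand side is normally far larger, since it includes
> > simulating an entire quantum.)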
> >
> > Thanks,
> > Mohammad
> >
> >
> > On Thu, Jul 16, 2015 at 12:12 AM, Steve Reinhardt <[email protected]>
> > wrote:
> >
> > > Sure, I am not saying we will get there soon, but I am glad we agree on
> > > what is desirable.
> > >
> > > Actually my reference to SST was not in regard to multi-threading
> > > within a host, but to SST as a system that parallelizes the simulation
> > > of a single cache-coherent system across multiple hosts. I am not
> > > advocating their approach :). I was just pushing back on your statement
> > > that parallelizing the simulation of a single cache-coherent system is
> > > "questionable" by providing a counter-example. If you want to counter
> > > that by calling SST itself questionable, go ahead; I don't know what
> > > their speedup numbers look like, so I can neither criticize nor defend
> > > them on that point.
> > >
> > > When I mentioned that you could probably get decent speedups
> > > parallelizing a large KNL-like coherent system across 4-8 cores, I was
> > > thinking of a single-process parallel model like our multi-queue model,
> > > where the synchronization overheads should be much lower. Also, I meant
> > > "decent speedups" with respect to optimized single-threaded
> > > simulations, factoring in the overheads of parallelization. We haven't
> > > shown this yet, but I don't think there are fundamental reasons it
> > > couldn't be achieved.
> > >
> > > Anyway, getting back to nearer-term issues, I'll say again that the
> > > one thing I clearly prefer about pd-gem5 over multi-gem5 is that it is
> > > using a real gem5 switch model, which indicates to me that it should be
> > > possible to create a single-process single-threaded gem5 simulation
> > > that gets the same result as a parallel simulation. I don't think you
> > > can do that in multi-gem5, since you have to have the switch model
> > > running in its own process because it's not really a gem5 model. It's
> > > not a fatal flaw, and in the near term there may not even be
> > > significant practical consequences, but to me it's rather inelegant in
> > > that it ties the parallelization and the simulation model very
> > > intimately together, rather than trying to provide a general framework
> > > for multi-host parallel simulation.
> > >
> > > For example, let's say I decided I wanted to model a non-Ethernet
> > > network (maybe InfiniBand?), and wanted to model it in more detail with
> > > multiple switches and links between the switches. Let's further suppose
> > > that I wanted to build a single set of IB switch and link models (as
> > > SimObjects) and use them in two modes: one trace-driven, where perhaps
> > > a single-threaded single-process simulation would be fast enough, and
> > > one execution-driven, where I would want to parallelize the simulation
> > > across multiple hosts. It seems like that would be a lot more
> > > straightforward in pd-gem5.
> > >
> > > So at a high level it seems to me that a solution that combines the
> > > MultiIface work from multi-gem5 with the pd-gem5 switch model would be
> > > the best of both. I haven't looked at the code closely enough to know
> > > why that won't work, so I'll let you tell me.
> > >
> > > Steve
> > >
> > >
> > > On Mon, Jul 13, 2015 at 12:24 AM Andreas Hansson
> > > <[email protected]> wrote:
> > >
> > > > Hi Steve,
> > > >
> > > > Thanks for the elaborate comments. I agree with all the points
> > > > about what is desired, but I am also painfully aware of some real
> > > > empirical data points suggesting it will be difficult, if not
> > > > impossible, to get there.
> > > >
> > > > To take a concrete example, you mention SST for multi-threading
> > > > within one host. It may well give you a 4-8X speedup doing so, but
> > > > comparing gem5 classic to SST, we are looking at roughly a 4X speed
> > > > difference. Hence, you only gain back what you lost in making it
> > > > multi-threaded in the first place, and now you are using ~8X the
> > > > resources. Hence my worry about trying to find one mechanism or do
> > > > it all within the simulator. I hope you’re right (and I’m wrong),
> > > > but I would like to see at least one data point hinting that it is
> > > > possible to achieve what you are describing. So far I am not
> > > > convinced.
> > > >
> > > > Andreas
> > > >
> > > >
> > > > On 13/07/2015 05:36, "gem5-dev on behalf of Steve Reinhardt"
> > > > <[email protected] on behalf of [email protected]> wrote:
> > > >
> > > > >Hi Andreas,
> > > > >
> > > > >Thanks for the comments---I partially agree, but I think the
> > > > >structure of your comments is the most interesting to me, as I
> > > > >believe it reveals a difference in our thinking. I'll elaborate
> > > > >below. (Now that I'm done, I'll apologize in advance for perhaps
> > > > >elaborating too much!)
> > > > >
> > > > >On Wed, Jul 8, 2015 at 1:23 PM Andreas Hansson
> > > > ><[email protected]> wrote:
> > > > >
> > > > >> Gents,
> > > > >>
> > > > >> I’ll let Gabor expound on the value of the non-synchronised
> > > > >> checkpoints.
> > > > >>
> > > > >> When it comes to the parallelisation, I think it is pretty clear
> > > > >> that:
> > > > >>
> > > > >> 1. The value of parallelising a single (cache-coherent) gem5
> > > > >> instance is questionable,
> > > > >
> > > > >
> > > > >I think that depends a lot on the parameters. If you are trying
> > > > >to model a quad-core system on a quad-core system, then I agree
> > > > >with you. However, if the number of simulated cores >> the number
> > > > >of host cores, it can make sense, as then each host core will model
> > > > >multiple simulated cores, and the relative overhead of
> > > > >synchronization will go down. So if you're trying to model
> > > > >something like a Knights Landing chip with 60+ cores, I expect you
> > > > >could get pretty decent speedup if you parallelized the simulation
> > > > >across 4-8 host cores.
> > > > >
> > > > >Things also look a little different if you're doing heterogeneous
> > > > >nodes; perhaps you might benefit from having one thread model all
> > > > >the CPUs while another thread (or a few threads) models the GPU.
> > > > >
> > > > >Note that, IIRC, the SST folks at Sandia are mostly using SST to
> > > > >model large-scale multi-threaded systems, not distributed
> > > > >message-passing systems---and this is using MPI for
> > > > >parallelization, not shared memory.
> > > > >
> > > > >
> > > > >> and the cost of making gem5 thread safe is high.
> > > > >
> > > > >
> > > > >While this is indisputably true for the patch we have up on
> > > > >reviewboard, I'm not convinced that's a fundamental truth. I think
> > > > >that with some effort this cost can be driven down a lot.
> > > > >
> > > > >
> > > > >> That said, if someone wants to do it, the multi-event-queue
> > > > >> approach seems like a good start.
> > > > >>
> > > > >
> > > > >No argument there.
> > > > >
> > > > >
> > > > >>
> > > > >> 2. Parallelising gem5 on the node level and the inter-node
> > > > >> level using one mechanism seems like an odd goal.
> > > > >
> > > > >
> > > > >When you say "node" here, do you mean host node or simulated
> > > > >node? If the former, I agree; if the latter, I disagree.
> > > > >
> > > > >In particular, if you mean the latter, then the extrapolation of
> > > > >what you're saying is that we will end up with one model of a
> > > > >multi-node system if we're going to run the model on a single host,
> > > > >and a different model of the same multi-node system if we intend to
> > > > >run the model on multiple hosts---like what we see now with
> > > > >multi-gem5, where the switch model for a distributed simulation
> > > > >isn't even a gem5 model and couldn't be used if you wanted to run
> > > > >the whole model inside a single gem5 process. Having a single
> > > > >simulation model that doesn't change regardless of how we execute
> > > > >the simulation seems a lot more elegant to me, and we actually
> > > > >achieve that with the multi-event-queue feature. Obviously there
> > > > >will be practical constraints on how a model can be partitioned
> > > > >across multiple host nodes, and little things like instantiating a
> > > > >different flavor of EtherLink depending on whether it's an
> > > > >intra-host-node or inter-host-node connection don't bother me that
> > > > >much, but to the extent possible I believe we should keep those as
> > > > >merely practical constraints and not fundamental limitations.
> > > > >
> > > > >
> > > > >> Just like OpenMP and OpenMPI are well suited for different
> > > > >> communication mechanisms, I would argue that we need
> > > > >> parallelisation techniques well suited for the systems the
> > > > >> simulation will run on.
> > > > >
> > > > >
> > > > >Yes, I agree, we need a message-based parallelization scheme for
> > > > >multi-node hosts, and a shared-memory-based scheme for
> > > > >intra-host-node parallelization. Two different techniques for two
> > > > >different environments. But that doesn't mean they can't co-exist &
> > > > >complement each other, rather than being mutually exclusive
> > > > >options, much like many programs are written in MPI+OpenMP.
> > > > >
> > > > >
> > > > >> A very natural (and efficient) way of doing things is to map
> > > > >> each gem5 instance (and thus simulated node) to a host machine,
> > > > >> and have the host machines communicate over Ethernet.
> > > > >>
> > > > >
> > > > >That's certainly natural if the number of simulated nodes is
> > > > >equal to the number of host nodes. It's not so obvious to me that
> > > > >you want every simulated node in its own gem5 process,
> > > > >communicating over sockets, if the number of simulated nodes >> the
> > > > >number of host nodes. Sure, given that the code is written, that's
> > > > >a quick way to get things working while we polish up the
> > > > >multi-event-queue fixes, but that doesn't mean it's the ideal
> > > > >long-term strategy. In particular, if you go to the degenerate case
> > > > >where we have multiple simulated nodes and a single host node, then
> > > > >using multiple processes means we have two different
> > > > >parallelization strategies for running on a multi-core
> > > > >shared-memory host. Not that we would (or could) ban people from
> > > > >running multiple gem5 instances on a single system, but a more
> > > > >relevant question is: given a finite amount of effort, would we
> > > > >want to spend it on writing a shared-memory backend for MultiIface
> > > > >or on addressing the performance issues in making a single gem5
> > > > >process thread-safe? Obviously I favor the latter, because I think
> > > > >it's a more general solution, and I believe one that will lead to
> > > > >higher performance on single-node hosts in the end.
> > > > >
> > > > >Note that I'm definitely not saying that all of this needs to be
> > > > >implemented before we commit anything from multi-gem5 or pd-gem5.
> > > > >I'm just trying to establish a vision for where we think gem5
> > > > >should go with respect to parallelization, so that we can choose
> > > > >short-term steps that align best with that destination, even if
> > > > >they are only initial steps down that path.
> > > > >
> > > > >Steve
> > > > >
> > > > >
> > > > >>
> > > > >> Do you agree?
> > > > >>
> > > > >> Andreas
> > > > >>
_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
