Thanks for participating in this conversation, Steve.

Setting aside the usefulness of unsynchronized checkpoints, which is arguable,
my understanding is that they also have issues in terms of correctness. In the
example that Steve laid out, after restoring from a checkpoint, one node is
simulating to catch up to the other nodes while the others wait (i.e., do no
simulation). What will happen if the trailing node wants to communicate with
the others while it is still completing the first, interrupted periodic sync?
(This is related to the issue I raised in my previous email.)

Regarding integration with multi-threaded gem5: I'm fairly sure that we can do
hierarchical modeling of a cluster using pd-gem5 with single-threaded gem5
(with a local switch in each gem5 process); in fact, this is the main reason
we chose to simulate the switch box inside gem5. I don't see anything that
prevents us from integrating pd-gem5 with multi-threaded gem5, because the
synchronization between pd-gem5 nodes is done independently of the internal
synchronization of multi-threaded gem5. At a high level, pd-gem5 connects
several independent gem5 entities (full systems) together and ensures
accurate, deterministic, and correctly timed communication between them.
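
As a toy illustration only (made-up latencies; this is not pd-gem5 or gem5
code): if the inter-process quantum is set to the inter-node link latency and
the per-process thread quantum to the smaller intra-node latency, every
inter-process barrier tick coincides with a thread barrier tick, so the two
synchronization layers never have to interrupt each other.

    // Toy illustration of two nested synchronization quanta.
    // The latencies are made up; they are not pd-gem5 parameters.
    #include <cstdint>
    #include <cstdio>

    int main()
    {
        const std::uint64_t interNodeQuantum = 10000; // e.g. inter-node link latency
        const std::uint64_t intraNodeQuantum = 1000;  // e.g. intra-node link latency

        // Because the inner quantum divides the outer one, every pd-gem5
        // barrier tick is also a thread barrier tick.
        for (std::uint64_t t = intraNodeQuantum; t <= 2 * interNodeQuantum;
             t += intraNodeQuantum) {
            bool outer = (t % interNodeQuantum) == 0;
            std::printf("tick %6llu: thread barrier%s\n",
                        (unsigned long long)t,
                        outer ? " + pd-gem5 barrier" : "");
        }
        return 0;
    }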

Thanks,
Mohammad

On Wed, Jul 8, 2015 at 1:04 PM, Steve Reinhardt <ste...@gmail.com> wrote:

> Thanks for keeping the conversation going. Sorry to not be participating
> more regularly. At a high level, I still have concerns about both
> unsynchronized checkpoints and interaction with multithreading.
>
> *Unsynchronized checkpoints:* I'm still not completely clear on the
> MPI_Barrier() example. Let's say there's one node that's far behind, while
> the other N-1 nodes reach the MPI barrier quickly. (Probably not
> unrealistic for the first MPI barrier, where there's likely a single node
> doing some serial initialization while the other nodes wait.) So the N-1
> nodes reach the MPI barrier, send out messages indicating that they've done
> so, then wait for the trailing node to catch up. Whatever way the MPI
> barrier code waits (whether it's in a spin loop, or a sleep, or whatever),
> the only thing that will prevent the simulation of those N-1 nodes from
> advancing to infinity is the first gem5 sync barrier.  So while those nodes
> are waiting at the simulated application level for an MPI barrier
> completion message, they will also all be waiting at the gem5 level at the
> first sync barrier.
>
> Now the trailing node comes along and reaches the MPI barrier at some tick
> ahead of the first sync barrier. It will see that all the other nodes have
> reached the MPI barrier, send out message(s) indicating that it has reached
> the MPI barrier, and potentially continue past the MPI barrier. Meanwhile
> the other nodes will receive MPI barrier completion messages, but the
> earliest timestamp at which they can process those messages is at the time
> of the gem5 sync barrier, because they have all simulated ahead and are
> waiting there. So they will all wait while the trailing node simulates up
> to the gem5 sync barrier.
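>
> To put toy numbers on that timing (illustrative only, not gem5 code): even
> though the completion message has an earlier receive timestamp, the waiting
> nodes have already simulated past it, so the earliest point at which they
> can act on it is the sync barrier tick.
>
>     // Toy timeline for the scenario above; the ticks are made up.
>     #include <algorithm>
>     #include <cstdint>
>     #include <cstdio>
>
>     int main()
>     {
>         const std::uint64_t syncBarrierTick = 1000000; // first gem5 sync barrier
>         const std::uint64_t trailingMpiTick = 400000;  // trailing node hits MPI barrier
>         const std::uint64_t linkLatency     = 1000;
>
>         // The N-1 waiting nodes are parked at the sync barrier, so the
>         // earliest tick at which they can process the completion message
>         // is the barrier tick, not the message's receive tick.
>         std::uint64_t recvTick = trailingMpiTick + linkLatency;
>         std::uint64_t processTick = std::max(recvTick, syncBarrierTick);
>         std::printf("receive tick %llu, earliest processing tick %llu\n",
>                     (unsigned long long)recvTick,
>                     (unsigned long long)processTick);
>         return 0;
>     }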
>
> Thus I'll argue that, in this case, the only part of ROI simulation that
> you capture by doing an unsynchronized checkpoint at the beginning of the
> MPI_barrier() call rather than a synchronized checkpoint as part of the
> first gem5 sync barrier is a completely unrealistic part of the ROI where
> one node is executing serially just to catch up to the other nodes, which
> have simulated too far ahead because the first sync was placed too far in
> the future (as part of an expected performance/accuracy tradeoff). Do you
> agree or disagree? I assume that, since you think unsynchronized
> checkpoints are important, there must be more to it than this, but I'm
> laying out this scenario so you can explain to me where I'm wrong and why
> unsynchronized checkpoints are more useful than they seem to me.
>
> *Interaction with multithreading:* I guess there are several different
> things to say in this area. First, apologies that the multithreading code
> is not as robust as I recalled.  I knew we had it working internally, but
> forgot that there was still that outstanding patch that only works for x86
> and needs some additional polishing before it's ready to be committed.
> Nevertheless, that's work that really should be done, so I don't want us to
> make strategic decisions about our parallelization strategy based on the
> interim, fixable status of this piece of code.
>
> Second, to correct Gabor's misunderstanding, the problems with the
> multithreading code don't have anything to do with EtherLink object. It
> turns out that the way the event scheduling functions are written, they
> automatically know whether you're scheduling on the object's "own" event
> queue or not, and stuff just works more cleanly than we even expected it
> would. The problems are really lack of thread safety in other parts of the
> code, in areas such as the decoded instruction cache, reference counting,
> and syscall emulation functions (see http://reviews.gem5.org/r/2320 for
> details).
>
> Third, as far as a shared-memory transport for MultiIface: I understand
> that it could be done, and might not even be that hard. My point is that,
> assuming the multi-threading issues are addressed, I believe it would be
> redundant (and potentially confusing to users) to have two different ways
> of running gem5 in parallel on a single shared-memory host.
>
> Actually my main point is not directly related to multithreading, but more
> just the generality of the model; why is it that multi-gem5 would not support
> a hierarchical simulation (even without multithreading), where each gem5
> process modeled several nodes and a local switch, with MultiEtherLink
> objects used only to connect the local switches with a global switch?
>
> Thanks,
>
> Steve
>
> On Wed, Jul 8, 2015 at 6:48 AM Gabor Dozsa <gabor.do...@arm.com> wrote:
>
> > Thanks for the example, I can see your point. Indeed, a gem5 process can
> > miss a receive tick when we restore from a checkpoint with a smaller link
> > latency but this can only happen while that particular gem5 is blocked
> > waiting for the very first periodic sync to complete. This is already
> > handled as a special state in multi-gem5 (since we have to complete the
> > “interrupted” periodic sync right after restoring from the checkpoint).
> >
> > - Gabor
> >
> > On 7/7/15, 6:10 PM, "Mohammad Alian" <al...@wisc.edu> wrote:
> >
> > >I think you didn't understand my point. I'll explain it with an example.
> > >
> > >>> A receive tick of a packet cannot fall into the current quantum so
> > >>> every packet can get scheduled for receive properly even if a
> > >>> checkpoint/restore happens during a quantum.
> > >
> > >This assumption is true when "quantum size <= link_latency". But
> > >link_latency is not fixed, it's a parameter.
> > >Assume you take a checkpoint with q=10 and have 3 nodes, and the
> > >checkpoint is taken @tick=11 on node0. So assume these are the tick values
> > >of the nodes when you take the unsync ckpt: node0:11, node1:20, node2:20.
> > >If you restore with a quantum smaller than 10, then your above statement
> > >does not hold. So you cannot restore from a checkpoint with a link_latency
> > >smaller than the value that you took the checkpoint with!
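> > >
> > >To make those numbers concrete, here is a toy check (illustrative code
> > >only, not the gem5/pd-gem5 implementation) of how a receive tick can be
> > >missed when the restored link latency is smaller than the one the
> > >checkpoint was taken with:
> > >
> > >    // Toy check of the example above; not gem5 code.
> > >    #include <cstdint>
> > >    #include <cstdio>
> > >
> > >    int main()
> > >    {
> > >        const std::uint64_t node0Tick = 11;  // trailing node at checkpoint
> > >        const std::uint64_t node1Tick = 20;  // node1 already at the sync point
> > >        const std::uint64_t oldLinkLat = 10; // latency the ckpt was taken with
> > >        const std::uint64_t newLinkLat = 5;  // smaller latency after restore
> > >
> > >        // While node0 catches up to the first sync, it sends a packet.
> > >        const std::uint64_t sendTick = node0Tick + 1;
> > >
> > >        // With the original latency the receive tick lands after node1's
> > >        // restore tick; with the smaller one it lands in node1's past.
> > >        std::printf("old latency: recv tick %llu (>= %llu, ok)\n",
> > >                    (unsigned long long)(sendTick + oldLinkLat),
> > >                    (unsigned long long)node1Tick);
> > >        std::printf("new latency: recv tick %llu (< %llu, missed)\n",
> > >                    (unsigned long long)(sendTick + newLinkLat),
> > >                    (unsigned long long)node1Tick);
> > >        return 0;
> > >    }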
> > >
> > >Mohammad
> > >
> > >On Tue, Jul 7, 2015 at 11:05 AM, Gabor Dozsa <gabor.do...@arm.com>
> wrote:
> > >
> > >> Mohammad, I’m not sure what you mean by “taking a checkpoint with
> > >>quantum
> > >> size smaller than link latency”.
> > >>
> > >> In multi-gem5, the quantum size and the checkpoint are completely
> > >> independent. The quantum is the number of ticks simulated between two
> > >> consecutive periodic syncs - that’s why every periodic sync happens at
> > >> the same tick in each gem5 process. A checkpoint can be taken at any
> > >> point within a quantum. After the checkpoint is taken, each gem5 process
> > >> completes what remains of the current quantum and then enters the next
> > >> periodic sync.
> > >>
> > >> When fast-forwarding, you can increase link latency to allow larger
> > >> quantum and reduce periodic sync overhead. Does that make sense?
> > >>
> > >> - Gabor
> > >>
> > >> On 7/7/15, 4:11 PM, "Mohammad Alian" <al...@wisc.edu> wrote:
> > >>
> > >> >Then you are assuming that the checkpoint is taken with a quantum size
> > >> >smaller than the link latency, which contradicts your initial motivation
> > >> >for unsync checkpoints! (I copied this sentence from earlier messages in
> > >> >the thread as a reminder:)
> > >> >"Shortening the quantum can help, but usually the snapshot is being
> > >> >taken while 'fast-forwarding', i.e. simulating as fast as possible, which
> > >> >would motivate a longer quantum."
> > >> >
> > >> >What if somebody wants to relax synchronization and take a checkpoint?
> > >> >
> > >> >On Tue, Jul 7, 2015 at 7:38 AM, Gabor Dozsa <gabor.do...@arm.com>
> > >>wrote:
> > >> >
> > >> >>
> > >> >> Hi Mohammad and all,
> > >> >>
> > >> >> gem5 processes may restore at a different tick from a checkpoint but
> > >> >> the next periodic sync will happen at the same tick in all gem5s. A
> > >> >> receive tick of a packet cannot fall into the current quantum so every
> > >> >> packet can get scheduled for receive properly even if a
> > >> >> checkpoint/restore happens during a quantum.
> > >> >>
> > >> >> Regarding your multi-threaded dual config, my understanding is that
> > >> >> EtherLink is not prepared to work with multi threading as it lacks
> > >> >>thread
> > >> >> safety. The multiple event queues/threads config only works if the
> > >> >>systems
> > >> >> are independent.
> > >> >>
> > >> >> One possible way to fix that is to provide a "multi-thread” based
> > >> >> implementation for MultiIface ;-)
> > >> >>
> > >> >> - Gabor
> > >> >>
> > >> >> On 7/7/15, 6:29 AM, "Mohammad Alian" <al...@wisc.edu> wrote:
> > >> >>
> > >> >> >Gabor- My concern about unsync checkpoints is that when you restore
> > >> >> >from an unsync checkpoint, you'll have gem5 processes that are each
> > >> >> >running at a different tick. Then how do you handle accurate delivery
> > >> >> >of packets between these gem5 processes? It will also make it harder
> > >> >> >to integrate multi/pd-gem5 with the current multi-threaded gem5. The
> > >> >> >problem with sync checkpoints is that you cannot take the checkpoint
> > >> >> >exactly at the ROI, but I think unsync checkpoints introduce some
> > >> >> >other problems. Considering the necessary warmup period before
> > >> >> >starting stat collection, I think we don't need to exactly pinpoint
> > >> >> >the ROI. Please correct me if I'm wrong.
> > >> >> >
> > >> >> >I'm trying to run a multi-threaded experiment with pd-gem5, but I
> > >>got
> > >> >>an
> > >> >> >error when I tried to partition dual mode simulation on two
> > >>threads. I
> > >> >> >posted that in gem5 users mailing list. Please help me on that if
> > >>you
> > >> >>can.
> > >> >> >
> > >> >> >Thank you,
> > >> >> >Mohammad
> > >> >> >
> > >> >> >On Mon, Jul 6, 2015 at 11:45 AM, Gabor Dozsa <gabor.do...@arm.com
> >
> > >> >>wrote:
> > >> >> >
> > >> >> >> Thank you Steve for the detailed elaboration on the issues.
> > >> >> >>
> > >> >> >>
> > >> >> >> Regarding the “unsynchronized checkpoints”, the terminology might
> > >> >> >> be a bit confusing. In fact, we always need to do a global
> > >> >> >> synchronization among the gem5 processes before taking a
> > >> >> >> distributed checkpoint (in order to avoid in-flight packets). The
> > >> >> >> global synchronization here means that each gem5 has to suspend the
> > >> >> >> simulation and wait until every in-flight packet arrives (and is
> > >> >> >> stored) at the destination gem5 process. If that global
> > >> >> >> synchronization step happens at the same simulated tick in each
> > >> >> >> gem5 then we call the checkpoint “synchronous”; otherwise it is an
> > >> >> >> “asynchronous” checkpoint.
> > >> >> >>
> > >> >> >> In the MPI application example I mentioned before the checkpoint
> > >> >>should
> > >> >> >>be
> > >> >> >> triggered as soon as the “slowest” MPI process reaches the
> > >> >> >>MPI_barrier().
> > >> >> >> The problem is that the “slowest” MPI process usually does not
> > >>reach
> > >> >>the
> > >> >> >> MPI_barrier() right at the end of the current quantum. If we let
> > >>the
> > >> >> >> simulation continue until the quantum completes (to ensure that
> > >>the
> > >> >> >> checkpoint is taken at the same simulated tick in each gem5) then
> > >>the
> > >> >> >>MPI
> > >> >> >> processes will complete the MPI_barrier and start executing the
> > >>ROI
> > >> >>code
> > >> >> >> already.
> > >> >> >>
> > >> >> >> Regarding the integration of multi-threaded/multi-host
> simulation,
> > >> >> >> multi-gem5 does not support fine grain simulation of hierarchical
> > >> >> >>switches
> > >> >> >> (or any other network topologies except a single crossbar) or
> > >> >>multiple
> > >> >> >> synchronization domains currently.
> > >> >> >>
> > >> >> >> However, I'm a bit confused about your statement that you don’t
> > >>see
> > >> >> >>value
> > >> >> >> in ever building a shared-memory transport for MultiIface.
> > >> >>MultiIface in
> > >> >> >> my view is just an abstract interface for “multi-(ether)-link"
> > >> >>objects
> > >> >> >> which are link objects for connecting multiple (i.e. more than
> > >>two)
> > >> >> >> systems. It aims to encapsulate the API necessary for any Link
> > >>object
> > >> >> >> in any multi-system configuration - provided that we partition
> > >>the
> > >> >> >> systems across network links during run time.
> > >> >> >>
> > >> >> >> An orthogonal issue is whether we want to include a simple crossbar
> > >>switch
> > >> >> >> model in a MultiIface implementation or we want to provide a
> > >> >> >>‘standalone'
> > >> >> >> fine grain model for the switch (e.g. the pd-gem5 approach).
> > >> >> >>
> > >> >> >> Thanks,
> > >> >> >> - Gabor
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >> On 7/3/15, 7:33 PM, "Steve Reinhardt" <ste...@gmail.com> wrote:
> > >> >> >>
> > >> >> >> >Thanks Mohammad & Gabor for the responses.
> > >> >> >> >
> > >> >> >> >I think there's still some misunderstanding on what I mean by
> the
> > >> >> >> >integration of multi-threaded and multi-host simulation based
> on
> > >> >> >>Gabor's
> > >> >> >>response above and Andreas's response in the other thread.
> > >> >> >> >
> > >> >> >> >The primary example scenario I'm proposing is as Mohammad
> > >>described:
> > >> >> >> >within
> > >> >> >>each host node, we're simulating an entire rack + top-of-rack
> > >>switch
> > >> >> >>in a
> > >> >> >> >single gem5 process, with separate event queues/threads being
> > >>used
> > >> >>to
> > >> >> >> >parallelize across nodes within the rack. The switch may or may
> > >>not
> > >> >>be
> > >> >> >>on
> > >> >> >> >its own thread as well.  The synchronization among the threads
> > >>only
> > >> >> >>needs
> > >> >> >> >to be at the granularity of the intra-rack network latency.
> > >> >> >> >
> > >> >> >> >Now we want to expand this by using pd-gem5 or multi-gem5 to
> > >> >> >>parallelize
> > >> >> >> >multiple of these rack-level simulations across hosts, so we
> can
> > >> >> >>simulate
> > >> >> >> >a
> > >> >> >> >whole row of a datacenter.  Only the uplinks from the TOR
> > >>switches
> > >> >> >>would
> > >> >> >> >need to go over sockets between processes, and the switch being
> > >> >> >>modeled by
> > >> >> >> >pd-gem5 or multi-gem5 would be the end-of-row switch. The
> > >> >> >>synchronization
> > >> >> >> >delay among the multiple gem5 processes would be based on the
> > >> >> >>inter-rack
> > >> >> >> >latency.
> > >> >> >> >
> > >> >> >> >So the basic question is: Is this feasible with pd-gem5 /
> > >> >>multi-gem5,
> > >> >> >>and
> > >> >> >> >if not, how much work would it take to make it so?
> > >> >> >> >
> > >> >> >> >However, my larger point is that I still don't see value in ever
> > >> >> >>building
> > >> >> >> >a
> > >> >> >> >shared-memory transport for MultiIface. For this model, there
> is
> > >> >> >>clearly
> > >> >> >>>no
> > >> >> >> >need for it. Things get more complicated if we want to do
> > >>something
> > >> >> >>like
> > >> >> >> >have N nodes connected to a single switch and split that over
> two
> > >> >>hosts
> > >> >> >> >(with N/2 nodes simulated on each), but even in that case, I
> > >>think
> > >> >> >>it's a
> > >> >> >> >better idea to make the switch model deal with having half of
> its
> > >> >>links
> > >> >> >> >internal and half external (since we already want the same
> model
> > >>to
> > >> >> >>work
> > >> >> >> >in
> > >> >> >> >both the all-internal and all-external cases). Not that I'm
> > >>worried
> > >> >> >>that
> > >> >> >> >someone is about to go off and build this shared-memory
> > >>transport,
> > >> >>but
> > >> >> >>I
> > >> >> >> >think it's important to reach an understanding here, since it's
> > >> >> >> >fundamental
> > >> >> >> >to defining the strategic relationship between these
> capabilities
> > >> >>going
> > >> >> >> >forward.
> > >> >> >> >
> > >> >> >> >Stepping back a little further, it would be nice to have a model
> > >> >>that
> > >> >> >>is
> > >> >> >> >as
> > >> >> >> >generic as the multi-threading model, where it's really just a
> > >> >>matter
> > >> >> >>of
> > >> >> >> >taking a simulation, partitioning the components among the
> > >>threads,
> > >> >>and
> > >> >> >> >setting the synchronization quantum, and it works. Of course,
> > >>even
> > >> >>with
> > >> >> >> >the
> > >> >> >> >multi-threaded model, if you don't choose your partitioning and
> > >>your
> > >> >> >> >quantum wisely, you're not going to get much speedup or a
> > >> >>deterministic
> > >> >> >> >simulation, but the fundamental implementation is oblivious to
> > >>that.
> > >> >> >>I'm
> > >> >> >> >not saying we really need to go all the way to this
> > >>extreme---it's
> > >> >> >>pretty
> > >> >> >> >reasonable to assume that no one in the near future will want
> to
> > >> >> >>partition
> > >> >> >> >across hosts anywhere other than on a simulated network
> > >>link---but I
> > >> >> >>think
> > >> >> >> >we should keep this ideal in mind as a guiding principle as we
> > >> >>choose
> > >> >> >>how
> > >> >> >> >to go forward from here.
> > >> >> >> >
> > >> >> >> >This ties in to my point #4, which is that if we're really
> > >>building
> > >> >>a
> > >> >> >> >mechanism to partition a simulation across multiple hosts, then
> > >>you
> > >> >> >>should
> > >> >> >> >be able to run the same simulation in a single gem5 process and
> > >>get
> > >> >>the
> > >> >> >> >same results. I think this is the strength of pd-gem5;
> > >> >>correspondingly
> > >> >> >the
> > >> >> >> >main weakness of multi-gem5 is that it architecturally feels
> more
> > >> >>like
> > >> >> >> >tying together a set of mostly independent gem5 simulations
> than
> > >> >>like
> > >> >> >> >partitioning a single gem5 simulation.  (Of course, they both
> end
> > >> >>up at
> > >> >> >> >roughly the same point in the middle.)
> > >> >> >> >
> > >> >> >> >On the flip side, multi-gem5 has some clear advantages in terms
> > >>of
> > >> >>the
> > >> >> >> >better separation of the communication layer (and I can imagine
> > >>it
> > >> >> >> >being
> > >> >> >> >very useful to port to MPI and perhaps some RDMA API for
> > >>InfiniBand
> > >> >> >> >clusters). Also I think the integrated sockets for communication
> > >> >> >> >and synchronization are the superior design; while the separate
> > >> >> >> >sockets
> > >> >> >>used
> > >> >> >> >by
> > >> >> >> >pd-gem5 may only very rarely cause problems, I agree with
> Andreas
> > >> >>that
> > >> >> >> >that's not good enough, and I don't see any real advantage
> > >> >>either---if
> > >> >> >>you
> > >> >> >> >have to flush the data sockets (or wait for them to drain)
> before
> > >> >> >> >synchronizing, then you might as well just have the
> > >>synchronization
> > >> >> >> >messages queue up behind the data messages.
> > >> >> >> >
> > >> >> >> >Regarding unsynchronized checkpoints: Thanks for the example,
> but
> > >> >>I'm
> > >> >> >> >still
> > >> >> >> >a little confused. If all the processes are about to execute an
> > >> >> >> >MPI_Barrier(), doesn't that mean they'll all be synchronized
> > >>shortly
> > >> >> >> >anyway? So what's the harm until waiting until they're
> > >>synchronized
> > >> >>and
> > >> >> >> >then checkpointing?
> > >> >> >> >
> > >> >> >> >Regarding the simulation of non-Ethernet networks: I agree that
> > >>the
> > >> >> >> >biggest
> > >> >> >> >obstacle to this is the lack of generality of the current gem5
> > >> >>network
> > >> >> >> >components. I tried to take a step toward supporting other link
> > >> >>types
> > >> >> >>two
> > >> >> >> >years ago (see http://reviews.gem5.org/r/1922) but someone shot me
> > >> >> >> >down ;).
> > >> >> >> >We shouldn't try and fix that here, but we should also
> > >>consciously
> > >> >>try
> > >> >> >>not
> > >> >> >> >to make it any worse...
> > >> >> >> >
> > >> >> >> >Thanks for reading all the way to the end!
> > >> >> >> >
> > >> >> >> >Steve
> > >> >> >> >
> > >> >> >> >
> > >> >> >> >On Fri, Jul 3, 2015 at 7:11 AM Gabor Dozsa <
> gabor.do...@arm.com>
> > >> >> wrote:
> > >> >> >> >
> > >> >> >> >>Hi all,
> > >> >> >> >>
> > >> >> >> >>Thank you Steve for the thorough review.
> > >> >> >> >>
> > >> >> >> >>First, let me elaborate a bit on Andreas’s 3rd point about
> > >> >> >> >>non-synchronous
> > >> >> >> >>checkpoints. Let’s assume that we aim to simulate MPI
> > >>applications
> > >> >> >>(HPC
> > >> >> >> >>workloads). The ROI in an MPI application typically starts
> > >>with
> > >> >>a
> > >> >> >> >>global MPI_Barrier() call. We want to take the checkpoint when
> > >> >>*every*
> > >> >> >> >>gem5 process has reached that MPI_Barrier() in the simulated code
> code
> > >> >>but
> > >> >> >> >>that
> > >> >> >> >>may not happen at the same tick in each gem5 (due to load
> > >>imbalance
> > >> >> >> >>among
> > >> >> >> >>the simulated nodes). That’s why multi-gem5 implements the
> > >> >> >> >>non-synchronous
> > >> >> >> >>checkpoint support.
> > >> >> >> >>
> > >> >> >> >>My answers to your questions are as follows.
> > >> >> >> >>
> > >> >> >> >>1. The only change necessary to use multi-gem5 with a non
> > >>Ethernet
> > >> >> >> >>(simulated) network is to replace the Ethernet packet type
> with
> > >> >> >>another
> > >> >> >> >>packet type in MultiIface.
> > >> >> >> >>In fact, the first implementation of MultiIface was a template
> > >> >> >> >>that took EthPacketData as parameter because I plan to support
> > >> >> >>different
> > >> >> >> >>network types. When I realized that currently only Ethernet is
> > >> >> >>supported
> > >> >> >> >>by gem5 I dropped the template param to keep the
> implementation
> > >> >> >> >>simpler. I
> > >> >> >> >>have also realized in the meantime that the right approach
> would
> > >> >> >> >>probably
> > >> >> >> >>be to create a pure virtual ‘base' class for network packets
> > >>from
> > >> >> >>which
> > >> >> >> >>Ethernet (and other types of) packets could be derived. Then
> > >> >> >>MultiIface
> > >> >> >> >>could simply use that base class to provide support for
> > >>different
> > >> >> >> >>network
> > >> >> >> >>types. The interface provided by the base packet class could
> be
> > >> >>very
> > >> >> >> >>simple. Besides the total size() of the packet, multi-gem5 only
> > >> >>needs a
> > >> >> >> >>method to ‘extract' the source/destination address. Those
> > >>addresses
> > >> >> >>are
> > >> >> >> >>used in MultiIface as opaque byte arrays so they are quite
> > >>network
> > >> >> >>type
> > >> >> >> >>agnostic already.
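> > >> >> >> >>
> > >> >> >> >>As a rough sketch (illustrative names only, not an actual patch),
> > >> >> >> >>such a base class might expose little more than the size and the
> > >> >> >> >>opaque addresses:
> > >> >> >> >>
> > >> >> >> >>    // Illustrative only: a possible network-type-agnostic packet
> > >> >> >> >>    // base class that MultiIface could use (names are made up).
> > >> >> >> >>    #include <cstddef>
> > >> >> >> >>    #include <cstdint>
> > >> >> >> >>
> > >> >> >> >>    class BaseNetPacket
> > >> >> >> >>    {
> > >> >> >> >>      public:
> > >> >> >> >>        virtual ~BaseNetPacket() {}
> > >> >> >> >>
> > >> >> >> >>        // Total size of the packet in bytes.
> > >> >> >> >>        virtual std::size_t size() const = 0;
> > >> >> >> >>
> > >> >> >> >>        // Source/destination addresses as opaque byte arrays;
> > >> >> >> >>        // MultiIface only forwards/compares them, it never
> > >> >> >> >>        // interprets them.
> > >> >> >> >>        virtual const std::uint8_t *srcAddr() const = 0;
> > >> >> >> >>        virtual const std::uint8_t *dstAddr() const = 0;
> > >> >> >> >>        virtual std::size_t addrLen() const = 0;
> > >> >> >> >>    };
> > >> >> >> >>
> > >> >> >> >>An Ethernet packet class could then derive from it, and MultiIface
> > >> >> >> >>would only ever see the base interface.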
> > >> >> >> >>
> > >> >> >> >>2. That’s right, we have designed the MultiIface/TCPIface
> split
> > >> >>with
> > >> >> >> >>different underlying messaging systems in mind.
> > >> >> >> >>
> > >> >> >> >>3. Multi-gem5 can work together with
> > >> >>multi-threaded/multi-event-queue
> > >> >> >> >>gem5
> > >> >> >> >>configs. The current TCPIface/tcp_server components would
> still
> > >>use
> > >> >> >> >>sockets to send around the packets. So it is possible to put
> > >> >>together
> > >> >> >>a
> > >> >> >> >>multi-gem5 simulation where each gem5 process has multiple
> event
> > >> >> >>queues
> > >> >> >> >>(and an independent simulation thread per event queue) but all
> > >>the
> > >> >> >> >>simulated Ethernet links would use sockets to forward every
> > >> >>Ethernet
> > >> >> >> >>packet to the tcp_server.
> > >> >> >> >>
> > >> >> >> >>If someone wanted to run only a single gem5 process to
> simulate
> > >>an
> > >> >> >> >>entire
> > >> >> >> >>cluster (using one thread/event-queue per cluster node) then
> the
> > >> >> >>current
> > >> >> >> >>multi-gem5 implementation using sockets/tcp_server is not
> > >>optimal.
> > >> >>In
> > >> >> >> >>that
> > >> >> >> >>case,  a better solution would be to provide a shared memory
> > >>based
> > >> >> >> >>implementation of the MultiIface virtual communication methods
> > >> >> >> >>sendRaw()/recvRaw()/syncRaw() (i.e. a shared memory equivalent
> > >>of
> > >> >> >> >>TCPIface). In that implementation, the entire discrete tcp_server
> > >> >> >> >>component could be replaced with a shared data structure.
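> > >> >> >> >>
> > >> >> >> >>As a very rough, in-process stand-in (illustrative only; a real
> > >> >> >> >>version would live in a POSIX shared-memory segment and plug into
> > >> >> >> >>MultiIface), the shared data structure could be as simple as a
> > >> >> >> >>lock-protected queue of raw packets:
> > >> >> >> >>
> > >> >> >> >>    // Toy sketch of a shared packet channel; not gem5 code.
> > >> >> >> >>    #include <condition_variable>
> > >> >> >> >>    #include <cstddef>
> > >> >> >> >>    #include <cstdint>
> > >> >> >> >>    #include <deque>
> > >> >> >> >>    #include <mutex>
> > >> >> >> >>    #include <vector>
> > >> >> >> >>
> > >> >> >> >>    class ShmChannel
> > >> >> >> >>    {
> > >> >> >> >>      public:
> > >> >> >> >>        // Called by the sending simulation thread.
> > >> >> >> >>        void sendRaw(const void *buf, std::size_t len)
> > >> >> >> >>        {
> > >> >> >> >>            const auto *p = static_cast<const std::uint8_t *>(buf);
> > >> >> >> >>            std::lock_guard<std::mutex> lock(mtx);
> > >> >> >> >>            queue.emplace_back(p, p + len);
> > >> >> >> >>            cv.notify_one();
> > >> >> >> >>        }
> > >> >> >> >>
> > >> >> >> >>        // Called by the receiver thread; blocks like the socket
> > >> >> >> >>        // recv in TCPIface does.
> > >> >> >> >>        std::vector<std::uint8_t> recvRaw()
> > >> >> >> >>        {
> > >> >> >> >>            std::unique_lock<std::mutex> lock(mtx);
> > >> >> >> >>            cv.wait(lock, [this] { return !queue.empty(); });
> > >> >> >> >>            auto pkt = std::move(queue.front());
> > >> >> >> >>            queue.pop_front();
> > >> >> >> >>            return pkt;
> > >> >> >> >>        }
> > >> >> >> >>
> > >> >> >> >>      private:
> > >> >> >> >>        std::mutex mtx;
> > >> >> >> >>        std::condition_variable cv;
> > >> >> >> >>        std::deque<std::vector<std::uint8_t>> queue;
> > >> >> >> >>    };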
> > >> >> >> >>
> > >> >> >> >>4. You are right, the current implementation does not make it
> > >> >>possible
> > >> >> >> >>to
> > >> >> >> >>construct an equivalent single-process simulation model for a
> > >> >> >>multi-gem5
> > >> >> >> >>run. However, a possible solution is a shared memory based
> > >> >> >> >>implementation
> > >> >> >> >>of the MultiIface virtual communication methods just as I
> > >> >>described in
> > >> >> >> >>the
> > >> >> >> >>previous paragraph. The same implementation could then work
> with
> > >> >>both
> > >> >> >> >>multi-threaded/multi-event-queues and
> > >> >>single-thread/single-event-queue
> > >> >> >> >>gem5 configs.
> > >> >> >> >>
> > >> >> >> >>Thanks,
> > >> >> >> >>- Gabor
> > >> >> >> >>
> > >> >> >> >>On 7/2/15, 7:20 PM, "Steve Reinhardt" <ste...@gmail.com>
> wrote:
> > >> >> >> >>
> > >> >> >> >>>Hi everyone,
> > >> >> >> >>>
> > >> >> >> >>>Sorry for taking so long to engage. This is a great
> development
> > >> >>and I
> > >> >> >> >>>think
> > >> >> >> >>>both these patches are terrific contributions. Thanks to
> > >>Mohammad,
> > >> >> >> >>Gabor,
> > >> >> >> >>>and everyone else involved.
> > >> >> >> >>>
> > >> >> >> >>>I agree with Andreas that we should start with some top-level
> > >> >>goals &
> > >> >> >> >>>assumptions, agree on those, and then we can sort out the
> > >>detailed
> > >> >> >> >>issues
> > >> >> >> >>>based on a consistent view.
> > >> >> >> >>>
> > >> >> >> >>>I definitely agree with Andreas's first two points. The third
> > >>one
> > >> >> >> >>seems a
> > >> >> >> >>>little surprising; I'd like to hear more about the motivation
> > >> >>before
> > >> >> >> >>>expressing an opinion. I can see where non-synchronous
> > >> >>checkpointing
> > >> >> >> >>could
> > >> >> >> >>>be useful, but it's also clear from the associated patch that
> > >>it's
> > >> >> >>not
> > >> >> >> >>>trivial to implement either. How much would be lost by
> > >>requiring a
> > >> >> >> >>>synchronization before a checkpoint?
> > >> >> >> >>>
> > >> >> >> >>>From my personal perspective, I would like to see whatever we
> > >>do
> > >> >>here
> > >> >> >> >>be a
> > >> >> >> >>>first step toward a more general distributed simulation
> > >>platform.
> > >> >> >>Both
> > >> >> >> >>of
> > >> >> >> >>>these patches seem pretty Ethernet-centric in different ways.
> > >> >>This is
> > >> >> >> >>not
> > >> >> >> >>>terrible; part of the problem is that gem5's current internal
> > >> >> >> >>networking
> > >> >> >> >>>support is already overly Ethernet-centric IMO. But it would
> be
> > >> >>nice
> > >> >> >>to
> > >> >> >> >>>avoid baking that in even further. Rather than assume I have
> > >> >> >>understood
> > >> >> >> >>>all
> > >> >> >> >>>the code completely, I'll phrase things in the form of
> > >>questions,
> > >> >>and
> > >> >> >> >>>people can comment on how those questions would be answered
> in
> > >>the
> > >> >> >> >>context
> > >> >> >> >>>of the two different approaches.
> > >> >> >> >>>
> > >> >> >> >>>1. How much effort would be required to simulate a
> non-Ethernet
> > >> >> >> >>network?
> > >> >> >> >>>My
> > >> >> >> >>>impression is that pd-gem5 has a leg up here, since a gem5
> > >>switch
> > >> >> >>model
> > >> >> >> >>>for
> > >> >> >> >>>a non-Ethernet network (which you'd have to write anyway if
> you
> > >> >>were
> > >> >> >> >>>simulating a different network) could be used in place of the
> > >> >>current
> > >> >> >> >>>Ethernet switch, where for multi-gem5 I think that the
> > >> >> >> >>>util/multi//tcp_server.cc code would have to be modified
> (i.e.,
> > >> >> >> >>there'd be
> > >> >> >> >>>additional work above and beyond what you'd need to get the
> > >> >>network
> > >> >> >> >>>modeled
> > >> >> >> >>>in base gem5).
> > >> >> >> >>>
> > >> >> >> >>>2. How much effort is required to run on a non-Ethernet
> network
> > >> >>(or
> > >> >> >> >>>equivalently using a non-sockets API)?  The
> MultiIface/TCPIface
> > >> >>split
> > >> >> >> >>in
> > >> >> >> >>>the multi-gem5 code looks like it addresses this nicely, but
> > >> >>pd-gem5
> > >> >> >> >>seems
> > >> >> >> >>>pretty tied to an Ethernet host fabric.
> > >> >> >> >>>
> > >> >> >> >>>3. Do both of these patches work with the existing
> > >>multithreaded
> > >> >> >> >>>multiple-event-queue simulation? I think multi-gem5 does
> > >>(though
> > >> >>it
> > >> >> >> >>would
> > >> >> >> >>>be nice to have a confirmation), but it's not clear about
> > >> >>pd-gem5. I
> > >> >> >> >>don't
> > >> >> >> >>>see a benefit to having multiple gem5 processes on a single
> > >>host
> > >> >>vs.
> > >> >> >>a
> > >> >> >> >>>single multithreaded gem5 process using the existing
> support. I
> > >> >>think
> > >> >> >> >>this
> > >> >> >> >>>could be particularly valuable with a hierarchical network;
> > >>e.g.,
> > >> >> >> >>maybe I
> > >> >> >> >>>would want to model a rack in multithreaded mode on a single
> > >> >> >>multicore
> > >> >> >> >>>server, then use pd-gem5 or multi-gem5 to build up a
> > >>simulation of
> > >> >> >> >>>multiple
> > >> >> >> >>>racks. Would this work out of the box with either of these
> > >> >>patches,
> > >> >> >> >>and if
> > >> >> >> >>>not, what would need to be done?
> > >> >> >> >>>
> > >> >> >> >>>4. Is it possible to construct a single-process simulation
> > >>model
> > >> >> >>that's
> > >> >> >> >>>identical to the distributed simulation? It would be very
> > >>valuable
> > >> >> >>for
> > >> >> >> >>>verification to be able to take a single simulation run and
> do
> > >>it
> > >> >> >>both
> > >> >> >> >>>within a single process and also across multiple processes
> and
> > >> >>verify
> > >> >> >> >>that
> > >> >> >> >>>identical results are achieved. This seems like a big
> drawback
> > >>to
> > >> >>the
> > >> >> >> >>>multi-gem5 tcp_server approach, IMO.
> > >> >> >> >>>
> > >> >> >> >>>I'm definitely not saying that all these issues need to be
> > >> >>resolved
> > >> >> >> >>before
> > >> >> >> >>>anything gets committed, but if we can agree that these are
> > >>valid
> > >> >> >> >>goals,
> > >> >> >> >>>then we can evaluate detailed issues based on whether they
> > >>move us
> > >> >> >> >>toward
> > >> >> >> >>>or away from those goals.
> > >> >> >> >>>
> > >> >> >> >>>Thanks,
> > >> >> >> >>>
> > >> >> >> >>>Steve
> > >> >> >> >>>
> > >> >> >> >>>
> > >> >> >> >>>On Thu, Jul 2, 2015 at 8:34 AM Andreas Hansson
> > >> >> >> >><andreas.hans...@arm.com>
> > >> >> >> >>>wrote:
> > >> >> >> >>>
> > >> >> >> >>>>Hi all,
> > >> >> >> >>>>
> > >> >> >> >>>>I think we need to up-level this a bit. From our perspective
> > >> >>(and I
> > >> >> >> >>>>suspect in general):
> > >> >> >> >>>>
> > >> >> >> >>>>1. Robustness is important. Having a design that _may_
> break,
> > >> >> >>however
> > >> >> >> >>>>unlikely is simply not an option.
> > >> >> >> >>>>
> > >> >> >> >>>>2. Performance and scaling is important. We can compare
> actual
> > >> >> >>numbers
> > >> >> >> >>>>here, and I am fairly sure the two solutions are on par.
> Let’s
> > >> >> >> >>quantify
> > >> >> >> >>>>that though.
> > >> >> >> >>>>
> > >> >> >> >>>>3. Checkpointing must not rely on synchronicity. It is vital
> > >>for
> > >> >> >> >>several
> > >> >> >> >>>>workloads that we can checkpoint the various gem5 instances
> at
> > >> >> >> >>different
> > >> >> >> >>>>Ticks (due to the way the workloads are constructed).
> > >> >> >> >>>>
> > >> >> >> >>>>Andreas
> > >> >> >> >>>>
> > >> >> >> >>>>On 01/07/2015 21:41, "gem5-dev on behalf of Mohammad Alian"
> > >> >> >> >>>><gem5-dev-boun...@gem5.org on behalf of al...@wisc.edu>
> > wrote:
> > >> >> >> >>>>
> > >> >> >> >>>>>Thanks Gabor for the reply.
> > >> >> >> >>>>>
> > >> >> >> >>>>>I feel this conversation is useful as we can find out
> > >>pros/cons
> > >> >>of
> > >> >> >> >>each
> > >> >> >> >>>>>design.
> > >> >> >> >>>>>Please find my response in-lined below.
> > >> >> >> >>>>>
> > >> >> >> >>>>>Thank you,
> > >> >> >> >>>>>Mohammad
> > >> >> >> >>>>>
> > >> >> >> >>>>>On Wed, Jul 1, 2015 at 6:44 AM, Gabor Dozsa
> > >> >><gabor.do...@arm.com>
> > >> >> >> >>>>wrote:
> > >> >> >> >>>>>
> > >> >> >> >>>>>>Hi All,
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>Sorry for the missing indentation in my previous e-mail!
> > >>(This
> > >> >>was
> > >> >> >> >>my
> > >> >> >> >>>>>>first e-mail to the dev-list so I could not simply use
> > >> >>“reply").
> > >> >> >> >>>>Below
> > >> >> >> >>>>>>is
> > >> >> >> >>>>>>the same message, hopefully in more readable form.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>====================================
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>Hi  All,
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>Thank you Mohammad for your elaboration on the issues!
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>I have written most of the multi-gem5 patch so let me add
> > >>some
> > >> >> >>more
> > >> >> >> >>>>>>clarifications  and answer to your concerns. My comments
> are
> > >> >> >>inline
> > >> >> >> >>>>>>below.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>Thanks,
> > >> >> >> >>>>>>- Gabor
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>On 6/27/15, 10:20 AM, "Mohammad Alian" <al...@wisc.edu>
> > >>wrote:
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>>Hi All,
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>Curtis-Thank you for listing some of the differences. I
> was
> > >> >> >> >>waiting
> > >> >> >> >>>>for
> > >> >> >> >>>>>>>the
> > >> >> >> >>>>>>>completed multi-gem5 patch before I send my review.
> Please
> > >> >>see my
> > >> >> >> >>>>>>inline
> > >> >> >> >>>>>>>response below. I¹ve addressed the concerns that you¹ve
> > >> >>raised.
> > >> >> >> >>>>Also,
> > >> >> >> >>>>>>I¹ve
> > >> >> >> >>>>>>>added a bit more to the comparison.
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>-*  Synchronization.
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>pd-gem5 implements this in Python (not a problem in
> itself;
> > >> >> >> >>>>>>aesthetically
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>this is nice, but...).  The issue is that pd-gem5's data
> > >> >>packets
> > >> >> >> >>and
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>barrier messages travel over different sockets.  Since
> > >>pd-gem5
> > >> >> >> >>could
> > >> >> >> >>>>>>see
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>data packets passing synchronization barriers, it could
> > >> >>create an
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>inconsistent checkpoint.
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>multi-gem5's synchronization is implemented in C++ using
> > >>sync
> > >> >> >> >>>>events,
> > >> >> >> >>>>>>but
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>more importantly, the messages queue up in the same
> stream
> > >> >>and so
> > >> >> >> >>>>>>cannot
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>have the issue just described.  (Event ordering is often
> > >> >>crucial
> > >> >> >> >>in
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>snapshot protocols.) Therefore we feel that multi-gem5
> is a
> > >> >>more
> > >> >> >> >>>>robust
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>solution in this respect.
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>Each packet in pd-gem5 has a time-stamp. So even if data
> > >> >> >> >>>>>>>packets pass synchronization barriers (in other words, data
> > >> >> >> >>>>>>>packets arrive early at the destination node), the destination
> > >> >> >> >>>>>>>node processes packets based on their timestamps. Actually,
> > >> >> >> >>>>>>>allowing data packets to pass sync barriers is a nice feature
> > >> >> >> >>>>>>>that can reduce the likelihood of late packet reception.
> > >> >> >> >>>>>>>Ordering of data messages that flow between pd-gem5 nodes is
> > >> >> >> >>>>>>>also preserved in the pd-gem5 implementation.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>This seems to be a misunderstanding. Maybe the wording was
> > >>not
> > >> >> >> >>>>precise
> > >> >> >> >>>>>>before. The problem is not a data packet “passing” a
> > >>sync
> > >> >> >> >>barrier
> > >> >> >> >>>>>>but the other way around, a sync barrier that can pass a
> > >>data
> > >> >> >> >>packet
> > >> >> >> >>>>>>(e.g. while the data packet is waiting in the host
> operating
> > >> >> >>system
> > >> >> >> >>>>>>socket layer).  If that happens, the packet will arrive
> > >>later
> > >> >>than
> > >> >> >> >>it
> > >> >> >> >>>>>>was
> > >> >> >> >>>>>>supposed to and it may miss the computed receive tick.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>For instance, let’s assume that the quantum coincides with
> > >>the
> > >> >> >> >>>>simulated
> > >> >> >> >>>>>>Ether link delay. (This is the optimal choice of quantum
> to
> > >> >> >> >>minimize
> > >> >> >> >>>>the
> > >> >> >> >>>>>>number of sync barriers.)  If a data packet is sent right
> at
> > >> >>the
> > >> >> >> >>>>>>beginning
> > >> >> >> >>>>>>of a quantum then this packet must arrive at the
> destination
> > >> >>gem5
> > >> >> >> >>>>>>process
> > >> >> >> >>>>>>within the same quantum in order not to miss its receive
> > >>tick
> > >> >>at
> > >> >> >> >>the
> > >> >> >> >>>>>>very
> > >> >> >> >>>>>>beginning of the next quantum. If the sync barrier can
> pass
> > >>the
> > >> >> >> >>data
> > >> >> >> >>>>>>packet
> > >> >> >> >>>>>>then the data packet may arrive only during the next
> quantum
> > >> >>(or
> > >> >> >> >>in
> > >> >> >> >>>>>>extreme conditions even later than that) so when it
> arrives
> > >>the
> > >> >> >> >>>>receiver
> > >> >> >> >>>>>>gem5 may have already passed the receive tick.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>This argument makes more sense than the previous one. Note that
> > >> >> >> >>>>>gem5 is a cycle accurate simulator and it runs orders of
> > >> >> >> >>>>>magnitude slower than real hardware. So it's almost impossible
> > >> >> >> >>>>>that the flight time of a packet through the real network turns
> > >> >> >> >>>>>out to be more than the simulation time of one quantum. We ran a
> > >> >> >> >>>>>set of experiments just for this purpose: with quantum size
> > >> >> >> >>>>>equal to the etherlink delay, we never got any late arrival
> > >> >> >> >>>>>violation (what you described) for the full NAS benchmark suite
> > >> >> >> >>>>>(please refer to the paper).
> > >> >> >> >>>>>
> > >> >> >> >>>>>multi-gem5 is optimized for a case that almost never happens,
> > >> >> >> >>>>>sacrificing speedup for no gain.
> > >> >> >> >>>>>
> > >> >> >> >>>>>
> > >> >> >> >>>>>>Time-stamping does help with this issue. Also, if a data
> > >> >>packet is
> > >> >> >> >>>>>>waiting
> > >> >> >> >>>>>>in the host operating system socket layer when the
> > >>simulation
> > >> >> >> >>thread
> > >> >> >> >>>>>>exits
> > >> >> >> >>>>>>to python to complete the next sync barrier  then the
> packet
> > >> >>will
> > >> >> >> >>>>not go
> > >> >> >> >>>>>>into the checkpoint that may follow that sync barrier.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>That's a good point. The current pd-gem5 checkpointing mechanism
> > >> >> >> >>>>>might miss packets that have been sent during the previous
> > >> >> >> >>>>>quantum and are waiting in the OS socket buffer. I should add
> > >> >> >> >>>>>some code inside the ethertap serialization function to drain
> > >> >> >> >>>>>the ethertap socket before writing the checkpoint. I will update
> > >> >> >> >>>>>the pd-gem5 patch accordingly.
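> > >> >> >> >>>>>
> > >> >> >> >>>>>A rough sketch of what that drain could look like (illustrative
> > >> >> >> >>>>>only, not the actual patch): read the tap socket with
> > >> >> >> >>>>>non-blocking calls until it is empty and keep whatever was still
> > >> >> >> >>>>>buffered so it can be serialized:
> > >> >> >> >>>>>
> > >> >> >> >>>>>    // Illustrative drain of a tap socket before checkpointing.
> > >> >> >> >>>>>    #include <sys/socket.h>
> > >> >> >> >>>>>    #include <vector>
> > >> >> >> >>>>>
> > >> >> >> >>>>>    static void
> > >> >> >> >>>>>    drainTapSocket(int fd, std::vector<std::vector<char>> &pending)
> > >> >> >> >>>>>    {
> > >> >> >> >>>>>        char buf[65536];
> > >> >> >> >>>>>        for (;;) {
> > >> >> >> >>>>>            ssize_t n = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
> > >> >> >> >>>>>            if (n > 0) {
> > >> >> >> >>>>>                // Buffer the frame so it can be written into
> > >> >> >> >>>>>                // the checkpoint instead of being lost.
> > >> >> >> >>>>>                pending.emplace_back(buf, buf + n);
> > >> >> >> >>>>>            } else {
> > >> >> >> >>>>>                // Empty (EAGAIN/EWOULDBLOCK), closed, or error:
> > >> >> >> >>>>>                // nothing more to drain here.
> > >> >> >> >>>>>                break;
> > >> >> >> >>>>>            }
> > >> >> >> >>>>>        }
> > >> >> >> >>>>>    }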
> > >> >> >> >>>>>
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>>What you mentioned as an advantage for multi-gem5 is
> > >>actually
> > >> >>a
> > >> >> >> >>key
> > >> >> >> >>>>>>>disadvantage: buffering sync messages behind data packets
> > >>can
> > >> >>add
> > >> >> >> >>>>up to
> > >> >> >> >>>>>>>the
> > >> >> >> >>>>>>>synchronization overhead and slow down simulation
> > >> >>significantly.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>The purpose of sync messages is to make sure that the data
> > >> >>packets
> > >> >> >> >>>>>>arrive
> > >> >> >> >>>>>>in time (in terms of simulated time) at the destination so
> > >>they
> > >> >> >>can
> > >> >> >> >>>>be
> > >> >> >> >>>>>>scheduled for being received at the proper computed tick.
> > >>Sync
> > >> >> >> >>>>messages
> > >> >> >> >>>>>>also make sure that no data packets are in flight when a
> > >>sync
> > >> >> >> >>barrier
> > >> >> >> >>>>>>completes before we take a checkpoint.  They definitely
> add
> > >> >> >> >>overhead
> > >> >> >> >>>>for
> > >> >> >> >>>>>>the simulation but they are necessary for the correctness
> of
> > >> >>the
> > >> >> >> >>>>>>simulation.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>The receive thread in multi-gem5 reads out packets from
> the
> > >> >>socket
> > >> >> >> >>in
> > >> >> >> >>>>>>parallel with the simulation thread so packets normally
> will
> > >> >>not
> > >> >> >>be
> > >> >> >> >>>>>>"queueing up” before a sync barrier message.  There is
> > >> >>definitely
> > >> >> >> >>>>room
> > >> >> >> >>>>>>for improvements in the current implementation for
> reducing
> > >>the
> > >> >> >> >>>>>>synchronization overhead but that is likely true for
> > >>pd-gem5,
> > >> >>too.
> > >> >> >> >>>>>>The important thing here is that the solution must provide
> > >> >> >> >>>>correctness
> > >> >> >> >>>>>>(robustness) first.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>pd-gem5 provides correctness. Please read my previous
> > >>comment.
> > >> >>The
> > >> >> >> >>>>whole
> > >> >> >> >>>>>purpose of multi/pd-gem5 is to parallelize simulation with
> > >> >>minimal
> > >> >> >> >>>>>overhead
> > >> >> >> >>>>>and gain speedup. If you fail to do so, nobody will use
> your
> > >> >>tool.
> > >> >> >> >>>>>
> > >> >> >> >>>>>
> > >> >> >> >>>>>>>Also,
> > >> >> >> >>>>>>>multi-gem5 sends huge sized messages (multiHeaderPkt)
> > >>through
> > >> >> >> >>>>network to
> > >> >> >> >>>>>>>perform each synchronization point, which increases
> > >> >> >> >>synchronization
> > >> >> >> >>>>>>>overhead further. In pd-gem5, we choose to send just one
> > >> >> >>character
> > >> >> >> >>>>as
> > >> >> >> >>>>>>sync
> > >> >> >> >>>>>>>message through a separate socket to reduce
> synchronization
> > >> >> >> >>>>overhead.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>The TCP/IP message size is unlikely the bottleneck here.
> > >> >> >>Multi-gem5
> > >> >> >> >>>>will
> > >> >> >> >>>>>>send ~50 bytes more in a sync barrier message than pd-gem5
> > >>but
> > >> >> >>that
> > >> >> >> >>>>>>bigger
> > >> >> >> >>>>>>sync message still fits into a single ethernet frame on
> the
> > >> >>wire.
> > >> >> >> >>The
> > >> >> >> >>>>>>end-to-end latency overhead that is caused by 50 bytes
> extra
> > >> >> >> >>payload
> > >> >> >> >>>>for
> > >> >> >> >>>>>>a small single frame TCP/IP message is likely to fall into
> > >>the
> > >> >> >> >>>>“noise"
> > >> >> >> >>>>>>category if one tries to measure it in a real cluster.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>You should prove your hypothesis experimentally. Each gem5
> > >> >> >> >>>>>process sends/receives sync messages at the end of every
> > >> >> >> >>>>>quantum. Say you are simulating an "N" node computer cluster
> > >> >> >> >>>>>with "M" different configurations. Then you will have N*M gem5
> > >> >> >> >>>>>processes that send/receive these 50 bytes (I think it's more)
> > >> >> >> >>>>>of extra data at the same time over the network ...
> > >> >> >> >>>>>
> > >> >> >> >>>>>Furthermore, multi-gem5 sends a header before each data message.
> > >> >> >> >>>>>Compared with pd-gem5, pd-gem5 just adds 12 bytes (each
> > >> >> >> >>>>>time-stamp is the 12 least significant digits of the Tick) to
> > >> >> >> >>>>>each data packet. I don't know exactly how large these
> > >> >> >> >>>>>"MultiHeaderPkt"s are, but each one has two Tick fields that are
> > >> >> >> >>>>>64 bytes each! Also, header packets are separate TCP packets, so
> > >> >> >> >>>>>you pay for sending two separate packets for each data packet.
> > >> >> >> >>>>>And worst of all, you serialize all of these with the sync
> > >> >> >> >>>>>messages.
> > >> >> >> >>>>>
> > >> >> >> >>>>>
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>*  Packet handling.
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>pd-gem5 uses EtherTap for data packets but changed the
> > >>polling
> > >> >> >> >>>>>>mechanism
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>to go through the main event queue.  Since this rate is
> > >> >>actually
> > >> >> >> >>>>linked
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>with simulator progress, it cannot guarantee that the
> > >>packets
> > >> >>are
> > >> >> >> >>>>>>>serviced
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>at regular intervals of real time.  This can lead to
> > >>packets
> > >> >> >> >>>>queueing
> > >> >> >> >>>>>>up
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>which would contribute to the synchronization issues
> > >>mentioned
> > >> >> >> >>>>above.
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>multi-gem5 uses plain sockets with separate receive
> threads
> > >> >>and
> > >> >> >>so
> > >> >> >> >>>>does
> > >> >> >> >>>>>>>not
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>have this issue.
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>I think again you are pointing to your first concern that
> > >>I¹ve
> > >> >> >> >>>>>>explained
> > >> >> >> >>>>>>>above. Packets that have queued up in EtherTap socket,
> > >>will be
> > >> >> >> >>>>>>processed
> > >> >> >> >>>>>>>and delivered to simulation environment at the beginning
> of
> > >> >>next
> > >> >> >> >>>>>>>simulation
> > >> >> >> >>>>>>>quantum.
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>Please notice that multi-gem5 introduces a new simObjects
> > >>to
> > >> >> >> >>>>interface
> > >> >> >> >>>>>>>simulation environment to real world which is redundant.
> > >>This
> > >> >> >> >>>>>>>functionality
> > >> >> >> >>>>>>>is already there by EtherTap.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>Except that the EtherTap solution does not provide a
> correct
> > >> >> >> >>(robust)
> > >> >> >> >>>>>>solution for the synchronization problem.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>Please read my first/second comments.
> > >> >> >> >>>>>
> > >> >> >> >>>>>
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>* Checkpoint accuracy.
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>A user would like to have a checkpoint at precisely the
> > >>time
> > >> >>the
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>'m5 checkpoint' operation is executed so as to not miss
> > >>any of
> > >> >> >>the
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>area of interest in his application.
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>pd-gem5 requires that simulation finish the current
> quantum
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>before checkpointing, so it cannot provide this.
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>(Shortening the quantum can help, but usually the
> snapshot
> > >>is
> > >> >> >> >>being
> > >> >> >> >>>>>>taken
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>while 'fast-forwarding', i.e. simulating as fast as
> > >>possible,
> > >> >> >> >>which
> > >> >> >> >>>>>>would
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>motivate a longer quantum.)
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>multi-gem5 can enter the drain cycle immediately upon
> > >> >>receiving a
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>checkpoint request.  We find this accuracy highly
> > >>desirable.
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>It’s true that if you have a large quantum size then
> there
> > >> >>would
> > >> >> >> >>be
> > >> >> >> >>>>>>some
> > >> >> >> >>>>>>>discrepancy between the m5_ckpt instruction tick and the
> > >> >>actual
> > >> >> >> >>dump
> > >> >> >> >>>>>>tick.
> > >> >> >> >>>>>>>Based on multi-gem5 code, my understanding is that you
> send
> > >> >>async
> > >> >> >> >>>>>>>checkpoint message as soon as one of the gem5 processes
> > >> >>encounter
> > >> >> >> >>>>>>m5_ckpt
> > >> >> >> >>>>>>>instruction. But I’m not sure how you fix the
> > >>aforementioned
> > >> >> >> >>issue,
> > >> >> >> >>>>>>>because
> > >> >> >> >>>>>>>you have to sync all gem5 processes before you start
> > >>dumping
> > >> >> >> >>>>>>checkpoint,
> > >> >> >> >>>>>>>which necessitates a global synchronization beforehand.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>In multi-gem5, the gem5 process who encounters the m5_ckpt
> > >> >> >> >>>>instruction
> > >> >> >> >>>>>>sends out an async checkpoint notification for the peer
> gem5
> > >> >> >> >>>>processes
> > >> >> >> >>>>>>and
> > >> >> >> >>>>>>then it starts the draining immediately (at the same
> tick).
> > >> So
> > >> >> >>the
> > >> >> >> >>>>>>checkpoint will be taken at the exact tick form the
> > >>initiator
> > >> >> >> >>process
> > >> >> >> >>>>>>point of view. The global synchronisation with the peer
> > >> >>processes
> > >> >> >> >>>>takes
> > >> >> >> >>>>>>place while the initiator process is still waiting at the
> > >>same
> > >> >> >>tick
> > >> >> >> >>>>(i.e
> > >> >> >> >>>>>>the simulation thread is suspended). However,  the
> receiver
> > >> >>thread
> > >> >> >> >>>>>>Continues reading out the socket - while waiting for the
> > >>global
> > >> >> >> >>sync
> > >> >> >> >>>>to
> > >> >> >> >>>>>>complete- to make sure that in-flight data packets from
> peer
> > >> >>gem5
> > >> >> >> >>>>>>processes
> > >> >> >> >>>>>>are stored properly and saved into the checkpoint.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>
> > >> >> >> >>>>>So you mean multi-gem5 ends up having gem5 processes
> > >>with
> > >> >> >> >>>>different
> > >> >> >> >>>>>ticks after checkpoint? In pd-gem5 we make sure that all
> gem5
> > >> >> >> >>processes
> > >> >> >> >>>>>start dumping checkpoint at the same tick. Are you sure
> that
> > >> >>this
> > >> >> >>is
> > >> >> >> >>>>>correct to have each gem5 process dump checkpoint at
> > >>different
> > >> >> >> >>ticks???
> > >> >> >> >>>>>
> > >> >> >> >>>>>I don't think this is a correct checkpointing design. However,
> > >>if
> > >> >>you
> > >> >> >> >>>>feel it
> > >> >> >> >>>>>is correct, I can change a couple of lines in
> "Simulation.py"
> > >> >>and
> > >> >> >> >>>>barrier
> > >> >> >> >>>>>scripts to implement the same functionality in pd-gem5. One
> > >> >>thing
> > >> >> >> >>that
> > >> >> >> >>>>you
> > >> >> >> >>>>>are obsessed about is to make sure that there is no
> in-flight
> > >> >> >>packets
> > >> >> >> >>>>>while
> > >> >> >> >>>>>we start dumping checkpoint, and you have all these complex
> > >> >> >> >>mechanisms
> > >> >> >> >>>>in
> > >> >> >> >>>>>place to ensure that! I think you can 99.99999% make sure
> > >>that
> > >> >> >>there
> > >> >> >> >>>>is no
> > >> >> >> >>>>>in-flight packet by waiting for 1 second after all gem5
> > >> >>processes
> > >> >> >> >>>>finished
> > >> >> >> >>>>>their quantum simulation and then dump checkpoint. Do you
> > >>really
> > >> >> >> >>think
> > >> >> >> >>>>>that
> > >> >> >> >>>>>delivering a tcp packet would take more than 1 second in
> > >>today's
> > >> >> >> >>>>systems!?
> > >> >> >> >>>>>Always go for simple solutions ...
> > >> >> >> >>>>>
> > >> >> >> >>>>>
> > >> >> >> >>>>>
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>By the way, we have a fix for this issue by introducing a
> > >>new
> > >> >>m5
> > >> >> >> >>>>pseudo
> > >> >> >> >>>>>>>instruction.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>I fail to see how a new pseudo instruction can solve the
> > >> >>problem
> > >> >> >>of
> > >> >> >> >>>>>>completing the full quantum in pd-gem5 before a checkpoint
> > >>can
> > >> >>be
> > >> >> >> >>>>taken.
> > >> >> >> >>>>>>Could you please elaborate on that?
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>As we take checkpoint while fast-forwarding and it is
> likely
> > >> >>that
> > >> >> >> >>we
> > >> >> >> >>>>>>relax
> > >> >> >> >>>>>synchronization for speedup purpose, a new pseudo
> instruction
> > >> >>that
> > >> >> >> >>can
> > >> >> >> >>>>set
> > >> >> >> >>>>>quantum size (m5_qset) can be helpful. So, one can insert
> > >> >>m5_qset
> > >> >> >>in
> > >> >> >> >>>>his
> > >> >> >> >>>>>benchmark source code before entering ROI that contains
> > >>m5_ckpt
> > >> >>to
> > >> >> >> >>>>>decrease
> > >> >> >> >>>>>quantum size beforehand and reduce the discrepancy between
> > >> >>m5_ckpt
> > >> >> >> >>tick
> > >> >> >> >>>>>and
> > >> >> >> >>>>>actual checkpoint tick. This is not included in pd-gem5
> patch
> > >> >>right
> > >> >> >> >>>>now.
> > >> >> >> >>>>>
> > >> >> >> >>>>>
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>* Implementation of network topology.
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>pd-gem5 uses a separate gem5 process to act as a switch
> > >> >>whereas
> > >> >> >> >>>>>>multi-gem5
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>uses a standalone packet relay process.
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>We haven't measured the overhead of pd-gem5's simulated
> > >>switch
> > >> >> >> >>yet,
> > >> >> >> >>>>but
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>we're confident that our approach is at least as fast and
> > >>more
> > >> >> >> >>>>>>scalable.
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>There is this flexibility in pd-gem5 to simulate a switch
> > >>box
> > >> >> >> >>>>alongside
> > >> >> >> >>>>>>>one
> > >> >> >> >>>>>>>of the other gem5 processes. However, it might make that
> > >>gem5
> > >> >> >> >>>>process
> > >> >> >> >>>>>>the
> > >> >> >> >>>>>>>simulation bottleneck. One of the advantages of pd-gem5
> > >>over
> > >> >> >> >>>>>>multi-gem5 is
> > >> >> >> >>>>>>>that we use gem5 to simulate a switch box, which allows
> us
> > >>to
> > >> >> >> >>model
> > >> >> >> >>>>any
> > >> >> >> >>>>>>>network topology by instantiating several Switch
> simObjects
> > >> >>and
> > >> >> >> >>>>>>>interconnecting them with EtherLink in an arbitrary
> fashion. A
> > >> >> >> >>>>standalone
> > >> >> >> >>>>>>tcp
> > >> >> >> >>>>>>>server can only provide switch functionality (forwarding
> > >> >>packets
> > >> >> >> >>to
> > >> >> >> >>>>>>>destinations) and model a star network topology.
> > >>Furthermore,
> > >> >>it
> > >> >> >> >>>>cannot
> > >> >> >> >>>>>>>model various network timings such as queueing delay,
> > >> >>congestion,
> > >> >> >> >>>>and
> > >> >> >> >>>>>>>routing latency. Also it has some accuracy issues that I
> > >>will
> > >> >> >> >>point
> > >> >> >> >>>>out
> > >> >> >> >>>>>>>next.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>I agree with the complex topology argument. We already
> > >> >>mentioned
> > >> >> >> >>that
> > >> >> >> >>>>>>before as an advantage for pd-gem5 from the point of view
> of
> > >> >> >>future
> > >> >> >> >>>>>>extensions. However, I do not agree that multi-gem5 cannot
> > >> >>model
> > >> >> >> >>>>>>queueing
> > >> >> >> >>>>>>delays and congestions. For a simple crossbar switch, it
> can
> > >> >>model
> > >> >> >> >>>>>>queueing
> > >> >> >> >>>>>>delays and congestions, but the receive queues are
> > >>distributed
> > >> >> >> >>among
> > >> >> >> >>>>the
> > >> >> >> >>>>>>gem5 processes.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>It's true that you can model the queueing delay of a simple crossbar by
> > >> >> >> >>>>>distributing queues across the gem5 processes (end points). But to be
> > >> >> >> >>>>>able to do so you have to ensure the ordering of the packets that you
> > >> >> >> >>>>>enqueue in the distributed queues, which is almost impossible without a
> > >> >> >> >>>>>synchronized switch box. You would also need a reorder queue that
> > >> >> >> >>>>>reorders packets dynamically and updates the timing parameters of each
> > >> >> >> >>>>>packet. I don't know how much progress you have made on an ordering
> > >> >> >> >>>>>scheme in multi-gem5, but you may already have realized how complex and
> > >> >> >> >>>>>error-prone it can be. This argument is also related to my next point
> > >> >> >> >>>>>about "Broken network timing".
> > >> >> >> >>>>>
> > >> >> >> >>>>>
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>* Broken network timing:
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>Forwarding packets between gem5 processes through a standalone TCP
> > >> >> >> >>>>>>>server can reorder packets that have different sources but the same
> > >> >> >> >>>>>>>destination. This causes inaccurate network timing and, worst of all,
> > >> >> >> >>>>>>>non-deterministic simulation. pd-gem5 resolves this by reordering
> > >> >> >> >>>>>>>packets at the Switch process before sending them on to their
> > >> >> >> >>>>>>>destinations (this is possible because the switch is synchronized with
> > >> >> >> >>>>>>>the rest of the nodes).
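> > >> >> >> >>>>>>>As a rough illustration of the switch-side mechanism (plain Python,
> > >> >> >> >>>>>>>not actual pd-gem5 code; the field names are assumptions), the switch
> > >> >> >> >>>>>>>can buffer incoming packets and, because it advances in lock-step with
> > >> >> >> >>>>>>>the nodes, release everything up to the current sync barrier in
> > >> >> >> >>>>>>>timestamp order:
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>    import heapq
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>    class SwitchReorderBuffer:
> > >> >> >> >>>>>>>        """Forward packets in send-tick order; ties between senders
> > >> >> >> >>>>>>>        are broken by rank so every run produces the same order."""
> > >> >> >> >>>>>>>        def __init__(self):
> > >> >> >> >>>>>>>            self._heap = []
> > >> >> >> >>>>>>>            self._seq = 0          # keeps heap entries comparable
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>        def enqueue(self, send_tick, sender_rank, pkt):
> > >> >> >> >>>>>>>            heapq.heappush(self._heap,
> > >> >> >> >>>>>>>                           (send_tick, sender_rank, self._seq, pkt))
> > >> >> >> >>>>>>>            self._seq += 1
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>        def release_up_to(self, barrier_tick):
> > >> >> >> >>>>>>>            # Safe only because the barrier guarantees that no packet
> > >> >> >> >>>>>>>            # with an earlier send tick can still be in flight.
> > >> >> >> >>>>>>>            out = []
> > >> >> >> >>>>>>>            while self._heap and self._heap[0][0] <= barrier_tick:
> > >> >> >> >>>>>>>                out.append(heapq.heappop(self._heap)[3])
> > >> >> >> >>>>>>>            return out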
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>In multi-gem5, there is always a HeaderPkt that carries some meta
> > >> >> >> >>>>>>information for each data packet. The meta information includes the
> > >> >> >> >>>>>>send tick and the sender rank (i.e. a unique ID of the sending gem5
> > >> >> >> >>>>>>process). We use that information to define a well-defined ordering of
> > >> >> >> >>>>>>packets even if packets arrive at the same receiver from different
> > >> >> >> >>>>>>senders. This packet-ordering scheme is still being tested, so the
> > >> >> >> >>>>>>corresponding patch is not on the RB yet.
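> > >> >> >> >>>>>>Roughly speaking (illustrative Python only, not the actual multi-gem5
> > >> >> >> >>>>>>code, and the field names are placeholders), the ordering key is just
> > >> >> >> >>>>>>(send tick, sender rank), so any receiver that sees the same set of
> > >> >> >> >>>>>>packets in a quantum sorts them into the same order regardless of the
> > >> >> >> >>>>>>order in which TCP happened to deliver them:
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>    from dataclasses import dataclass, field
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>    @dataclass(order=True)
> > >> >> >> >>>>>>    class HeaderPkt:
> > >> >> >> >>>>>>        send_tick: int                         # primary ordering key
> > >> >> >> >>>>>>        sender_rank: int                       # deterministic tie-break
> > >> >> >> >>>>>>        payload: bytes = field(compare=False)  # the data packet itself
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>    def deterministic_order(received):
> > >> >> >> >>>>>>        # Sort whatever arrived during this quantum by the header key,
> > >> >> >> >>>>>>        # independent of socket arrival order.
> > >> >> >> >>>>>>        return sorted(received)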
> > >> >> >> >>>>>>
> > >> >> >> >>>>>Please read my previous comment. The most important part of the
> > >> >> >> >>>>>multi/pd-gem5 extension is ensuring accurate and deterministic
> > >> >> >> >>>>>simulation.
> > >> >> >> >>>>>
> > >> >> >> >>>>>
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>* Amount of changes
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>pd-gem5 introduces different modes in EtherLink precisely to provide
> > >> >> >> >>>>>>>accurate timing for each component in the network subsystem (NIC, link,
> > >> >> >> >>>>>>>switch) as well as the capability of modeling different network
> > >> >> >> >>>>>>>topologies (mesh, ring, fat tree, etc.). To enable simple functionality
> > >> >> >> >>>>>>>like what multi-gem5 provides, the changes to gem5 can be limited to
> > >> >> >> >>>>>>>time-stamping packets and providing synchronization through Python
> > >> >> >> >>>>>>>scripts. multi-gem5, however, re-implements functionality that is
> > >> >> >> >>>>>>>already in gem5.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>This argument holds only if both implementations are correct (robust).
> > >> >> >> >>>>>>It still seems to me that pd-gem5 does not provide correctness for the
> > >> >> >> >>>>>>synchronization/checkpointing parts.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>Again, please read my first comment regarding the correctness of
> > >> >> >> >>>>>pd-gem5.
> > >> >> >> >>>>>
> > >> >> >> >>>>>
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>* Integrating with gem5 mainstream:
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>The pd-gem5 launch script is written in Python, which is well suited
> > >> >> >> >>>>>>>for integration with gem5's Python scripts, whereas multi-gem5 uses a
> > >> >> >> >>>>>>>bash script. Also, all of pd-gem5's source files are already part of
> > >> >> >> >>>>>>>gem5 mainstream, whereas multi-gem5 has tcp_server.cc/hh, which is a
> > >> >> >> >>>>>>>standalone process and cannot be part of gem5.
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>The multi-gem5 launch script is simple enough to rely only on the
> > >> >> >> >>>>>>shell. It could obviously be re-written in Python easily if that added
> > >> >> >> >>>>>>any value. The tcp_server component is only a utility (like the "m5"
> > >> >> >> >>>>>>utility that is also part of gem5).
> > >> >> >> >>>>>>
> > >> >> >> >>>>>The thing is that users are likely to want to add functionality to the
> > >> >> >> >>>>>multi/pd-gem5 run script. For example, the pd-gem5 run script supports
> > >> >> >> >>>>>launching simulations through a simulation pool management system,
> > >> >> >> >>>>>HTCondor (http://research.cs.wisc.edu/htcondor/). Writing the script in
> > >> >> >> >>>>>Python makes it easy for users to add this kind of support.
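> > >> >> >> >>>>>For a sense of what that looks like, here is a minimal Python sketch
> > >> >> >> >>>>>(the flag names are placeholders, not pd-gem5's actual options): the
> > >> >> >> >>>>>launcher spawns one gem5 process per simulated node plus a switch
> > >> >> >> >>>>>process, passing each node its rank and the node count; the same loop
> > >> >> >> >>>>>could just as easily emit HTCondor submit descriptions instead of
> > >> >> >> >>>>>calling Popen directly.
> > >> >> >> >>>>>
> > >> >> >> >>>>>    import subprocess
> > >> >> >> >>>>>
> > >> >> >> >>>>>    def launch_cluster(gem5_bin, node_cfg, switch_cfg, num_nodes):
> > >> >> >> >>>>>        # One process for the simulated switch box ...
> > >> >> >> >>>>>        procs = [subprocess.Popen([gem5_bin, switch_cfg])]
> > >> >> >> >>>>>        # ... and one full-system gem5 process per node.
> > >> >> >> >>>>>        for rank in range(num_nodes):
> > >> >> >> >>>>>            procs.append(subprocess.Popen(
> > >> >> >> >>>>>                [gem5_bin, node_cfg,
> > >> >> >> >>>>>                 "--rank", str(rank),            # placeholder flags
> > >> >> >> >>>>>                 "--num-nodes", str(num_nodes)]))
> > >> >> >> >>>>>        return [p.wait() for p in procs]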
> > >> >> >> >>>>>
> > >> >> >> >>>>>
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>Cheers,
> > >> >> >> >>>>>>- Gabor
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>
> > >> >> >> >>>>>>>On Fri, Jun 26, 2015 at 8:40 PM, Curtis Dunham <curtis.dun...@arm.com> wrote:
> > >> >> >> >>>>>>>
> > >> >> >> >>>>>>>>Hello everyone,
> > >> >> >> >>>>>>>>We have taken a look at how pd-gem5 compares with multi-gem5. While
> > >> >> >> >>>>>>>>intending to deliver the same functionality, there are some crucial
> > >> >> >> >>>>>>>>differences:
> > >> >> >> >>>>>>>>
> > >> >> >> >>>>>>>>*  Synchronization.
> > >> >> >> >>>>>>>>
> > >> >> >> >>>>>>>>    pd-gem5 implements this in Python (not a problem in itself;
> > >> >> >> >>>>>>>>    aesthetically this is nice, but...).  The issue is that pd-gem5's
> > >> >> >> >>>>>>>>    data packets and barrier messages travel over different sockets.
> > >> >> >> >>>>>>>>    Since pd-gem5 could see data packets passing synchronization
> > >> >> >> >>>>>>>>    barriers, it could create an inconsistent checkpoint.
> > >> >> >> >>>>>>>>
> > >> >> >> >>>>>>>>    multi-gem5's synchronization is implemented in C++ using sync
> > >> >> >> >>>>>>>>    events, but more importantly, the messages queue up in the same
> > >> >> >> >>>>>>>>    stream and so cannot have the issue just described.  (Event
> > >> >> >> >>>>>>>>    ordering is often crucial in snapshot protocols.)  Therefore we
> > >> >> >> >>>>>>>>    feel that multi-gem5 is a more robust solution in this respect.
> > >> >> >> >>>>>>>>
> > >> >> >> >>>>>>>>*  Packet handling.
> > >> >> >> >>>>>>>>
> > >> >> >> >>>>>>>>    pd-gem5 uses EtherTap for data packets but changed the polling
> > >> >> >> >>>>>>>>    mechanism to go through the main event queue.  Since this rate is
> > >> >> >> >>>>>>>>    actually linked with simulator progress, it cannot guarantee that
> > >> >> >> >>>>>>>>    the packets are serviced at regular intervals of real time.  This
> > >> >> >> >>>>>>>>    can lead to packets queueing up, which would contribute to the
> > >> >> >> >>>>>>>>    synchronization issues mentioned above.
> > >> >> >> >>>>>>>>
> > >> >> >> >>>>>>>>    multi-gem5 uses plain sockets with separate receive threads and so
> > >> >> >> >>>>>>>>    does not have this issue.
> > >> >> >> >>>>>>>>
> > >> >> >> >>>>>>>>* Checkpoint accuracy.
> > >> >> >> >>>>>>>>
> > >> >> >> >>>>>>>>
_______________________________________________
gem5-dev mailing list
gem5-dev@gem5.org
http://m5sim.org/mailman/listinfo/gem5-dev
