Hi all,

Sorry for taking so long to update the pd-gem5 patch.
Here is the new set of patches:

http://reviews.gem5.org/r/3025/
http://reviews.gem5.org/r/3024/
http://reviews.gem5.org/r/3023/

These three are new pd-gem5 patches. I discarded the old patches as I have changed them extensively.

http://reviews.gem5.org/r/3021/

This is a redistribution of http://reviews.gem5.org/r/2305/ and is independent of the pd-gem5 patch. But since it's a well-written switch model with some useful features, I think there is value in revisiting and committing it. I have made some slight modifications and fixed some bugs in it; however, it still has the issue raised by Steve.

I changed the pd-gem5 patch to address its problems based on the discussion in this email thread. I think this design pretty much has the best of both pd-gem5 and multi-gem5. Here is an overview of the new pd-gem5 patch:

synchronization: It uses one socket connection for delivering both data packets and sync messages, so data packets never bypass sync barriers. Synchronization is done inside the EtherTap interface by scheduling sendSync & recvSync events, without any need for a separate process governing the simulation (the "tcp-server" in multi-gem5 or the "barrier" process in the old pd-gem5). In each pd-simulation there is a central switch box that is responsible for forwarding data packets and managing the sendSync & recvSync messages from the other gem5 processes (it still allows the hierarchical simulation that we discussed earlier).

communication: Takes place via the EtherTap interface.

checkpointing: It implements synchronous checkpointing.

This new patch provides robustness while preserving all the features of the previous pd-gem5 design, with an even cleaner and easier-to-use implementation. Regarding hierarchical simulation, I was able to run a simulation with 8 gem5 processes simulating 32 nodes (each gem5 process simulating 4 full systems with a local switch and an up-link) and one gem5 process simulating a top-level switch box (32 nodes + 9 switches in total).
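As an aside on the single-socket design above: the "data packets never bypass sync barriers" property falls out of the FIFO ordering of a stream socket. A minimal, self-contained sketch (the frame format and names here are made up for illustration and are not taken from the patch):

```python
import socket
import struct

# Hypothetical frame types; the real patch defines its own message format.
MSG_DATA, MSG_SYNC = 0, 1

def send_msg(sock, kind, payload=b""):
    # Length-prefixed frame: 1-byte type + 4-byte payload length + payload.
    sock.sendall(struct.pack("!BI", kind, len(payload)) + payload)

def recv_msg(sock):
    # Read exactly one frame back off the stream, in order.
    hdr = sock.recv(5, socket.MSG_WAITALL)
    kind, length = struct.unpack("!BI", hdr)
    payload = sock.recv(length, socket.MSG_WAITALL) if length else b""
    return kind, payload

# Demonstrate the ordering guarantee with a local socket pair.
a, b = socket.socketpair()
send_msg(a, MSG_DATA, b"pkt-1")   # data packet sent before the barrier...
send_msg(a, MSG_SYNC)             # ...then the sync message
received = [recv_msg(b)[0] for _ in range(2)]
# Stream FIFO ordering: the data frame is always read first.
print(received)  # [0, 1]
```

Because the data frame and the sync frame share one stream, the receiver can never observe the barrier before a packet that was sent ahead of it, which is why no separate barrier process is needed to order them.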
However, this would be useless with single-threaded gem5; it would be handy and fast with multi-threaded gem5.

Thanks,
Mohammad

On Tue, Jul 21, 2015 at 3:30 PM, Mohammad Alian <[email protected]> wrote:

> Hi Steve,
>
> Sorry for the misinterpretation; my comment on communication is not
> correct, and there is certainly value in using other programming models.
> My argument is that if we could get multi-threaded gem5 working and model
> multiple nodes inside one multi-threaded gem5 process (which is definitely
> faster than using separate gem5 processes), then it's reasonable to assume
> that the number of nodes in pd-gem5 would not exceed a couple of gem5
> processes, and as we can synchronize pd-gem5 nodes at a coarser granularity,
> this will diminish the value of having higher-performance communication
> programming models. Nevertheless, although pd-gem5 doesn't provide abstract
> functions for communication, I think the effort for implementing it with
> another programming model would be on par with the effort needed for
> multi-gem5.
>
> Maybe I'm missing something, but I didn't understand your point here:
> "Note that this doesn't strictly require that the switch model is
> co-located with this central coordination process; that just happens to be
> convenient and efficient." I cannot see how we can have both a central
> server and a switch model (potentially distributed), because the central
> server does packet routing, which is supposed to be done in the switch
> model.
>
> Thank you for proposing the reduction method for ensuring on-time packet
> delivery. We can implement this in pd-gem5, and I'm working on that right
> now. Also, I modified the pd-gem5 patch to enable replication of the same
> simulation inside one process, as well as distributing the switch model
> across full-system gem5 processes (hierarchical network topology).
> This should also work with multi-threaded gem5, as the synchronization of
> pd-gem5 nodes is independent of the internal synchronization of
> multi-threaded gem5 processes. I'll update the pd-gem5 patch soon.
>
> Thank you,
> Mohammad
>
> On Sat, Jul 18, 2015 at 7:44 PM, Steve Reinhardt <[email protected]> wrote:
>
>> Hi Mohammad,
>>
>> Thanks for the summaries & responses.
>>
>> I agree with your summaries on synchronization and checkpointing. However,
>> as far as communication goes, I'd like to clarify, as I'm not sure exactly
>> what you mean by "communicating through socket is sufficient and we don't
>> need to expand this with other programming models". Socket communication is
>> sufficient for now, but I think there is potentially a lot of value in
>> being able to take advantage of higher-performance networking models such
>> as MPI and InfiniBand. What I like about MultiIface is that it provides an
>> abstraction that should enable the development of other messaging layers.
>> It's a little premature to know exactly how well that will work until you
>> try to develop a second implementation, but at least the concept is right.
>>
>> As far as synchronization goes, that's a harder problem. I do think we need
>> to have a model that always works, not almost always works. This is
>> particularly challenging with sockets, which weren't built for fine-grain
>> communication and synchronization, which is one reason why I think there
>> would be a lot of value in moving to more HPC-oriented communication models
>> like MPI on systems that support them. In fact, I know that the common MPI
>> platforms (MPICH and Open MPI) both support Ethernet transport, so it might
>> even be the case that coding to MPI would provide performance that's as
>> good as or better than going directly to sockets.
>>
>> The advantage of having all communication routed through a single central
>> server (as in multi-gem5) is that you can provide ordering guarantees
>> between the point-to-point messages and the barrier messages, so that you
>> can guarantee that a message sent before a barrier is initiated has been
>> received and processed before that barrier completes. Note that this
>> doesn't strictly require that the switch model is co-located with this
>> central coordination process; that just happens to be convenient and
>> efficient.
>>
>> There are other ways to guarantee all messages have been delivered besides
>> relying on socket ordering. For example, if you have each node track the
>> net number of messages it has sent (msgs sent - msgs rcvd) and do a
>> reduction on this value instead of a simple barrier, then you know all
>> messages have been delivered when this value reaches zero. (In fact, IIRC,
>> that's what we did in WWT, using the CM-5 hardware reduction network.) You
>> have to iterate over the reduction until the value is zero, which could
>> theoretically cost some performance, but depending on the timing of the
>> network the number of iterations would be small---though you'd have to hit
>> zero on the first try most of the time to get the same performance as a
>> barrier.
>>
>> So I admit I hadn't thought enough about this before, but we shouldn't
>> consider the multi-gem5 and pd-gem5 approaches as the only two possible
>> ways of doing communication & synchronization.
>>
>> I'm going to consult with our local MPI expert and see what he thinks
>> about using MPI here.
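The reduction scheme Steve describes above can be sketched in a few lines. This is a toy, single-process illustration: `Network`, `Node`, and `reduction_barrier` are hypothetical names, and a real distributed implementation would use something like MPI_Allreduce for the global sum instead of summing in-process.

```python
# Toy illustration of the "net messages" reduction barrier: each node
# contributes (sent - received); the barrier completes only when the
# global sum is zero, i.e. every in-flight message has been drained.

class Node:
    def __init__(self):
        self.sent = 0       # messages this node has sent
        self.received = 0   # messages this node has processed
        self.inbox = []

    def drain(self):
        # Process everything that has already arrived.
        self.received += len(self.inbox)
        self.inbox.clear()

class Network:
    def __init__(self):
        self.in_flight = []   # packets sent but not yet arrived

    def send(self, src, dest, pkt):
        src.sent += 1
        self.in_flight.append((dest, pkt))

    def deliver_one_round(self):
        # Model link latency: everything in flight arrives this round.
        for dest, pkt in self.in_flight:
            dest.inbox.append(pkt)
        self.in_flight.clear()

def reduction_barrier(nodes, net):
    """Repeat the reduction until global (sent - received) hits zero."""
    iterations = 0
    while True:
        iterations += 1
        for n in nodes:
            n.drain()
        # Global reduction of (sent - received); zero => all delivered.
        if sum(n.sent - n.received for n in nodes) == 0:
            return iterations
        net.deliver_one_round()  # wait for stragglers, then re-check

net = Network()
a, b = Node(), Node()
net.send(a, b, "pkt")      # still in flight when the barrier starts
iters = reduction_barrier([a, b], net)
print(iters)  # 2: first pass sees sent - received == 1, second sees 0
```

The extra iterations are exactly the performance cost mentioned above: when a message is still in flight as the barrier starts, the reduction must run again before it can complete.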
>>
>> Steve
>>
>> On Thu, Jul 16, 2015 at 1:07 PM Mohammad Alian <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > Regarding combining MultiIface and the pd-gem5 network model, my
>> > understanding is that the MultiIface design is tightly dependent on a
>> > centralized module that does both packet forwarding and synchronization
>> > in the same place (the tcpserver in multi-gem5). The first thing that
>> > comes to mind is to integrate the barrier process's capabilities into
>> > the switch box model in pd-gem5. But by doing this, we would have to
>> > give up some of pd-gem5's desirable features: e.g., it would prevent us
>> > from having hierarchical network topologies (having local TOR switches
>> > inside each gem5 process that is simulating a rack), and it would
>> > introduce subtle issues if we want to integrate it across multiple
>> > synchronization domains some day.
>> >
>> > Maybe I didn't fully understand MultiIface. Gabor, please correct me if
>> > I'm wrong ...
>> >
>> > I understand your concerns about the robustness of the implementation,
>> > but doing synchronization independently has some benefits that you
>> > cannot achieve without it. Nevertheless, as I mentioned before, consider
>> > that a packet arrival violation almost never happens, and in those rare
>> > cases we can detect it and terminate the simulation. Please consider
>> > that we are synchronizing gem5 processes, which are orders of magnitude
>> > slower than physical hardware. Theoretically, this violation happens
>> > when the wall-clock time of sending a data packet from the source
>> > EtherTap (socket) to the destination one takes more than the wall-clock
>> > time of completing one global synchronization (sending a sync message to
>> > the barrier process, receiving the sync message back from the barrier,
>> > and simulating a quantum), which itself involves two back-and-forth
>> > socket communications between the gem5 processes and the barrier.
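The detect-and-terminate fallback Mohammad mentions can be sketched at the receiver (a hypothetical sketch, not taken from the patch; `SYNC_QUANTUM` and the function names are made up): each data packet carries the simulated tick at which it should be delivered, and if that tick is already in the past when the packet is read, the receiver aborts rather than silently delivering a late packet.

```python
# Hypothetical late-packet detection, following the argument above: a packet
# is "on time" only if the receiver has not yet simulated past its scheduled
# delivery tick when the packet is read from the socket.

SYNC_QUANTUM = 1000  # ticks per global synchronization quantum (made-up value)

class LatePacketError(Exception):
    pass

def check_arrival(current_tick, send_tick, link_delay_ticks):
    """Return the delivery tick, or raise if the packet arrived too late."""
    delivery_tick = send_tick + link_delay_ticks
    if delivery_tick < current_tick:
        # The receiver has already simulated past the delivery time:
        # terminate rather than deliver a packet into the past.
        raise LatePacketError(
            f"packet due at tick {delivery_tick}, already at {current_tick}")
    return delivery_tick

# On time: receiver is at tick 1500, packet due at tick 2000.
print(check_arrival(current_tick=1500, send_tick=1000, link_delay_ticks=1000))

# Too late: receiver has already reached tick 2500.
try:
    check_arrival(current_tick=2500, send_tick=1000, link_delay_ticks=1000)
except LatePacketError as e:
    print("violation:", e)
```

As argued above, the violating branch should essentially never fire in practice, because one global synchronization involves two socket round trips while a data packet needs only one.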
>> >
>> > Thanks,
>> > Mohammad
>> >
>> > On Thu, Jul 16, 2015 at 12:12 AM, Steve Reinhardt <[email protected]> wrote:
>> >
>> > > Sure, I am not saying we will get there soon, but I am glad we agree
>> > > on what is desirable.
>> > >
>> > > Actually, my reference to SST was not in regard to multi-threading
>> > > within a host, but as a system that parallelizes the simulation of a
>> > > single cache-coherent system across multiple hosts. I am not
>> > > advocating their approach :). I was just pushing back on your
>> > > statement that parallelizing the simulation of a single cache-coherent
>> > > system is "questionable" by providing a counter-example. If you want
>> > > to counter that by calling SST itself questionable, go ahead; I don't
>> > > know what their speedup numbers look like, so I can neither criticize
>> > > nor defend them on that point.
>> > >
>> > > When I mentioned that you could probably get decent speedups
>> > > parallelizing a large KNL-like coherent system across 4-8 cores, I was
>> > > thinking of a single-process parallel model like our multi-queue
>> > > model, where the synchronization overheads should be much lower. Also,
>> > > I meant "decent speedups" with respect to optimized single-threaded
>> > > simulations, factoring in the overheads of parallelization. We haven't
>> > > shown this yet, but I don't think there are fundamental reasons it
>> > > couldn't be achieved.
>> > >
>> > > Anyway, getting back to nearer-term issues, I'll say again that the
>> > > one thing I clearly prefer about pd-gem5 over multi-gem5 is that it is
>> > > using a real gem5 switch model, which indicates to me that it should
>> > > be possible to create a single-process single-threaded gem5 simulation
>> > > that gets the same result as a parallel simulation.
>> > > I don't think you can do that in multi-gem5, since the switch model
>> > > has to run in its own process, as it's not really a gem5 model. It's
>> > > not a fatal flaw, and in the near term there may not even be
>> > > significant practical consequences, but to me it's rather inelegant in
>> > > that it ties the parallelization and the simulation model very
>> > > intimately together, rather than trying to provide a general framework
>> > > for multi-host parallel simulation.
>> > >
>> > > For example, let's say I decided I wanted to model a non-Ethernet
>> > > network (maybe InfiniBand?), and wanted to model it in more detail
>> > > with multiple switches and links between the switches. Let's further
>> > > suppose that I wanted to build a single set of IB switch and link
>> > > models (as SimObjects) and use them in two modes: one with
>> > > trace-driven network traffic, where perhaps a single-threaded
>> > > single-process simulation would be fast enough, and one that is
>> > > execution-driven, where I would want to parallelize the simulation
>> > > across multiple hosts. It seems like that would be a lot more
>> > > straightforward in pd-gem5.
>> > >
>> > > So at a high level it seems to me that a solution that combines the
>> > > MultiIface work from multi-gem5 with the pd-gem5 switch model would be
>> > > the best of both. I haven't looked at the code closely enough to know
>> > > why that won't work, so I'll let you tell me.
>> > >
>> > > Steve
>> > >
>> > > On Mon, Jul 13, 2015 at 12:24 AM Andreas Hansson <[email protected]> wrote:
>> > >
>> > > > Hi Steve,
>> > > >
>> > > > Thanks for the elaborate comments. I agree with all the points of
>> > > > what is desired, but I am also painfully aware of some real
>> > > > empirical data points suggesting it will be difficult, if not
>> > > > impossible, to get there.
>> > > >
>> > > > To take a concrete example, you mention SST for multi-threading
>> > > > within one host. It may well give you 4-8X speedup doing so, but
>> > > > comparing gem5 classic to SST, we are looking at roughly a 4X speed
>> > > > difference. Hence, you only gain back what you lost in making it
>> > > > multi-threaded in the first place, and now you are using ~8X the
>> > > > resources. Hence my worry with trying to find one mechanism for
>> > > > doing it all within the simulator. I hope you're right (and I'm
>> > > > wrong), but I would like to see at least one data point hinting that
>> > > > it is possible to achieve what you are describing. So far I am not
>> > > > convinced.
>> > > >
>> > > > Andreas
>> > > >
>> > > > On 13/07/2015 05:36, "gem5-dev on behalf of Steve Reinhardt"
>> > > > <[email protected] on behalf of [email protected]> wrote:
>> > > >
>> > > > >Hi Andreas,
>> > > > >
>> > > > >Thanks for the comments---I partially agree, but I think the
>> > > > >structure of your comments is the most interesting part to me, as I
>> > > > >believe it reveals a difference in our thinking. I'll elaborate
>> > > > >below. (Now that I'm done, I'll apologize in advance for perhaps
>> > > > >elaborating too much!)
>> > > > >
>> > > > >On Wed, Jul 8, 2015 at 1:23 PM Andreas Hansson <[email protected]> wrote:
>> > > > >
>> > > > >> Gents,
>> > > > >>
>> > > > >> I'll let Gabor expound on the value of the non-synchronised
>> > > > >> checkpoints.
>> > > > >>
>> > > > >> When it comes to the parallelisation, I think it is pretty clear
>> > > > >> that:
>> > > > >>
>> > > > >> 1. The value of parallelising a single (cache-coherent) gem5
>> > > > >> instance is questionable,
>> > > > >
>> > > > >I think that depends a lot on the parameters. If you are trying to
>> > > > >model a quad-core system on a quad-core system, then I agree with
>> > > > >you.
>> > > > >However, if the number of simulated cores >> the number of host
>> > > > >cores, it can make sense, as then each host core will model
>> > > > >multiple simulated cores, and the relative overhead of
>> > > > >synchronization will go down. So if you're trying to model
>> > > > >something like a Knights Landing chip with 60+ cores, I expect you
>> > > > >could get pretty decent speedup if you parallelized the simulation
>> > > > >across 4-8 host cores.
>> > > > >
>> > > > >Things also look a little different if you're doing heterogeneous
>> > > > >nodes; perhaps you might benefit from having one thread model all
>> > > > >the CPUs while another thread (or a few threads) is used to model
>> > > > >the GPU.
>> > > > >
>> > > > >Note that, IIRC, the SST folks at Sandia are mostly using SST to
>> > > > >model large-scale multi-threaded systems, not distributed
>> > > > >message-passing systems---and this is using MPI for
>> > > > >parallelization, not shared memory.
>> > > > >
>> > > > >> and the cost of making gem5 thread safe is high.
>> > > > >
>> > > > >While this is indisputably true for the patch we have up on
>> > > > >reviewboard, I'm not convinced that's a fundamental truth. I think
>> > > > >that with some effort this cost can be driven down a lot.
>> > > > >
>> > > > >> That said, if someone wants to do it, the multi-event-queue
>> > > > >> approach seems like a good start.
>> > > > >
>> > > > >No argument there.
>> > > > >
>> > > > >> 2. Parallelising gem5 on the node level and the inter-node
>> > > > >> level, using one mechanism, seems like an odd goal.
>> > > > >
>> > > > >When you say "node" here, do you mean host node or simulated node?
>> > > > >If the former, I agree; if the latter, I disagree.
>> > > > >
>> > > > >In particular, if you mean the latter, then the extrapolation of
>> > > > >what you're saying is that we will end up with one model of a
>> > > > >multi-node system if we're going to run the model on a single host,
>> > > > >and a different model of the same multi-node system if we intend to
>> > > > >run the model on multiple hosts---like what we see now with
>> > > > >multi-gem5, where the switch model for a distributed simulation
>> > > > >isn't even a gem5 model and couldn't be used if you wanted to run
>> > > > >the whole model inside a single gem5 process. Having a single
>> > > > >simulation model that doesn't change regardless of how we execute
>> > > > >the simulation seems a lot more elegant to me, and we actually
>> > > > >achieve that with the multi-event-queue feature. Obviously there
>> > > > >will be practical constraints on how a model can be partitioned
>> > > > >across multiple host nodes, and little things like instantiating a
>> > > > >different flavor of EtherLink depending on whether it's an
>> > > > >intra-host-node or inter-host-node connection don't bother me that
>> > > > >much, but to the extent possible I believe we should keep those as
>> > > > >merely practical constraints and not fundamental limitations.
>> > > > >
>> > > > >> Just like OpenMP and OpenMPI are well suited for different
>> > > > >> communication mechanisms, I would argue that we need
>> > > > >> parallelisation techniques well suited for the systems the
>> > > > >> simulation will run on.
>> > > > >
>> > > > >Yes, I agree, we need a message-based parallelization scheme for
>> > > > >multi-node hosts, and a shared-memory-based scheme for
>> > > > >intra-host-node parallelization. Two different techniques for two
>> > > > >different environments.
>> > > > >But that doesn't mean they can't co-exist & complement each other,
>> > > > >rather than being mutually exclusive options, much like many
>> > > > >programs are written in MPI+OpenMP.
>> > > > >
>> > > > >> A very natural (and efficient) way of doing things is to map each
>> > > > >> gem5 instance (and thus simulated node) to a host machine, and
>> > > > >> have the host machines communicate over Ethernet.
>> > > > >
>> > > > >That's certainly natural if the number of simulated nodes is equal
>> > > > >to the number of host nodes. It's not so obvious to me that you
>> > > > >want every simulated node in its own gem5 process, communicating
>> > > > >over sockets, if the number of simulated nodes >> the number of
>> > > > >host nodes. Sure, given that the code is written, that's a quick
>> > > > >way to get things working while we polish up the multi-event-queue
>> > > > >fixes, but that doesn't mean it's the ideal long-term strategy. In
>> > > > >particular, if you go to the degenerate case where we have multiple
>> > > > >simulated nodes and a single host node, then using multiple
>> > > > >processes means we have two different parallelization strategies
>> > > > >for running on a multi-core shared-memory host. Not that we would
>> > > > >(or could) ban people from running multiple gem5 instances on a
>> > > > >single system, but a more relevant question is: given a finite
>> > > > >amount of effort, would we want to spend it on writing a
>> > > > >shared-memory backend for MultiIface or on addressing the
>> > > > >performance issues in making a single gem5 process thread-safe?
>> > > > >Obviously I favor the latter, because I think it's a more general
>> > > > >solution, and I believe one that will lead to higher performance on
>> > > > >single-node hosts in the end.
>> > > > >
>> > > > >Note that I'm definitely not saying that all of this needs to be
>> > > > >implemented before we commit anything from multi-gem5 or pd-gem5.
>> > > > >I'm just trying to establish a vision for where we think gem5
>> > > > >should go with respect to parallelization, so that we can choose
>> > > > >short-term steps that align best with that destination, even if
>> > > > >they are only initial steps down that path.
>> > > > >
>> > > > >Steve
>> > > > >
>> > > > >> Do you agree?
>> > > > >>
>> > > > >> Andreas
>> > > > >
>> > > > >_______________________________________________
>> > > > >gem5-dev mailing list
>> > > > >[email protected]
>> > > > >http://m5sim.org/mailman/listinfo/gem5-dev
