Here is my attempt to summarize this long discussion (I only cover the points that were raised in this email thread).

1- Synchronization: using one socket for both communication and synchronization is the superior design (multi-gem5). Both pd-gem5 and multi-gem5 use barrier synchronization.
2- Communication: communicating through sockets is sufficient, and we don’t need to extend this to other programming models.
3- Checkpointing: we should go for synchronized checkpointing.

Here are some major differences between pd-gem5 and multi-gem5 and their current status:

Accuracy:
multi-gem5: It sends a header packet before each data packet, containing info for precise packet delivery. Ensuring in-order packet delivery is still under test.
pd-gem5: Each packet has a time-stamp to ensure precise packet delivery. Accurate communication is well tested.

Network topologies:
multi-gem5: It can model a star topology.
pd-gem5: It can model arbitrary network topologies.

Integration (gem5 direction for parallelization):
multi-gem5: “multi-gem5 does not support fine grain simulation of hierarchical switches (or any other network topologies except a single crossbar) or multiple synchronization domains currently”
pd-gem5: It can be replicated/integrated with single-threaded gem5.

-Mohammad

On Thu, Jul 16, 2015 at 12:12 AM, Steve Reinhardt <[email protected]> wrote:

> Sure, I am not saying we will get there soon, but I am glad we agree on what is desirable.
>
> Actually my reference to SST was not in regard to multi-threading within a host, but as a system that parallelizes the simulation of a single cache-coherent system across multiple hosts. I am not advocating their approach :). I was just pushing back on your statement that parallelizing the simulation of a single cache-coherent system is "questionable" by providing a counter-example. If you want to counter that by calling SST itself questionable, go ahead; I don't know what their speedup numbers look like, so I can neither criticize nor defend them on that point.
>
> When I mentioned that you could probably get decent speedups parallelizing a large KNL-like coherent system across 4-8 cores, I was thinking of a single-process parallel model like our multi-queue model, where the synchronization overheads should be much lower. Also I meant "decent speedups" with respect to optimized single-threaded simulations, factoring in the overheads of parallelization. We haven't shown this yet, but I don't think there are fundamental reasons it couldn't be achieved.
>
> Anyway, getting back to nearer-term issues, I'll say again that the one thing I clearly prefer about pd-gem5 over multi-gem5 is that it is using a real gem5 switch model, which indicates to me that it should be possible to create a single-process single-threaded gem5 simulation that gets the same result as a parallel simulation. I don't think you can do that in multi-gem5, since you have to have the switch model running in its own process since it's not really a gem5 model. It's not a fatal flaw, and in the near term there may not even be significant practical consequences, but to me it's rather inelegant in that it is tying the parallelization and the simulation model very intimately together, rather than trying to provide a general framework for multi-host parallel simulation.
>
> For example, let's say I decided I wanted to model a non-Ethernet network (maybe InfiniBand?), and wanted to model it in more detail with multiple switches and links between the switches. Let's further suppose that I wanted to build a single set of IB switch and link models (as SimObjects) and use them in two modes: one with trace-driven network traffic, where perhaps a single-threaded single-process simulation would be fast enough, and in an execution-driven model, where I would want to parallelize the simulation across multiple hosts. It seems like that would be a lot more straightforward in pd-gem5.
>
> So at a high level it seems to me that a solution that combines the MultiIface work from multi-gem5 with the pd-gem5 switch model would be the best of both. I haven't looked at the code closely enough to know why that won't work, so I'll let you tell me.
>
> Steve
>
>
> On Mon, Jul 13, 2015 at 12:24 AM Andreas Hansson <[email protected]> wrote:
>
> > Hi Steve,
> >
> > Thanks for the elaborate comments. I agree with all the points of what is desired, but I am also painfully aware of some real empirical data points suggesting it will be difficult, if not impossible, to get there.
> >
> > To take a concrete example, you mention SST for multi-threading within one host. It may well give you 4-8X speedup doing so, but comparing gem5 classic to SST, we are looking at roughly a 4X speed difference. Hence, you only gain back what you lost in making it multi-threaded in the first place, and now you are using ~8X the resources. Hence my worry with trying to find one mechanism or doing it all within the simulator. I hope you’re right (and I’m wrong), but I would like to see at least one data point hinting that it is possible to achieve what you are describing. So far I am not convinced.
> >
> > Andreas
> >
> >
> > On 13/07/2015 05:36, "gem5-dev on behalf of Steve Reinhardt" <[email protected] on behalf of [email protected]> wrote:
> >
> > >Hi Andreas,
> > >
> > >Thanks for the comments---I partially agree, but I think the structure of your comments is the most interesting to me, as I believe it reveals a difference in our thinking. I'll elaborate below. (Now that I'm done, I'll apologize in advance for perhaps elaborating too much!)
> > >
> > >On Wed, Jul 8, 2015 at 1:23 PM Andreas Hansson <[email protected]> wrote:
> > >
> > >> Gents,
> > >>
> > >> I’ll let Gabor expound on the value of the non-synchronised checkpoints.
> > >>
> > >> When it comes to the parallelisation, I think it is pretty clear that:
> > >>
> > >> 1. The value of parallelising a single (cache coherent) gem5 instance is questionable,
> > >
> > >
> > >I think that depends a lot on the parameters. If you are trying to model a quad-core system on a quad-core system, then I agree with you. However, if the number of simulated cores >> the number of host cores, it can make sense, as then each host core will model multiple simulated cores, and the relative overhead of synchronization will go down. So if you're trying to model something like a Knights Landing chip with 60+ cores, I expect you could get pretty decent speedup if you parallelized the simulation across 4-8 host cores.
> > >
> > >Things also look a little different if you're doing heterogeneous nodes; perhaps you might benefit from having one thread model all the CPUs while another thread (or few threads) are used to model the GPU.
> > >
> > >Note that, IIRC, the SST folks at Sandia are mostly using SST to model large-scale multi-threaded systems, not distributed message-passing systems---and this is using MPI for parallelization, not shared memory.
> > >
> > >
> > >> and the cost of making gem5 thread safe is high.
> > >
> > >
> > >While this is indisputably true for the patch we have up on reviewboard, I'm not convinced that's a fundamental truth. I think that with some effort this cost can be driven down a lot.
> > >
> > >
> > >> That said, if someone wants to do it, the multi-event-queue approach seems like a good start.
> > >>
> > >
> > >No argument there.
> > >
> > >
> > >>
> > >> 2. Parallelising gem5 on the node level and the inter-node level, using one mechanism seems like an odd goal.
> > >
> > >
> > >When you say "node" here, do you mean host node or simulated node? If the former, I agree; if the latter, I disagree.
> > >
> > >In particular, if you mean the latter, then the extrapolation of what you're saying is that we will end up with one model of a multi-node system if we're going to run the model on a single host, and a different model of the same multi-node system if we intend to run the model on multiple hosts---like what we see now with multi-gem5 where the switch model for a distributed simulation isn't even a gem5 model and couldn't be used if you wanted to run the whole model inside a single gem5 process. Having a single simulation model that doesn't change regardless of how we execute the simulation seems a lot more elegant to me, and we actually achieve that with the multi-event-queue feature. Obviously there will be practical constraints on how a model can be partitioned across multiple host nodes, and little things like instantiating a different flavor of EtherLink depending on whether it's an intra-host-node or inter-host-node connection don't bother me that much, but to the extent possible I believe we should keep those as merely practical constraints and not fundamental limitations.
> > >
> > >
> > >> Just like OpenMP and OpenMPI are well suited for different communication mechanisms, I would argue that we need parallelisation techniques well suited for the systems the simulation will run on.
> > >
> > >
> > >Yes, I agree, we need a message-based parallelization scheme for multi-node hosts, and a shared-memory based scheme for intra-host-node parallelization. Two different techniques for two different environments. But that doesn't mean they can't co-exist & complement each other, rather than being mutually exclusive options, much like many programs are written in MPI+OpenMP.
> > >
> > >
> > >> A very natural (and efficient) way of doing things is to map each gem5 instance (and thus simulated node), to a host machine, and have the host machines communicate over Ethernet.
> > >>
> > >
> > >That's certainly natural if the number of simulated nodes is equal to the number of host nodes. It's not so obvious to me that you want every simulated node in its own gem5 process, communicating over sockets, if the number of simulated nodes >> the number of host nodes. Sure, given that the code is written, that's a quick way to get things working while we polish up the multi-event-queue fixes, but that doesn't mean it's the ideal long-term strategy. In particular, if you go to the degenerate case where we have multiple simulated nodes and a single host node, then using multiple processes means we have two different parallelization strategies for running on a multi-core shared-memory host. Not that we would (or could) ban people from running multiple gem5 instances on a single system, but a more relevant question is, given a finite amount of effort, would we want to spend it on writing a shared-memory backend for MultiIface or on addressing the performance issues in making a single gem5 process thread-safe? Obviously I favor the latter because I think it's a more general solution, and I believe one that will lead to higher performance on single-node hosts in the end.
> > >
> > >Note that I'm definitely not saying that all of this needs to be implemented before we commit anything from multi-gem5 or pd-gem5. I'm just trying to establish a vision for where we think gem5 should go with respect to parallelization, so that we can choose short-term steps that align best with that destination, even if they are only initial steps down that path.
> > >
> > >Steve
> > >
> > >
> > >>
> > >> Do you agree?
> > >>
> > >> Andreas
> > >>
> > >_______________________________________________
> > >gem5-dev mailing list
> > >[email protected]
> > >http://m5sim.org/mailman/listinfo/gem5-dev
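
For illustration only, the barrier synchronization and time-stamped packet delivery mentioned in the summary at the top of this message can be sketched in a few lines of single-process Python. This is not code from pd-gem5 or multi-gem5: Node, simulate, LINK_LATENCY and QUANTUM are hypothetical names and values. The sketch also assumes that the synchronization quantum is no longer than the link latency, so a packet sent during one quantum is never due before the next barrier.

```python
import heapq

LINK_LATENCY = 100  # ticks; hypothetical value
QUANTUM = 100       # ticks between barriers; assumed <= LINK_LATENCY


class Node:
    """Stand-in for one simulated node (hypothetical, not a gem5 class)."""

    def __init__(self, name):
        self.name = name
        self.now = 0
        self.inbox = []     # min-heap of (deliver_tick, seq, payload)
        self.received = []  # (deliver_tick, payload), in delivery order

    def run_until(self, end_tick):
        # Deliver every packet whose time stamp falls before end_tick.
        while self.inbox and self.inbox[0][0] < end_tick:
            tick, _, payload = heapq.heappop(self.inbox)
            self.received.append((tick, payload))
        self.now = end_tick


def simulate(sends, end_tick, names=("a", "b")):
    """sends: iterable of (send_tick, src, dst, payload) tuples."""
    assert QUANTUM <= LINK_LATENCY, "quantum must not exceed link latency"
    nodes = {name: Node(name) for name in names}
    pending = sorted(sends)
    seq = 0  # tie-breaker so equal time stamps keep send order
    now = 0
    while now < end_tick:
        quantum_end = now + QUANTUM
        # Packets sent during this quantum are stamped with their arrival tick.
        while pending and pending[0][0] < quantum_end:
            send_tick, _src, dst, payload = pending.pop(0)
            heapq.heappush(nodes[dst].inbox,
                           (send_tick + LINK_LATENCY, seq, payload))
            seq += 1
        # Every node simulates up to the barrier before any node moves on.
        for node in nodes.values():
            node.run_until(quantum_end)
        now = quantum_end  # barrier reached
    return nodes
```

For example, simulate([(10, "a", "b", "p1"), (20, "a", "b", "p0")], 400) delivers p1 at tick 110 and p0 at tick 120 on node b, both during the second quantum. Because delivery is driven by the time stamp rather than by when the packet happens to arrive, the result does not depend on how the work is split between barriers.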
