Spark gives you four of the classical collectives: broadcast, reduce, scatter, and gather. There are a few additional primitives, mostly built on a join. Spark is certainly less optimized than MPI for these, but maybe that isn't such a big deal. Spark does have one theoretical disadvantage compared to MPI: every collective operation requires the task closures to be distributed, and, to my knowledge, that is an O(p) operation. (Perhaps there has been some progress on this?) That O(p) term spoils any parallel isoefficiency analysis. In MPI, binaries are distributed once, and wireup is O(log p). In practice, this shows up in strong scaling curves: with MPI, the overall runtime stops declining and levels off as p increases, but with Spark it can turn sharply upward. So Spark is great for a small cluster. For a huge cluster, or a job with a lot of collectives, it isn't so great.
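To make the isoefficiency point concrete, here is a toy strong-scaling model in plain Python. The constants W and a are purely illustrative assumptions, not measurements from either system; the only thing the sketch encodes is the shape of the overhead term: O(p) per collective for shipping closures versus O(log p) for a tree-based wireup/collective.

```python
import math

# Toy strong-scaling model (illustrative constants, not measurements).
# A job does W units of work split across p workers, plus a per-collective
# overhead: O(p) closure shipping (Spark-like) vs O(log p) tree wireup (MPI-like).
W = 1_000_000.0   # total work, arbitrary units (assumed)
a = 1.0           # per-worker overhead constant (assumed)

def spark_time(p):
    # O(p) overhead term from distributing task closures to every worker
    return W / p + a * p

def mpi_time(p):
    # O(log p) overhead term from a tree-structured collective
    return W / p + a * math.log2(p)

for p in [16, 256, 1024, 4096]:
    print(f"p={p:5d}  spark-like={spark_time(p):10.1f}  mpi-like={mpi_time(p):10.1f}")
```

With these made-up constants the Spark-like curve bottoms out near p = sqrt(W/a) and then climbs, while the MPI-like curve keeps declining and levels off, which is the qualitative behavior described above.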
On Mon, Jun 16, 2014 at 1:36 PM, Bertrand Dechoux <decho...@gmail.com> wrote:

> I guess you have to understand the difference in architecture. I don't know
> much about C++ MPI, but it is basically MPI, whereas Spark is inspired by
> Hadoop MapReduce and optimised for reading/writing large amounts of data
> with a smart caching and locality strategy. Intuitively, if you have a high
> CPU/message ratio then MPI might be better. But what that ratio is is hard
> to say, and in the end it will depend on your specific application. Finally,
> in real life, this difference in performance due to the architecture may not
> be the only or the most important factor of choice, as Michael already
> explained.
>
> Bertrand
>
> On Mon, Jun 16, 2014 at 1:23 PM, Michael Cutler <mich...@tumra.com> wrote:
>
>> Hello Wei,
>>
>> I speak from experience of writing many distributed HPC applications using
>> Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and Parallel
>> Virtual Machine (PVM) way before that, back in the 90's. I can say with
>> absolute certainty:
>>
>> *Any gains you believe there are because "C++ is faster than Java/Scala"
>> will be completely blown by the inordinate amount of time you spend
>> debugging your code and/or reinventing the wheel to do even basic tasks
>> like linear regression.*
>>
>> There are undoubtedly some very specialised use-cases where MPI and its
>> brethren still dominate High Performance Computing -- for example the
>> nuclear decay simulations run by the US Department of Energy on
>> supercomputers, where they've invested billions solving that use case.
>>
>> Spark is part of the wider "Big Data" ecosystem, and its biggest
>> advantages are traction amongst internet-scale companies, hundreds of
>> developers contributing to it, and a community of thousands using it.
>>
>> Need a distributed, fault-tolerant file system? Use HDFS. Need a
>> distributed, fault-tolerant message queue? Use Kafka. Need to coordinate
>> between your worker processes? Use ZooKeeper. Need to run it on a flexible
>> grid of computing resources and handle failures? Run it on Mesos!
>>
>> The barrier to entry with Spark is very low: download the latest
>> distribution and start the Spark shell. The language bindings for Scala /
>> Java / Python are excellent, meaning you spend less time writing
>> boilerplate code and more time solving problems.
>>
>> Even if you believe you *need* native code to do something specific, like
>> fetching HD video frames from satellite video capture cards -- wrap it in
>> a small native library and use the Java Native Access (JNA) interface to
>> call it from your Java/Scala code.
>>
>> Have fun, and if you get stuck we're here to help!
>>
>> MC
>>
>> On 16 June 2014 08:17, Wei Da <xwd0...@gmail.com> wrote:
>>
>>> Hi guys,
>>> We are making a choice between C++ MPI and Spark. Is there any official
>>> comparison between them? Thanks a lot!
>>>
>>> Wei