Spark gives you four of the classical collectives: broadcast, reduce,
scatter, and gather.  There are also a few additional primitives, mostly
based on a join.  Spark is certainly less optimized than MPI for these, but
that may not matter much in practice.  Spark does have one theoretical
disadvantage compared to MPI: every collective operation requires the task
closures to be distributed, and, to my knowledge, this is an O(p)
operation, where p is the number of processors.  (Perhaps there has been
some progress on this?)  That O(p) term spoils any parallel isoefficiency
analysis.  In MPI, binaries are distributed once, and wireup is O(log p).
In practice, the O(p) term prevents reasonable-looking strong-scaling
curves: with MPI, the overall runtime stops declining and levels off as p
increases, but with Spark it can go up sharply.  So Spark is great for a
small cluster.  For a huge cluster, or a job with a lot of collectives, it
isn't so great.
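
To make the scaling argument above concrete, here is a rough toy cost
model in Python.  It is only a sketch, not a measurement: the function
names, the fixed amount of work, the number of collectives, and the
per-collective constant c are all made-up assumptions.  The only thing
taken from the argument above is the shape of the overhead term,
O(log p) per collective for MPI versus O(p) per collective for Spark.

```python
import math

# Toy strong-scaling model.  All constants are illustrative assumptions,
# not measurements of any real MPI or Spark installation.

def mpi_time(p, work=1000.0, n_coll=10, c=0.01):
    """Perfectly parallel work plus an O(log p) cost per collective."""
    return work / p + n_coll * c * math.log2(p)

def spark_time(p, work=1000.0, n_coll=10, c=0.01):
    """Perfectly parallel work plus an O(p) cost per collective."""
    return work / p + n_coll * c * p

def best_p(model, ps):
    """Return the processor count in ps that minimizes the modeled runtime."""
    return min(ps, key=model)

# Processor counts 1, 2, 4, ..., 4096.
ps = [2 ** k for k in range(13)]

if __name__ == "__main__":
    print(f"{'p':>5}  {'mpi':>10}  {'spark':>10}")
    for p in ps:
        print(f"{p:5d}  {mpi_time(p):10.3f}  {spark_time(p):10.3f}")
```

With these made-up constants, the modeled MPI runtime keeps falling
slowly across the whole range, while the modeled Spark runtime bottoms
out at a moderate p and then climbs again as the O(p) term takes over.
Where that turning point falls depends entirely on the assumed
constants.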


On Mon, Jun 16, 2014 at 1:36 PM, Bertrand Dechoux <decho...@gmail.com>
wrote:

> I guess you have to understand the difference in architecture. I don't
> know much about C++ MPI, but it is basically MPI, whereas Spark is inspired
> by Hadoop MapReduce and optimised for reading/writing large amounts of
> data with a smart caching and locality strategy. Intuitively, if you have a
> high CPU/message ratio, then MPI might be better. But what that ratio is
> is hard to say, and in the end it will depend on your specific
> application. Finally, in real life, this difference in performance due to
> the architecture may not be the only or the most important factor in the
> choice, as Michael already explained.
>
> Bertrand
>
> On Mon, Jun 16, 2014 at 1:23 PM, Michael Cutler <mich...@tumra.com> wrote:
>
>> Hello Wei,
>>
>> I speak from experience writing many distributed HPC applications using
>> Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and Parallel
>> Virtual Machine (PVM) way before that, back in the '90s.  I can say with
>> absolute certainty:
>>
>> *Any gains you believe there are because "C++ is faster than Java/Scala"
>> will be completely blown away by the inordinate amount of time you spend
>> debugging your code and/or reinventing the wheel to do even basic tasks
>> like linear regression.*
>>
>>
>> There are undoubtedly some very specialised use-cases where MPI and its
>> brethren still dominate for High Performance Computing tasks -- like for
>> example the nuclear decay simulations run by the US Department of Energy on
>> supercomputers where they've invested billions solving that use case.
>>
>> Spark is part of the wider "Big Data" ecosystem, and its biggest
>> advantages are traction amongst internet scale companies, hundreds of
>> developers contributing to it and a community of thousands using it.
>>
>> Need a distributed fault-tolerant file system? Use HDFS.  Need a
>> distributed/fault-tolerant message-queue? Use Kafka.  Need to co-ordinate
>> between your worker processes? Use Zookeeper.  Need to run it on a flexible
>> grid of computing resources and handle failures? Run it on Mesos!
>>
>> The barrier to entry to get going with Spark is very low: download the
>> latest distribution and start the Spark shell.  Language bindings for Scala
>> / Java / Python are excellent, meaning you spend less time writing
>> boilerplate code and more time solving problems.
>>
>> Even if you believe you *need* to use native code to do something
>> specific, like fetching HD video frames from satellite video capture cards
>> -- wrap it in a small native library and use the Java Native Access (JNA)
>> library to call it from your Java/Scala code.
>>
>> Have fun, and if you get stuck we're here to help!
>>
>> MC
>>
>>
>> On 16 June 2014 08:17, Wei Da <xwd0...@gmail.com> wrote:
>>
>>> Hi guys,
>>> We are choosing between C++ MPI and Spark. Is there any official
>>> comparison between them? Thanks a lot!
>>>
>>> Wei
>>>
>>
>>
>
