Re: [stratosphere-dev] Spark comparison

Kostas Tzoumas Sun, 31 Aug 2014 06:43:50 -0700

Hi Anirvan,

I am not sure that the discussion of hype, sales, etc. is relevant to the
dev@ mailing list.


My opinion on your first two questions:

1. An updated performance comparison would indeed be very nice. Some Flink
contributors have started work on performance scripts for Flink, Spark,
and MapReduce here:
https://github.com/project-flink/flink-perf. Help in this direction would
definitely be very welcome; perhaps you would be interested in contributing
there! As Stephan said, the community still needs to do some work on how
the runtime deals with serialized data in order for a performance
comparison to make sense for Flink.

Keep in mind that Flink is an open source project, so you do not need
permission from anyone to publish studies on it in conferences ;-)

2. I would focus on measures of performance and scalability across various
setups, data sets, and job complexities. One could also think about
usability, but measuring it objectively is in my experience hard and
time-consuming (it requires user studies), and the results may soon be
obsolete, as Flink is adding API features.

Kostas


On Sat, Aug 30, 2014 at 8:14 PM, Anirvan Basu <
anirvan.b...@alumni.insead.edu> wrote:

> Stephan and Kostas,
>
> I agree that the study is a year old (so old in terms of dev timeframe for
> both these projects).
> It seems that Spark has caught a good wind in its sails - Google, Facebook,
> Yahoo, IBM ... what about you folks ?
> Are you also pitching to these giants ? Let's assume that it is a fat-tail
> scenario.
> It appears to me to be something similar to MongoDB in the NoSQL world
> (compared to Raven or Couch) :-) Still need to figure out hype or reality!
>
> I tried Spark 1.0.2 this week:
> - installation was fairly simple,
> - the Python API made it easy to write some beyond-hello-world programmes
> (did not check their R package though),
> - they also have a good streaming package,
> - advantage was a good series of tutorials & webinars (helps to get rid of
> the fear of "jumping into the water" for dummies like me)
>
> Some pertinent questions:
> 1. Would you be interested, if we did a neutral comparison of Flink and
> Spark, baselined to Hadoop M-R framework ? I was also thinking of adding
> Summingbird - would like to know your viewpoints there.
> If we did publish, we would try to present it in some conference
> naturally! So think of the perils as well ;-)
> Actually, Robert had asked me a similar question - he put the idea in my
> head!
>
> 2. To what set of criteria would you want to compare Flink and Spark ?
>
> 3. Where do you stand on graph-based algos ? We are looking for a stable
> framework for graph-based programmes - like balanced graph partitioning,
> evolution, ... - and in that respect Spark GraphX appeared very interesting.
> I know you have your own Spargel there - so how do you compare? Do you
> also do vertex-based balanced partitioning (e.g. JA-BE-JA k-way
> partitioning) ? Can you do edge-based partitioning ? I didn't come across
> any framework that realizes the latter.
> Here attached is a simple paper presented by an Italian research group -
> they jumped on to the Spark bandwagon!
> Let me know your opinions (perhaps, you may know the group already)
>
> Best !
> Anirvan
>
>
> -----Original Message-----
> From: Stephan Ewen [mailto:se...@apache.org]
> Sent: Saturday, 30 August 2014 18:26
> To: dev@flink.incubator.apache.org
> Subject: Re: [stratosphere-dev] Spark comparison
>
> Hi!
>
> I agree with Kostas, the code base of Stratosphere that was used was quite
> old.
>
> The current Flink version is different already, with the new APIs and
> different type handling.
>
> Flink is taking a route that makes sure that the runtime is very robust,
> memory-wise. We currently pay a few CPU cycles of overhead for that, but we
> have an effort going to bring that down.
>
> It would be interesting to rerun the experiments then...
>
> Greetings,
> Stephan
>
>
>
> On Sat, Aug 30, 2014 at 9:16 AM, Kostas Tzoumas <ktzou...@apache.org>
> wrote:
>
> > Hi Anirvan,
> >
> > Yes, I am familiar with this thesis. I think that this comparison is
> > by now quite old (>1 year if I am not mistaken), and both systems have
> > evolved substantially since then.
> >
> > Kostas
> >
> >
> > On Fri, Aug 29, 2014 at 7:01 PM, Robert Metzger <rmetz...@apache.org>
> > wrote:
> >
> > > Forwarding the message to the new mailing list ...
> > >
> > > ---------- Forwarded message ----------
> > > From: Nirvanesque <nirvanesque.pa...@gmail.com>
> > > Date: Fri, Aug 29, 2014 at 1:57 PM
> > > Subject: Re: [stratosphere-dev] Spark comparison
> > > To: stratosphere-...@googlegroups.com
> > >
> > >
> > > Ufuk and the Flink team,
> > >
> > > You and your team are familiar by now with this comparison (Master's
> > > thesis of Ze Ni at the KTH Institute)
> > > http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf
> > >
> > > I would like to know your views on it.
> > >
> > > Thanks in advance,
> > > Anirvan
> > >
> > >
> > >
> > > On Tuesday, December 3, 2013 6:19:57 PM UTC+1, Ufuk Celebi wrote:
> > >
> > > > Hey Ankur,
> > > >
> > > > > I like the idea of a comparison matrix. We tried to do something
> > > > > similar with Hadoop already (parts of it are on the front page of our
> > > > > website), which we used for a local summit here. Comparing
> > > > > Stratosphere to Spark in this way would be a natural extension to
> > > > > this. ;-)
> > > >
> > > > > Internally, we ran some benchmarks against 0.7.3 (unfortunately
> > > > > right before the 0.8 release). We didn't publish the results as
> > > > > there are certain aspects that make the comparison unfair (for
> > > > > example we have no fault tolerance right now whereas Spark does). As
> > > > > soon as we (re-)introduce fault tolerance mechanisms, we will re-run
> > > > > the benchmarks.
> > > >
> > > > > I can publish the code for the Stratosphere and Spark programs we
> > > > > looked at on GitHub. If I add Scala versions of the Stratosphere
> > > > > programs, this will also go in your proposed direction of having a
> > > > > direct comparison.
> > > >
> > > > > Is there any specific use case where you want to see numbers? Or
> > > > > is it more like a general thing where you want to see how both
> > > > > systems perform?
> > > >
> > > > Best,
> > > >
> > > > Ufuk
> > > >
> > > > On 03 Dec 2013, at 18:03, Ankur Chauhan <an...@malloc64.com> wrote:
> > > >
> > > > Hi all,
> > > >
> > > >
> > > > > Sitting at Spark Summit 2013, I was interested in figuring out if
> > > > > anyone has done a feature comparison and/or benchmarks against
> > > > > Spark/Storm/etc. This may also serve as a "compatibility matrix" and
> > > > > would help a lot when people want to compare the two projects and
> > > > > help us understand what the strengths and weaknesses of each project
> > > > > are.
> > > >
> > > > -- Ankur
> > > >
> > > > --
> > > > > You received this message because you are subscribed to the Google
> > > > > Groups "stratosphere-dev" group.
> > > > > To unsubscribe from this group and stop receiving emails from it,
> > > > > send an email to stratosphere-d...@googlegroups.com.
> > > >
> > > > Visit this group at http://groups.google.com/group/stratosphere-dev.
> > > > For more options, visit https://groups.google.com/groups/opt_out.
> > > >
> > > >
> > > >  --
> > > You received this message because you are subscribed to the Google
> > > Groups "stratosphere-dev" group.
> > > To unsubscribe from this group and stop receiving emails from it,
> > > send an email to stratosphere-dev+unsubscr...@googlegroups.com.
> > > Visit this group at http://groups.google.com/group/stratosphere-dev.
> > > For more options, visit https://groups.google.com/d/optout.
> > >
> >
>
