Hi Sean,
I think your point about the ETL costs is the winning argument here, but I
would like to see more research on the topic.

What I would like to see researched is the ability to run a specialized set
of common algorithms in a "fast-local-mode": just as a compiler optimizer
can decide to inline some methods, or rewrite a recursive function as a
loop if it is in tail position, the future of GraphX could be that if a
certain algorithm is a well-known one (e.g. shortest paths) and can be run
faster locally than on a distributed cluster (taking into account the cost
of bringing all the data to one machine), then it will do so.
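
To make that concrete, here is a minimal sketch of what such a cost-based
dispatch could look like. To be clear, this is not GraphX API - the stats,
throughput numbers, and helper names below are all hypothetical placeholders
for whatever cost model a real optimizer would use:

object FastLocalMode {
  // Cheap-to-read statistics a planner could consult before running a job.
  // (Hypothetical; a real planner would derive these from graph metadata.)
  case class GraphStats(edges: Long, bytesPerEdge: Long)

  // Time to ship the whole edge set to a single machine.
  def shipCostSeconds(stats: GraphStats, networkBytesPerSec: Double): Double =
    stats.edges.toDouble * stats.bytesPerEdge / networkBytesPerSec

  // Time for a tight single-threaded implementation, COST-style.
  def localCostSeconds(stats: GraphStats, edgesPerSecLocal: Double): Double =
    stats.edges.toDouble / edgesPerSecLocal

  // Time for the general distributed implementation.
  def distributedCostSeconds(stats: GraphStats, edgesPerSecCluster: Double): Double =
    stats.edges.toDouble / edgesPerSecCluster

  // Run locally only when local compute plus the cost of shipping the data
  // there beats the distributed estimate.
  def chooseExecution(stats: GraphStats,
                      networkBytesPerSec: Double,
                      edgesPerSecLocal: Double,
                      edgesPerSecCluster: Double): String = {
    val local = shipCostSeconds(stats, networkBytesPerSec) +
      localCostSeconds(stats, edgesPerSecLocal)
    val distributed = distributedCostSeconds(stats, edgesPerSecCluster)
    if (local < distributed) "fast-local-mode" else "distributed"
  }

  def main(args: Array[String]): Unit = {
    // Illustrative numbers only, loosely inspired by the 128B-edge benchmark.
    val stats = GraphStats(edges = 128L * 1000 * 1000 * 1000, bytesPerEdge = 8)
    println(chooseExecution(stats,
      networkBytesPerSec = 1e9,   // ~1 GB/s aggregate ingest to one node
      edgesPerSecLocal = 5e8,     // single-threaded edge throughput
      edgesPerSecCluster = 5e7))  // effective cluster-wide throughput
  }
}

With these made-up numbers the planner would pick the local path (~21
minutes to ship plus compute vs. ~43 minutes distributed); flip the
throughputs and it picks the cluster. That is exactly the decision I would
like GraphX to be able to make automatically.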

Thanks!

On Sat, Mar 28, 2015 at 1:34 AM, Sean Owen <so...@cloudera.com> wrote:

> (I bet the Spark implementation could be improved. I bet GraphX could
> be optimized.)
>
> Not sure about this one, but "in core" benchmarks often start by
> assuming that the data is local. In the real world, data is unlikely
> to be. The benchmark has to include the cost of bringing all the data
> to the local computation too, since the point of distributed
> computation is bringing work to the data.
>
> Specialist implementations for a special problem should always win
> over generalist ones, and Spark is a generalist. Likewise you can factor
> matrices way faster on a GPU than in Spark. These aren't entirely
> either/or propositions; you can use Rust or GPU in a larger
> distributed program.
>
> Typically a real-world problem involves more than core computation:
> ETL, security, monitoring. Generalists are more likely to have an
> answer to hand for these.
>
> Specialist implementations do just one thing, and they typically have
> to be custom built. Compare the cost of highly skilled developer time
> to generalist computing resources; $1m buys several dev years but also
> rents a small data center.
>
> Speed is an important issue but by no means everything in the real
> world, and these are rarely mutually exclusive options in the OSS
> world. This is a great piece of work, but I don't think it's some kind
> of argument against distributed computing.
>
>
> On Fri, Mar 27, 2015 at 6:32 PM, Eran Medan <ehrann.meh...@gmail.com>
> wrote:
> > Remember that article that went viral on HN? (Where a guy showed how
> > GraphX / Giraph / GraphLab / Spark have worse performance on a 128-core
> > cluster than on a 1-thread machine? If not, here is the article -
> > http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)
> >
> >
> > Well, as you may recall, this stirred up a lot of commotion in the big
> > data community (and Spark/GraphX in particular).
> >
> > People (justly, I guess) blamed him for not really having “big data”, as
> > all of his data set fits in memory, so it doesn't really count.
> >
> >
> > So he took the challenge and came up with a pretty hard-to-argue-with
> > counter-benchmark, now with a huge data set (1TB of data, encoded using
> > Hilbert curves down to 154GB, but still large). See -
> > http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html
> >
> > He provided the source here https://github.com/frankmcsherry/COST as an
> > example.
> >
> > His benchmark shows how, on a 128-billion-edge graph, he got 2x to 10x
> > faster results with a single-threaded Rust-based implementation.
> >
> > So, what is the counter-argument? It pretty much seems like a blow in
> > the face of Spark / GraphX etc. (which I like and use on a daily basis).
> >
> > Before I dive into re-validating his benchmarks with my own use cases,
> > what is your opinion on this? If this is the case, then what IS the use
> > case for using Spark/GraphX at all?
>
