Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread Steve Loughran

Note that even the Facebook "Four Degrees of Separation" paper went down to a single machine running WebGraph (http://webgraph.di.unimi.it/) for the final steps, after running jobs in their Hadoop cluster to build the dataset for that final operation:

"The computations were performed on a 24-core machine with 72 GiB of memory and 1 TiB of disk space. The first task was to import the Facebook graph(s) into a compressed form for WebGraph [4], so that the multiple scans required by HyperANF’s diffusive process could be carried out relatively quickly."

Some toolkits/libraries are optimised for that single dedicated use, yet sit downstream of the raw data: there, memory reads and L1-L3 cache locality become the main performance problem, and synchronisation techniques like BSP aren't necessarily needed.
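To make the locality point concrete, here is a minimal sketch (plain Scala; the class and method names are illustrative, not from WebGraph) of the flat compressed-adjacency layout such single-machine tools favour, where a full edge scan walks memory sequentially instead of chasing pointers:

// Hypothetical CSR-style (compressed sparse row) graph: two flat arrays,
// laid out so that a full edge scan is one sequential, cache-friendly pass.
final class CsrGraph(offsets: Array[Int], targets: Array[Int]) {
  def numVertices: Int = offsets.length - 1

  // Visit every edge once; the inner loop reads `targets` sequentially,
  // which is what keeps L1-L3 cache misses low on a single machine.
  def foreachEdge(f: (Int, Int) => Unit): Unit = {
    var u = 0
    while (u < numVertices) {
      var i = offsets(u)
      while (i < offsets(u + 1)) { f(u, targets(i)); i += 1 }
      u += 1
    }
  }
}

// Example: an out-degree histogram in one scan.
//   val degrees = new Array[Int](g.numVertices)
//   g.foreachEdge((u, _) => degrees(u) += 1)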





Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread jay vyas
Just as Spark disrupted the Hadoop ecosystem by changing the assumption that you can't rely on memory in distributed analytics... now maybe we are challenging the assumption that big data analytics need to be distributed?

I've been asking the same question lately and have similarly seen that Spark performs quite reliably and well on a local single-node system, even for a streaming app which I ran for ten days in a row... I almost felt guilty that I never put it on a cluster!

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread Steve Loughran


Modern machines can be pretty powerful: 16 physical cores HT'd to 32, 384+ GB of RAM, a GPU, giving you lots of compute. What you don't get is the storage capacity to match, and especially the IO bandwidth. RAID-0 striping 2-4 HDDs gives you some boost, but if you are reading, say, a 4 GB file from HDFS broken into 256 MB blocks, you have that data spread over 16 blocks, replicated 3 ways: (4*4*3) = 48 block replicas to read from in parallel. Algorithm and capacity permitting, you've just massively cut your load time. Downstream, if the data can be thinned down, then you can start looking more at things you can do on a single host: a machine that can be in your Hadoop cluster. Ask YARN nicely and you can get a dedicated machine for a couple of days (i.e. until your Kerberos tokens expire).
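A quick back-of-the-envelope check of that block arithmetic, as a Scala snippet (the per-disk rate is an assumed figure for illustration, not a measurement):

// Rough HDFS read-parallelism estimate; all rates are assumptions.
val fileGB      = 4.0
val blockMB     = 256.0
val replication = 3
val diskMBps    = 100.0                        // one HDD's sequential read rate

val blocks   = (fileGB * 1024 / blockMB).toInt // 16 blocks
val replicas = blocks * replication            // 48 block replicas to choose from

val oneDiskSecs = fileGB * 1024 / diskMBps     // ~41 s on a single drive
val clusterSecs = blockMB / diskMBps           // ~2.6 s if all 16 blocks are
                                               // read concurrently
println(f"$blocks blocks, $replicas replicas: $oneDiskSecs%.0f s vs $clusterSecs%.1f s")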



Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread Franc Carter
One issue is that 'big' becomes 'not so big' reasonably quickly. A couple of terabytes is not that challenging (depending on the algorithm) these days, whereas 5 years ago it was a big challenge. We have a bit over a petabyte (not using Spark), and using a distributed system is the only viable way to get reasonable performance for reasonable cost.

cheers



Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-29 Thread Eran Medan
Hi Sean,
I think your point about the ETL costs is the winning argument here, but I would like to see more research on the topic.

What I would like to see researched is the ability to run a specialized set of common algorithms in a fast local mode, just as a compiler optimizer can decide to inline some methods, or rewrite a recursive function as a loop when it's in tail position. I would say the future of GraphX could be that if a certain algorithm is a well-known one (e.g. shortest paths) and can be run locally faster than on a distributed data set (taking into account the cost of bringing all the data local), then it will do so.
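A minimal sketch of what such a dispatch could look like, using GraphX's existing ShortestPaths as the distributed fall-back. The object name, the edge-count threshold, and the local BFS fast path are all hypothetical; nothing like this exists in GraphX today:

import scala.collection.mutable
import scala.reflect.ClassTag
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.graphx.lib.ShortestPaths

// Hypothetical dispatcher: run a well-known algorithm locally when the
// graph is small enough, otherwise fall back to the distributed version.
object LocalFastPath {
  val localEdgeLimit = 50L * 1000 * 1000 // assumed to fit in driver memory

  /** Hop counts from every vertex to `landmark` (unweighted shortest paths). */
  def hopCounts[VD, ED: ClassTag](g: Graph[VD, ED],
                                  landmark: VertexId): Map[VertexId, Int] =
    if (g.edges.count() <= localEdgeLimit) {
      // Small graph: pull the edges to the driver and run a plain BFS.
      // GraphX's ShortestPaths counts hops from each vertex *to* the
      // landmark, so the BFS walks edges backwards from the landmark.
      val preds = g.edges.map(e => (e.dstId, e.srcId)).collect()
        .groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }
      val dist  = mutable.Map(landmark -> 0)
      val queue = mutable.Queue(landmark)
      while (queue.nonEmpty) {
        val v = queue.dequeue()
        for (u <- preds.getOrElse(v, Array.empty[VertexId]) if !dist.contains(u)) {
          dist(u) = dist(v) + 1
          queue.enqueue(u)
        }
      }
      dist.toMap // unreachable vertices are simply absent
    } else {
      // Large graph: GraphX's Pregel-based implementation, unchanged.
      ShortestPaths.run(g, Seq(landmark)).vertices.collect()
        .flatMap { case (v, spmap) => spmap.get(landmark).map(v -> _) }
        .toMap
    }
}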

Thanks!



Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-27 Thread Jörn Franke
Hello,

Every problem you want to solve with technology needs a good justification for the particular technology chosen. So the first thing to ask is: which technology fits my current and future problems? This is also what the article says. Unfortunately, it only provides a vague answer as to why there is this performance gap. Is it a Spark architecture issue? Is it a configuration issue? Is it a design issue in the Spark versions of the algorithms? Is it an Amazon issue? Why did he use a laptop and not a single Amazon machine to compare? Why did he not run multiple threads on a single machine (for some problems a single thread might be the fastest solution anyway)?

Based on my experience, a single machine can already be quite useful for graph algorithms. There are also different graph systems, each for different purposes. Spark GraphX is more general (it can be used in combination with the whole Spark platform!) and probably less performant than highly specialized graph systems leveraging GPUs etc. Those systems have the disadvantage that they are not generally suitable for, or integrated with, other types of processing, such as streaming, MapReduce, RDDs, etc.

For any technology, I am always curious about why and where one loses performance. That's why one does proofs of concept and evaluates technology depending on the business case. Maybe the article is right, but it is unclear whether it can be generalized, or whether it really has an impact on your business case for Spark/GraphX. His algorithms can only do graph processing for a very special case and are not suitable for a general all-purpose big data infrastructure.

Best regards



Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-27 Thread Eran Medan
Remember that article that went viral on HN? (Where a guy showed how GraphX / Giraph / GraphLab / Spark have worse performance on a 128-node cluster than on a single-threaded machine? If not, here is the article:
http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)

Well, as you may recall, this stirred up a lot of commotion in the big data community (and Spark/GraphX in particular).

People (justly, I guess) blamed him for not really having “big data”, as all of his data set fits in memory, so it doesn't really count.

So he took up the challenge and came back with a counter-benchmark that is pretty hard to argue with, now with a huge data set (1TB of data, encoded down to 154GB using Hilbert curves, but still large); see
http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html
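As an aside on that Hilbert-curve trick: the idea is to map each edge (src, dst) to its position along a space-filling curve, so that sorting edges by that key keeps nearby edges nearby on disk and makes the stream delta-encode well. A sketch of the classic bit-twiddling construction (plain Scala, not McSherry's Rust; the function name and `order` parameter are illustrative):

// Distance of (x, y) along a Hilbert curve over an n-by-n grid, n = 2^order.
// Standard construction; no overflow guards, so keep order <= 31 here.
def hilbertIndex(order: Int, srcId: Long, dstId: Long): Long = {
  val n = 1L << order
  var x = srcId; var y = dstId; var d = 0L
  var s = n / 2
  while (s > 0) {
    val rx = if ((x & s) != 0) 1L else 0L
    val ry = if ((y & s) != 0) 1L else 0L
    d += s * s * ((3 * rx) ^ ry)
    if (ry == 0) {        // rotate the quadrant so the sub-curves line up
      if (rx == 1) { x = n - 1 - x; y = n - 1 - y }
      val t = x; x = y; y = t
    }
    s /= 2
  }
  d
}

// e.g. sort the edge list by curve position before delta-encoding it:
//   edges.sortBy { case (src, dst) => hilbertIndex(27, src, dst) }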

He provided the source here as an example: https://github.com/frankmcsherry/COST

His benchmark shows that on a graph with 128 billion edges, he got 2x to 10x faster results with a single-threaded, Rust-based implementation.

So, what is the counter-argument? It pretty much seems like a slap in the face of Spark / GraphX etc. (which I like and use on a daily basis).

Before I dive into re-validating his benchmarks with my own use cases: what is your opinion on this? If this is the case, then what IS the use case for using Spark/GraphX at all?


Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-27 Thread Sean Owen
(I bet the Spark implementation could be improved. I bet GraphX could
be optimized.)

Not sure about this one, but in-core benchmarks often start by assuming that the data is local. In the real world, it is unlikely to be. The benchmark has to include the cost of bringing all the data to the local computation too, since the whole point of distributed computation is bringing the work to the data.
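To put a rough number on that shipping cost (the link speeds are assumed figures for illustration):

// Back-of-the-envelope: time just to move a 1TB data set to one machine.
val bytes = 1.0e12

def hoursAt(gbitPerSec: Double): Double =
  bytes / (gbitPerSec * 1e9 / 8) / 3600

println(f"1 Gbit/s:  ${hoursAt(1)}%.1f h")  // ~2.2 hours
println(f"10 Gbit/s: ${hoursAt(10)}%.1f h") // ~0.2 h, i.e. about 13 minutes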

Specialist implementations for a special problem should always win over generalist ones, and Spark is a generalist. Likewise, you can factor matrices far faster on a GPU than in Spark. These aren't entirely either/or propositions; you can use Rust or a GPU inside a larger distributed program.

Typically a real-world problem involves more than the core computation: ETL, security, monitoring. Generalists are more likely to have an answer to hand for these.

Specialist implementations do just one thing, and they typically have
to be custom built. Compare the cost of highly skilled developer time
to generalist computing resources; $1m buys several dev years but also
rents a small data center.

Speed is an important issue but by no means everything in the real
world, and these are rarely mutually exclusive options in the OSS
world. This is a great piece of work, but I don't think it's some kind
of argument against distributed computing.



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org