Re: GraphX: Performance comparison over cluster

2014-07-23 Thread ShreyanshB
Thanks Ankur.

The version with in-memory shuffle is here:
https://github.com/amplab/graphx2/commits/vldb. Unfortunately Spark has
changed a lot since then, and the way to configure and invoke Spark is
different. I can send you the correct configuration/invocation for this if
you're interested in benchmarking it.

It'd be great if you could tell me how to configure and invoke this Spark
version.



On Sun, Jul 20, 2014 at 9:02 PM, ankurdave [via Apache Spark User List] 
ml-node+s1001560n10281...@n3.nabble.com wrote:

 On Fri, Jul 18, 2014 at 9:07 PM, ShreyanshB [hidden email] wrote:

 Does the suggested version with in-memory shuffle affect performance too
 much?


 We've observed a 2-3x speedup from it, at least on larger graphs like
 twitter-2010 (http://law.di.unimi.it/webdata/twitter-2010/) and uk-2007-05
 (http://law.di.unimi.it/webdata/uk-2007-05/).

 (According to previously reported numbers, GraphX did 10 iterations in 142
 seconds, and in the latest stats it does them in 68 seconds.) Is it just the
 in-memory version that has changed?


 If you're referring to previous results vs. the arXiv paper, there were
 several improvements, but in-memory shuffle had the largest impact.

 Ankur http://www.ankurdave.com/



Re: GraphX: Performance comparison over cluster

2014-07-18 Thread ShreyanshB
Thanks a lot Ankur.

The version with in-memory shuffle is here:
https://github.com/amplab/graphx2/commits/vldb. Unfortunately Spark has
changed a lot since then, and the way to configure and invoke Spark is
different. I can send you the correct configuration/invocation for this if
you're interested in benchmarking it.

Actually, I wanted to see how GraphLab and GraphX perform on the cluster we
have (32 cores per node and InfiniBand). I tried the LiveJournal graph with
partitions = 400 (16 nodes, each with 32 cores), but it performed better with
partitions = 64. I'll try it again. Does the suggested version with in-memory
shuffle affect performance too much? (According to previously reported numbers,
GraphX did 10 iterations in 142 seconds, and in the latest stats it does them
in 68 seconds.) Is it just the in-memory version that has changed?





On Fri, Jul 18, 2014 at 8:31 PM, ankurdave [via Apache Spark User List] 
ml-node+s1001560n10227...@n3.nabble.com wrote:

 Thanks for your interest. I should point out that the numbers in the arXiv
 paper are from GraphX running on top of a custom version of Spark with an
 experimental in-memory shuffle prototype. As a result, if you benchmark
 GraphX at the current master, it's expected that it will be 2-3x slower
 than GraphLab.

 The version with in-memory shuffle is here:
 https://github.com/amplab/graphx2/commits/vldb. Unfortunately Spark has
 changed a lot since then, and the way to configure and invoke Spark is
 different. I can send you the correct configuration/invocation for this if
 you're interested in benchmarking it.

 On Fri, Jul 18, 2014 at 7:14 PM, ShreyanshB [hidden email] wrote:

 Should I use the PageRank application already available in GraphX for this
 purpose, or do I need to modify it or write my own?


 You should use the built-in PageRank. If your graph is available in edge
 list format, you can run it using the Analytics driver as follows:

 ~/spark/bin/spark-submit --master spark://$MASTER_URL:7077 \
   --class org.apache.spark.graphx.lib.Analytics \
   ~/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar \
   pagerank $EDGE_FILE --numEPart=$NUM_PARTITIONS --numIter=$NUM_ITERATIONS \
   [--partStrategy=$PARTITION_STRATEGY]
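
 For interactive experiments, a rough spark-shell equivalent of that invocation
 is sketched below; the HDFS path and the 512-partition count are placeholders
 rather than values from this thread:

 import org.apache.spark.graphx._
 // Load the edge list with roughly one partition per core (512 is only a placeholder).
 val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt",
   minEdgePartitions = 512).cache()
 // 10 iterations of static PageRank, the same computation as "pagerank ... --numIter=10".
 val ranks = graph.staticPageRank(10).vertices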

 What should the executor_memory be, i.e. the maximum available or sized
 according to the graph?


 As much memory as possible while leaving room for the operating system.
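
 For example (purely illustrative; the 54g figure assumes roughly 60 GB nodes
 and is not from this thread), you can pass --executor-memory to spark-submit,
 or set it programmatically when building the context:

 import org.apache.spark.{SparkConf, SparkContext}
 // Give the executors most of each node's RAM, leaving a few GB for the OS.
 val conf = new SparkConf()
   .setAppName("GraphX PageRank benchmark")
   .set("spark.executor.memory", "54g") // placeholder value
 val sc = new SparkContext(conf)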

 Is there any other configuration I should set to get the best performance?

 I think the parameters to Analytics above should be sufficient:

 - numEPart - should be equal to, or a small integer multiple of, the number
 of cores. More partitions improve work balance but also increase memory
 usage and communication, so in some cases it can even be faster to use fewer
 partitions than cores.
 - partStrategy - If your edges are already sorted, you can skip this
 option, because GraphX will leave them as-is by default and that may be
 close to optimal. Otherwise, EdgePartition2D and RandomVertexCut are both
 worth trying; a small sketch follows below.
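
 A minimal, self-contained sketch of trying an explicit strategy (the path,
 the 512-partition count, and the choice of EdgePartition2D are assumptions to
 benchmark, not recommendations):

 import org.apache.spark.graphx._
 val loaded = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt",
   minEdgePartitions = 512)
 // Repartition the edges; skip this if the input is already sorted favourably.
 val graph = loaded.partitionBy(PartitionStrategy.EdgePartition2D).cache()
 // Force materialization so later timings measure PageRank rather than loading.
 graph.edges.count()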

 CC'ing Joey and Dan, who may have other suggestions.

 Ankur http://www.ankurdave.com/


 On Fri, Jul 18, 2014 at 7:14 PM, ShreyanshB [hidden email] wrote:

 Hi,

 I am trying to compare GraphX and other distributed graph processing systems
 (GraphLab) on my cluster of 64 nodes, each node having 32 cores and connected
 with InfiniBand.

 I looked at http://arxiv.org/pdf/1402.2394.pdf and the stats provided there.
 I have a few questions regarding configuration and achieving the best
 performance.

 * Should I use the PageRank application already available in GraphX for this
 purpose, or do I need to modify it or write my own?
 - If I shouldn't use the built-in PageRank, can you share your PageRank
 application?

 * What should the executor_memory be, i.e. the maximum available or sized
 according to the graph?

 * Other than the number of cores, executor_memory, and partition strategy, is
 there any other configuration I should set to get the best performance?

 I am using the following script:
 import org.apache.spark._
 import org.apache.spark.graphx._
 import org.apache.spark.rdd.RDD

 // Time loading the edge list into a graph with 32 edge partitions;
 // filepath points to the edge-list file and is defined elsewhere.
 val startgraphloading = System.currentTimeMillis
 val graph = GraphLoader.edgeListFile(sc, filepath, canonicalOrientation = true, minEdgePartitions = 32)
 val endgraphloading = System.currentTimeMillis
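
 (For context, the end-to-end measurement I have in mind continues the script
 above roughly as in the sketch below; the staticPageRank call and the
 10-iteration count are only my assumptions, mirroring the iteration count
 reported for the paper's benchmarks.)

 // Run and time the built-in static PageRank separately from loading.
 val startpagerank = System.currentTimeMillis
 val ranks = graph.staticPageRank(10).vertices
 ranks.count() // this action forces the computation so the timing is meaningful
 val endpagerank = System.currentTimeMillis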


 Thanks in advance :)




GraphX: optimal partitions for a graph and error in logs

2014-07-11 Thread ShreyanshB
Hi,

I am trying GraphX on the LiveJournal data. I have a cluster of 17 computing
nodes: 1 master and 16 workers. I have a few questions about this.
* I built Spark from the master branch (to avoid the partitionBy error in Spark 1.0).
* I am using edgeListFile() to load the data, and I figured I need to provide
the number of partitions I want. The exact syntax I am using is the following:
val graph = GraphLoader.edgeListFile(sc, filepath, canonicalOrientation = true, minEdgePartitions = 64)
  .partitionBy(PartitionStrategy.RandomVertexCut)

-- Is this the correct way to load the file to get the best performance?
-- What should the number of partitions be: equal to the number of computing
nodes or to the number of cores?
-- I see the following error many times in my logs:
ERROR BlockManagerWorker: Exception handling buffer message
java.io.NotSerializableException:
org.apache.spark.graphx.impl.ShippableVertexPartition
Does this suggest that my graph wasn't partitioned properly? I suspect it
affects performance.

Please let me know whether I'm following every step correctly.

Thanks in advance,
-Shreyansh





Re: GraphX: optimal partitions for a graph and error in logs

2014-07-11 Thread ShreyanshB
Thanks a lot, Ankur. I'll follow that.

One last quick question:
Does that error affect performance?

~Shreyansh





Re: GraphX: optimal partitions for a graph and error in logs

2014-07-11 Thread ShreyanshB
Great! Thanks a lot.
Hate to say this, but I promise this is the last quickie.

I looked at the configurations but didn't find any parameter to tune for
network bandwidth, i.e. is there any way to tell GraphX (Spark) that I'm using
a 1G network, a 10G network, or InfiniBand? Does it figure this out on its own
and speed up message passing accordingly?





Re: GraphX: optimal partitions for a graph and error in logs

2014-07-11 Thread ShreyanshB
Perfect! Thanks Ankur.


