ShreyanshB <shreyanshpbh...@gmail.com> writes:

>> The version with in-memory shuffle is here:
>> https://github.com/amplab/graphx2/commits/vldb.
>
> It'd be great if you can tell me how to configure and invoke this spark
> version.

Sorry for the delay on this. Assuming you're planning to launch an EC2
cluster, here's how to use the version of GraphX with in-memory shuffle:

1. Check out the in-memory shuffle branch locally. It's important to do
   this before launching the cluster to make sure the cluster gets set up
   in a way that's compatible with this version of Spark (using the v2
   branch of https://github.com/mesos/spark-ec2).

    git clone https://github.com/amplab/graphx2 -b vldb
    mv graphx2 spark

2. Launch a cluster.

    cd spark
    ec2/spark-ec2 -s 16 -w 500 -k ec2-key-name -i path/to/ec2-key.pem -t m2.4xlarge -z us-east-1e --spot-price=1 launch graphx-benchmarking

3. On the cluster, check out and build the in-memory shuffle branch.

    cd /mnt
    git clone https://github.com/amplab/graphx2 -b vldb
    mv graphx2 spark
    cd spark
    mkdir -p conf
    cp ~/spark/conf/* conf/
    sbt/sbt assembly
    rsync -r --delete . ~/spark
    ~/spark/sbin/stop-all.sh
    ~/spark-ec2/copy-dir --delete ~/spark
    ~/spark/sbin/start-all.sh

4. Load your input graph onto HDFS in edge list format.

    ~/ephemeral-hdfs/bin/hadoop fs -put edge-list.txt /

5. Run PageRank using the Analytics driver.

    cd ~/spark
    MASTER=spark://$(wget -q -O - http://169.254.169.254/latest/meta-data/public-hostname):7077
    /usr/bin/time -f "TOTAL TIME: %e seconds" ~/spark/bin/spark-class org.apache.spark.graphx.lib.Analytics $MASTER pagerank /edge-list.txt --numEPart=128 --numIter=10

Ankur
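
A note on the edge list format mentioned in the HDFS loading step: the
following is only a minimal sketch, assuming the layout that GraphX's
GraphLoader.edgeListFile reads, i.e. one whitespace-separated "srcId dstId"
pair per line, with lines starting with '#' skipped. Whether the vldb branch
parses input exactly this way is an assumption, and the vertex IDs are made
up for illustration. It writes a tiny edge-list.txt and sanity-checks it
before upload.

```shell
# Write a tiny example graph in edge list format: one "srcId dstId" pair
# per line, whitespace-separated. The vertex IDs are made up.
cat > edge-list.txt <<'EOF'
# lines starting with '#' are assumed to be skipped by the loader
1 2
2 3
3 1
4 1
EOF

# Count edges (non-comment lines).
edges=$(grep -vc '^#' edge-list.txt)

# Count non-comment lines that are not exactly two integer fields.
bad=$(awk '!/^#/ && (NF != 2 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/) {n++} END {print n+0}' edge-list.txt)

echo "edges=$edges malformed=$bad"
```

If the malformed count is nonzero, it is probably worth fixing the file
before running the hadoop fs -put command.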