ShreyanshB <shreyanshpbh...@gmail.com> writes:
>> The version with in-memory shuffle is here:
>> https://github.com/amplab/graphx2/commits/vldb.
>
> It'd be great if you can tell me how to configure and invoke this spark
> version.

Sorry for the delay on this. Assuming you're planning to launch an EC2 cluster, 
here's how to use the version of GraphX with in-memory shuffle:

1. Check out the in-memory shuffle branch locally. It's important to do this 
before launching the cluster to make sure the cluster gets set up in a way 
that's compatible with this version of Spark (using the v2 branch of 
https://github.com/mesos/spark-ec2).

    git clone https://github.com/amplab/graphx2 -b vldb
    mv graphx2 spark

2. Launch a cluster.

    cd spark
    ec2/spark-ec2 -s 16 -w 500 -k ec2-key-name -i path/to/ec2-key.pem \
        -t m2.4xlarge -z us-east-1e --spot-price=1 launch graphx-benchmarking

3. On the cluster, check out and build the in-memory shuffle branch.

    cd /mnt
    git clone https://github.com/amplab/graphx2 -b vldb
    mv graphx2 spark
    cd spark
    mkdir -p conf
    cp ~/spark/conf/* conf/
    sbt/sbt assembly
    rsync -r --delete . ~/spark
    ~/spark/sbin/stop-all.sh
    ~/spark-ec2/copy-dir --delete ~/spark
    ~/spark/sbin/start-all.sh

4. Load your input graph onto HDFS in edge list format.

    ~/ephemeral-hdfs/bin/hadoop fs -put edge-list.txt /
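
If you don't have an input file yet, here is a minimal sketch of what an edge list can look like, assuming the loader on this branch accepts one whitespace-separated "srcId dstId" pair per line and skips lines starting with '#' (that is how GraphLoader.edgeListFile behaves in released Spark; it has not been checked against the vldb branch). The four-edge graph is made up for illustration.

```shell
# Hypothetical sample graph: 4 directed edges over vertices 1..4.
cat > edge-list.txt <<'EOF'
# srcId dstId
1 2
2 3
3 1
4 1
EOF

# Count non-comment lines that are not exactly two integer vertex IDs.
bad=$(awk '!/^#/ && $0 !~ /^[0-9]+[ \t]+[0-9]+$/ { n++ } END { print n+0 }' edge-list.txt)
echo "malformed lines: $bad"
```

Running the same awk check on a real input before uploading is a cheap way to catch stray headers or blank lines.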

5. Run PageRank using the Analytics driver.

    cd ~/spark
    MASTER=spark://$(wget -q -O - http://169.254.169.254/latest/meta-data/public-hostname):7077
    /usr/bin/time -f "TOTAL TIME: %e seconds" ~/spark/bin/spark-class \
        org.apache.spark.graphx.lib.Analytics $MASTER pagerank /edge-list.txt \
        --numEPart=128 --numIter=10
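
GNU time writes its report to stderr, so for benchmarking it can help to append something like "2> run.log" to the command and pull the number out afterwards. A rough sketch, where run.log is a hypothetical file name and the 123.45 value is invented to stand in for real output:

```shell
# Hypothetical captured stderr; the timing value is made up for illustration.
printf 'some driver output\nTOTAL TIME: 123.45 seconds\n' > run.log

# Extract the elapsed seconds from the line produced by the -f format string.
elapsed=$(awk '/^TOTAL TIME:/ { print $3 }' run.log)
echo "elapsed seconds: $elapsed"
```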

Ankur
