Hi Spark experts
I'm using ehiggs/spark-terasort to exercise my cluster.
I don't understand how to run TeraSort in a standard way on a cluster.
Currently, all the input and output data lives in HDFS, and I can
generate, sort, and validate all the sample data, but I'm not sure
this is the right way to do it:
./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraGen
spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar 256g
hdfs:///tmp/data_in
./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraSort
spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar
hdfs:///tmp/data_in hdfs:///test/data_out
./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraValidate
spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar
hdfs:///test/data_out file:///tmp/data_validate
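To make the three stages easier to compare, here is a minimal sketch of the same pipeline as one script. The master URL, executor memory, and output paths are assumptions (placeholders for whatever your cluster uses), and the script only echoes the commands instead of running them, so it can be inspected safely:

```shell
#!/bin/sh
# Hypothetical sketch -- paths, master URL, and memory sizes are assumptions.
JAR=spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar
MASTER=mesos://zk://zk1:2181/mesos     # assumed Mesos master URL
IN=hdfs:///tmp/data_in
OUT=hdfs:///test/data_out

SUBMIT="./bin/spark-submit --master $MASTER --executor-memory 4g"

# Echo the three stages rather than executing them, so the full
# generate/sort/validate pipeline is visible at a glance:
echo "$SUBMIT --class com.github.ehiggs.spark.terasort.TeraGen $JAR 256g $IN"
echo "$SUBMIT --class com.github.ehiggs.spark.terasort.TeraSort $JAR $IN $OUT"
echo "$SUBMIT --class com.github.ehiggs.spark.terasort.TeraValidate $JAR $OUT file:///tmp/data_validate"
```

Wrapping the common spark-submit flags in one variable keeps the per-stage lines down to just the class and its arguments.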
That being said, all 256 GB of input data is stored in HDFS, and the
Mesos slaves need to access that HDFS-based input data.
This leads to another question: how should HDFS be set up in a standard way?
Is there any documentation summarizing how to set up a standard runtime
environment for TeraSort on Spark?
Thanks.