Hi Spark experts,

I'm using ehiggs/spark-terasort to exercise my cluster.
I don't understand how to run TeraSort in a standard way when running on a cluster.


Currently, all the input and output data is stored in HDFS, and I can generate/sort/validate
all the sample data. But I'm not sure this is the right way to do it.

./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraGen spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar 256g hdfs:///tmp/data_in
./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraSort spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar hdfs:///tmp/data_in hdfs:///test/data_out
./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraValidate spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar hdfs:///test/data_out file:///tmp/data_validate
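
For reference, this is roughly how I submit the sort step when targeting the Mesos cluster; the master URL and resource settings below are placeholders for my environment, not something from the spark-terasort docs:

./bin/spark-submit \
  --master mesos://zk://zk1:2181/mesos \
  --total-executor-cores 64 \
  --executor-memory 8g \
  --class com.github.ehiggs.spark.terasort.TeraSort \
  spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar \
  hdfs:///tmp/data_in hdfs:///test/data_out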

That being said, all 256G of input data is stored in HDFS, and each Mesos slave needs to access the HDFS-based input data.
So this leads to another question: how should HDFS be set up in a standard way?
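
Right now I just verify from each slave that the HDFS paths are reachable before submitting. A minimal check, assuming the NameNode address is namenode-host:8020 (hypothetical, specific to my setup):

export HADOOP_CONF_DIR=/etc/hadoop/conf
hdfs dfs -ls hdfs://namenode-host:8020/tmp/data_in   # input should be listable from every slave
hdfs dfs -df -h                                      # confirm enough free space for the 256G output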

Are there any docs summarizing how to set up a standard runtime environment for TeraSort on Spark?

Thanks.
