Hi Spark experts
I'm using ehiggs/spark-terasort to exercise my cluster.
I don't understand how to run TeraSort in a standard way on a cluster.
Currently, all the input and output data lives in HDFS, and I can
generate, sort, and validate all the sample data, but I'm not sure
this is the right way to do it:
./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraGen
spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar 256g
hdfs:///tmp/data_in
./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraSort
spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar
hdfs:///tmp/data_in hdfs:///test/data_out
./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraValidate
spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar
hdfs:///test/data_out file:///tmp/data_validate
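To make the three stages easier to compare, here is a minimal sketch of the same pipeline as one script. The master URL, executor memory, and output paths are assumptions (placeholders for whatever your cluster uses), and the script only echoes the commands instead of running them, so it can be inspected safely:

```shell
#!/bin/sh
# Hypothetical sketch -- paths, master URL, and memory sizes are assumptions.
JAR=spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar
MASTER=mesos://zk://zk1:2181/mesos     # assumed Mesos master URL
IN=hdfs:///tmp/data_in
OUT=hdfs:///test/data_out

SUBMIT="./bin/spark-submit --master $MASTER --executor-memory 4g"

# Echo the three stages rather than executing them, so the full
# generate/sort/validate pipeline is visible at a glance:
echo "$SUBMIT --class com.github.ehiggs.spark.terasort.TeraGen $JAR 256g $IN"
echo "$SUBMIT --class com.github.ehiggs.spark.terasort.TeraSort $JAR $IN $OUT"
echo "$SUBMIT --class com.github.ehiggs.spark.terasort.TeraValidate $JAR $OUT file:///tmp/data_validate"
```

Wrapping the common spark-submit flags in one variable keeps the per-stage lines down to just the class and its arguments.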
That being said, all 256 GB of input data is stored in HDFS, and the
Mesos slaves need to access that HDFS-based input data.
This leads to another question: how should HDFS be set up in a standard way?
Is there any documentation summarizing how to set up a standard runtime
environment for TeraSort on Spark?
Thanks.