AmpLab Big Data Benchmark for Spark error on EC2
I am trying to run the Big Data Benchmark <https://amplab.cs.berkeley.edu/benchmark/> on my EC2 cluster against my own fork of Spark 1.5, which only modifies some files in Spark core. My cluster has 1 master and 2 slave nodes of type m1.large. I used the ec2 scripts bundled with Spark to launch it. The cluster launched without problems and I can ssh into the master.

However, when I try to run the benchmark preparation from the master using the command

    ./runner/prepare-benchmark.sh --shark --aws-key-id= --aws-key= --shark-host= --shark-identity-file=/root/.ssh/id_rsa --scale-factor=1

I get the following error:

    === IMPORTING BENCHMARK DATA FROM S3 ===
    bash: /root/ephemeral-hdfs/bin/hdfs: No such file or directory
    Connection to ec2-54-201-169-165.us-west-2.compute.amazonaws.com closed.
    bash: /root/mapreduce/bin/start-mapred.sh: No such file or directory
    Connection to ec2-54-201-169-165.us-west-2.compute.amazonaws.com closed.
    Traceback (most recent call last):
      File "./prepare_benchmark.py", line 606, in <module>
        main()
      File "./prepare_benchmark.py", line 594, in main
        prepare_shark_dataset(opts)
      File "./prepare_benchmark.py", line 192, in prepare_shark_dataset
        ssh_shark("/root/mapreduce/bin/start-mapred.sh")
      File "./prepare_benchmark.py", line 180, in ssh_shark
        ssh(opts.shark_host, "root", opts.shark_identity_file, command)
      File "./prepare_benchmark.py", line 139, in ssh
        (identity_file, username, host, command), shell=True)
      File "/usr/lib64/python2.6/subprocess.py", line 505, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command 'ssh -t -o StrictHostKeyChecking=no -i /root/.ssh/id_rsa r...@ec2-54-201-169-165.us-west-2.compute.amazonaws.com 'source /root/.bash_profile; /root/mapreduce/bin/start-mapred.sh'' returned non-zero exit status 127

I have tried terminating the cluster and launching it again multiple times, but the problem persists. What could be the issue?
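A note on reading the failure: exit status 127 is what a POSIX shell returns when it cannot find the command it was asked to run, which is consistent with the two "No such file or directory" lines above. The sketch below only reproduces that behaviour locally. It is written for Python 3 (the traceback shows Python 2.6), and the path in MISSING is a hypothetical stand-in, not a path from the benchmark scripts.

```python
import subprocess

# Hypothetical path that does not exist; it stands in for a script such as
# /root/mapreduce/bin/start-mapred.sh that is absent on the master.
MISSING = "/nonexistent/bin/start-mapred.sh"


def run(command):
    """Run a shell command; return 0 on success, else the failing exit status."""
    try:
        # shell=True mirrors how the traceback shows check_call being used.
        subprocess.check_call(command, shell=True, stderr=subprocess.DEVNULL)
        return 0
    except subprocess.CalledProcessError as err:
        return err.returncode


status = run(MISSING)
print(status)  # 127: the shell could not find the command
```

If this reading is right, the traceback is only a symptom, and the first thing to check on the master is whether /root/ephemeral-hdfs/bin/hdfs and /root/mapreduce/bin/start-mapred.sh exist at all on the launched cluster.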
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/AmpLab-Big-Data-Benchmark-for-Spark-error-on-EC2-tp26207.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Amplab: big-data-benchmark
Hi Burak,

Thanks, I will then start benchmarking the cluster.

> Date: Wed, 27 Aug 2014 11:52:05 -0700
> From: bya...@stanford.edu
> To: ssti...@live.com
> CC: user@spark.apache.org
> Subject: Re: Amplab: big-data-benchmark
Re: Amplab: big-data-benchmark
Hi Sameer,

I've faced this issue before. The datasets don't show up in the listing at http://s3.amazonaws.com/big-data-benchmark/, but you can use them directly:

`sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")`

The gotcha is that you also need to specify which dataset you want (crawl, uservisits, or rankings, in lower case) after the format and size you want them in. They should be there.

Best,
Burak

----- Original Message -----
From: "Sameer Tilak"
To: user@spark.apache.org
Sent: Wednesday, August 27, 2014 11:42:28 AM
Subject: Amplab: big-data-benchmark
Amplab: big-data-benchmark
Hi All,

I am planning to run the AMPLab benchmark suite to evaluate the performance of our cluster. I looked at https://amplab.cs.berkeley.edu/benchmark/, which says the data is available at:

    s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix]

where /tiny/, /1node/ and /5nodes/ are the options for suffix. However, I am not able to download these datasets directly. I read that they can be used directly by calling sc.textFile(s3:/), but I wanted to make sure that my understanding is correct.

Here is what I see at http://s3.amazonaws.com/big-data-benchmark/. I do not see anything for sequence or text-deflate. I see the sequence-snappy dataset:

    Key: pavlo/sequence-snappy/5nodes/crawl/000738_0
    LastModified: 2013-05-27T21:26:40.000Z
    ETag: "a978d18721d5a533d38a88f558461644"
    Size: 42958735
    StorageClass: STANDARD

For text, I get the following error:

    Code: NoSuchKey
    Message: The specified key does not exist.
    Key: pavlo/text/1node/crawl
    RequestId: 166D239D38399526
    HostId: 4Bg8BHomWqJ6BXOkx/3fQZhN5Uw1TtCn01uQzm+1qYffx2s/oPV+9sGoAWV2thCI

Please let me know if there is a way to readily download the dataset and view it.
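For reference, the path pattern in this message and the dataset names given in Burak's reply (crawl, uservisits, rankings) can be combined as in the sketch below. It is only an illustration: benchmark_path is a hypothetical helper, not part of Spark or the benchmark scripts, and it builds the path string without checking that the data exists in the bucket.

```python
# Values taken from the thread: formats and sizes from the benchmark page
# pattern, dataset names from Burak's reply.
FORMATS = ("text", "text-deflate", "sequence", "sequence-snappy")
SIZES = ("tiny", "1node", "5nodes")
DATASETS = ("crawl", "uservisits", "rankings")


def benchmark_path(fmt, size, dataset, scheme="s3n"):
    """Build <scheme>://big-data-benchmark/pavlo/<format>/<size>/<dataset>.

    Hypothetical helper for illustration; it only validates the components
    against the names mentioned in the thread.
    """
    for value, allowed, label in (
        (fmt, FORMATS, "format"),
        (size, SIZES, "size"),
        (dataset, DATASETS, "dataset"),
    ):
        if value not in allowed:
            raise ValueError("unknown %s %r, expected one of %s"
                             % (label, value, ", ".join(allowed)))
    return "%s://big-data-benchmark/pavlo/%s/%s/%s" % (scheme, fmt, size, dataset)


print(benchmark_path("text", "tiny", "crawl"))
# s3n://big-data-benchmark/pavlo/text/tiny/crawl

# With an existing SparkContext named sc (assumed, not created here):
# crawl = sc.textFile(benchmark_path("text", "tiny", "crawl"))
```

The dataset name is case-sensitive, so benchmark_path("text", "tiny", "Crawl") raises ValueError rather than producing a path that would fail later with NoSuchKey.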