AmpLab Big Data Benchmark for Spark error on EC2

2016-02-11 Thread cheez
I am trying to run the Big Data Benchmark
<https://amplab.cs.berkeley.edu/benchmark/> on my EC2 cluster with my own
fork of Spark 1.5, which only modifies a few files in Spark core. My
cluster contains 1 master and 2 slave nodes of type m1.large. I used the ec2
scripts bundled with Spark to launch the cluster. The cluster launched
fine and I am able to ssh into the master successfully. However, when I
try to run the benchmarks from the master using the command

./runner/prepare-benchmark.sh --shark --aws-key-id=
--aws-key= --shark-host=
--shark-identity-file=/root/.ssh/id_rsa --scale-factor=1

I get the following error:

=== IMPORTING BENCHMARK DATA FROM S3 ===
bash: /root/ephemeral-hdfs/bin/hdfs: No such file or directory
Connection to ec2-54-201-169-165.us-west-2.compute.amazonaws.com closed.
bash: /root/mapreduce/bin/start-mapred.sh: No such file or directory
Connection to ec2-54-201-169-165.us-west-2.compute.amazonaws.com closed.
Traceback (most recent call last):
  File "./prepare_benchmark.py", line 606, in 
main()
  File "./prepare_benchmark.py", line 594, in main
prepare_shark_dataset(opts)
  File "./prepare_benchmark.py", line 192, in prepare_shark_dataset
ssh_shark("/root/mapreduce/bin/start-mapred.sh")
  File "./prepare_benchmark.py", line 180, in ssh_shark
ssh(opts.shark_host, "root", opts.shark_identity_file, command)
  File "./prepare_benchmark.py", line 139, in ssh
(identity_file, username, host, command), shell=True)
  File "/usr/lib64/python2.6/subprocess.py", line 505, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -t -o StrictHostKeyChecking=no
-i /root/.ssh/id_rsa
r...@ec2-54-201-169-165.us-west-2.compute.amazonaws.com 'source
/root/.bash_profile; 
/root/mapreduce/bin/start-mapred.sh'' returned non-zero exit status 127

I have tried terminating the cluster and launching it again multiple times,
but the problem persists. What could be the issue?
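
For reference, exit status 127 from the remote shell means the command was not
found, i.e. /root/mapreduce/bin/start-mapred.sh is missing on that host. Below is a
minimal sketch of checking whether the expected paths exist there, reusing the host
and identity file from the traceback above (the object name is illustrative and
this is not part of the benchmark scripts):

import scala.sys.process._

object CheckBenchmarkPaths {
  def main(args: Array[String]): Unit = {
    // Host and key taken from the traceback above; ssh options mirror prepare_benchmark.py.
    val host = "ec2-54-201-169-165.us-west-2.compute.amazonaws.com"
    val identity = "/root/.ssh/id_rsa"
    val paths = Seq("/root/ephemeral-hdfs/bin/hdfs", "/root/mapreduce/bin/start-mapred.sh")
    for (p <- paths) {
      // `test -e` exits 0 when the remote path exists, non-zero otherwise.
      val status = Seq("ssh", "-o", "StrictHostKeyChecking=no", "-i", identity,
        s"root@$host", s"test -e $p").!
      println(s"$p -> " + (if (status == 0) "present" else "missing"))
    }
  }
}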






RE: Amplab: big-data-benchmark

2014-08-27 Thread Sameer Tilak
Hi Burak,
Thanks, I will then start benchmarking the cluster.


Re: Amplab: big-data-benchmark

2014-08-27 Thread Burak Yavuz
Hi Sameer,

I've faced this issue before. The datasets don't show up when you browse
http://s3.amazonaws.com/big-data-benchmark/, but you can read them directly, e.g.:
`sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")`
The gotcha is that you also need to append which dataset you want (crawl,
uservisits, or rankings, in lower case) after the format and size you chose.
They should be there.
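
For example, a minimal sketch (assuming an existing SparkContext `sc`, the Hadoop
s3n connector on the classpath, and AWS credentials supplied through the standard
fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey configuration keys; the
placeholder values below are not from this thread):

// Hedged sketch: credential placeholders and the chosen dataset are illustrative.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<YOUR_ACCESS_KEY_ID>")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<YOUR_SECRET_KEY>")
val rankings = sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/rankings")
println(rankings.count())            // sanity check that the path resolves
rankings.take(5).foreach(println)    // peek at a few raw records

The same pattern applies to the crawl and uservisits datasets and to the other
formats and sizes.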

Best,
Burak



Amplab: big-data-benchmark

2014-08-27 Thread Sameer Tilak
Hi All,
I am planning to run the AMPLab benchmark suite to evaluate the performance of our
cluster. I looked at https://amplab.cs.berkeley.edu/benchmark/, which says the data
is available at:
s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix]
where /tiny/, /1node/ and /5nodes/ are the options for suffix. However, I am not
able to download these datasets directly. I read that they can be used directly via
sc.textFile(s3:/), but I wanted to make sure that my understanding is correct.
Here is what I see at http://s3.amazonaws.com/big-data-benchmark/:
I do not see anything for sequence or text-deflate.
I see sequence-snappy dataset:
pavlo/sequence-snappy/5nodes/crawl/000738_0
(LastModified: 2013-05-27T21:26:40.000Z, ETag: "a978d18721d5a533d38a88f558461644", Size: 42958735, StorageClass: STANDARD)
For text, I get the following error:
NoSuchKey: The specified key does not exist.
(Key: pavlo/text/1node/crawl, RequestId: 166D239D38399526, HostId: 4Bg8BHomWqJ6BXOkx/3fQZhN5Uw1TtCn01uQzm+1qYffx2s/oPV+9sGoAWV2thCI)

Please let me know if there is a way to readily download the dataset and view 
it.
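
To sanity-check what actually exists under a given prefix without running a Spark
job, here is a minimal sketch of listing the public bucket over plain HTTP
(assuming anonymous read access and the standard S3 `prefix` listing parameter;
the object name is illustrative):

import scala.io.Source

object ListBenchmarkKeys {
  def main(args: Array[String]): Unit = {
    // A single GET on the bucket lists at most 1000 keys, which is why the bare
    // bucket URL does not show everything; a ?prefix= query narrows the listing.
    val url = "http://s3.amazonaws.com/big-data-benchmark/?prefix=pavlo/text/tiny/"
    val listing = Source.fromURL(url).mkString
    // Crude extraction of the <Key> elements from the XML listing.
    "<Key>([^<]+)</Key>".r.findAllMatchIn(listing).map(_.group(1)).foreach(println)
  }
}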