Hi Tom,

Actually, I was mistaken, sorry about that. The keys for the datasets you mention are indeed not showing up on the website, but they are still accessible through the spark-shell, which means they are there.
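For instance, a minimal check from the spark-shell (only a sketch; it assumes the s3n:// filesystem is already configured with AWS credentials, and uses the same bucket path as the example further down):

    // quick sanity check that a key is really there, without downloading the whole set
    val crawlTiny = sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")
    crawlTiny.take(1).foreach(println)   // prints the first record if the key exists
    crawlTiny.count()                    // optional record count (can be slow on the larger sets)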
So, to answer your questions:
- Are the tiny and 1node sets still available? Yes, they are.
- Are the Uservisits and Rankings still available? Yes, they are.
- Why is the crawl set bigger than expected, and how big is it? The website says it is ~30 GB per node. Since you are downloading the 5nodes version, the total size should be about 150 GB (see also the size check sketched at the end of this message).

As for other ways to download them: I propose using the spark-shell, as that was the easiest way (at least for me it was :). Once you start the spark-shell, you can access the files as follows (example for the tiny crawl dataset; exchange with 1node, 5nodes and uservisits, rankings as desired, and mind the lowercase):

    val dataset = sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")
    dataset.saveAsTextFile("your/local/relative/path/here")

The files will be saved relative to where you run the spark-shell from.

Hope this helps!
Burak

----- Original Message -----
From: "Tom" <thubregt...@gmail.com>
To: u...@spark.incubator.apache.org
Sent: Wednesday, July 16, 2014 9:10:58 AM
Subject: Re: Retrieve dataset of Big Data Benchmark

Hi Burak,

Thank you for your pointer, it is really helping me out. I do have some follow-up questions, though.

After looking at the Big Data Benchmark page <https://amplab.cs.berkeley.edu/benchmark/> (section "Run this benchmark yourself"), I was expecting the following combinations of files:
Sets: Uservisits, Rankings, Crawl
Sizes: tiny, 1node, 5nodes
Both as text and as Sequence files.

When looking at http://s3.amazonaws.com/big-data-benchmark/, I only see:
sequence-snappy/5nodes/_distcp_logs_44js2v part 0 to 103
sequence-snappy/5nodes/_distcp_logs_nclxhd part 0 to 102
sequence-snappy/5nodes/_distcp_logs_vnuhym part 0 to 24
sequence-snappy/5nodes/crawl part 0 to 743

As "Crawl" is the name of a set I am looking for, I started to download it. Since it was the end of the day and I was going to download it overnight, I just wrote a for loop from 0 to 999 with wget, expecting it to download up to part 7-something and return 404 errors for the rest. When I looked at it this morning, I noticed that it had all completed downloading. The total Crawl set for 5 nodes should be ~30 GB; I am currently at part 1020 with a total set of 40 GB.

This leads to my (sub)questions. Does anybody know what exactly is still hosted:
- Are the tiny and 1node sets still available?
- Are the Uservisits and Rankings still available?
- Why is the crawl set bigger than expected, and how big is it?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Retrieve-dataset-of-Big-Data-Benchmark-tp9821p9938.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
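P.S. On the size question: instead of downloading everything with wget just to see how big the crawl set is, you could list the part files and sum their sizes from the spark-shell. This is only a rough sketch; it assumes the s3n:// filesystem is configured with AWS credentials, and that the 5nodes crawl lives under the sequence-snappy/5nodes/crawl prefix shown in the bucket listing above.

    // list the part files under the crawl prefix and add up their sizes
    import org.apache.hadoop.fs.{FileSystem, Path}
    import java.net.URI

    val crawlPath = "s3n://big-data-benchmark/sequence-snappy/5nodes/crawl"
    val fs = FileSystem.get(new URI(crawlPath), sc.hadoopConfiguration)
    val parts = fs.listStatus(new Path(crawlPath))             // one FileStatus per part file
    val totalBytes = parts.map(_.getLen).sum                   // total size in bytes
    println(s"parts: ${parts.length}, total: ~${totalBytes / (1024L * 1024 * 1024)} GB")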