Hi Burak,

Thank you for the pointer; it is really helping. I do have some
follow-up questions, though.

After looking at the Big Data Benchmark page
<https://amplab.cs.berkeley.edu/benchmark/> (section "Run this benchmark
yourself"), I was expecting the following combination of files:
Sets: Uservisits, Rankings, Crawl
Sizes: tiny, 1node, 5nodes
Formats: both text and SequenceFile

When looking at http://s3.amazonaws.com/big-data-benchmark/, I only see:
sequence-snappy/5nodes/_distcp_logs_44js2v part 0 to 103
sequence-snappy/5nodes/_distcp_logs_nclxhd part 0 to 102
sequence-snappy/5nodes/_distcp_logs_vnuhym part 0 to 24
sequence-snappy/5nodes/crawl part 0 to 743

As "Crawl" is the name of one of the sets I am looking for, I started to
download it. Since it was the end of the day and the download would run
overnight, I just wrote a for loop from 0 to 999 with wget, expecting it to
succeed up to part 7-something and to return 404 errors for the rest. When I
checked this morning, I noticed that every download had completed. The total
Crawl set for 5 nodes should be ~30 GB, but I am currently at part 1020 with
a total of 40 GB.
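
For reference, here is a minimal Python sketch of that kind of download loop, changed so that it stops at the first missing part instead of running to a fixed upper bound. The "part-00000" file naming is an assumption (the usual Hadoop convention), and part_url, download_parts and http_fetch are names I made up for illustration; adjust them to whatever the real keys look like.

```python
import os
import shutil
import urllib.error
import urllib.request
from typing import Callable, List, Optional

# Base path taken from the bucket listing above.
CRAWL_BASE = "http://s3.amazonaws.com/big-data-benchmark/sequence-snappy/5nodes/crawl"


def part_url(base: str, index: int) -> str:
    # ASSUMPTION: Hadoop-style part names ("part-00000"); not confirmed for this bucket.
    return f"{base}/part-{index:05d}"


def http_fetch(url: str, dest_dir: str = ".") -> Optional[int]:
    """Save url into dest_dir; return the bytes written, or None on HTTP 404."""
    target = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
    try:
        with urllib.request.urlopen(url) as resp, open(target, "wb") as out:
            shutil.copyfileobj(resp, out)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise
    return os.path.getsize(target)


def download_parts(base: str,
                   fetch: Callable[[str], Optional[int]] = http_fetch,
                   max_parts: int = 10000) -> List[int]:
    """Fetch part 0, 1, 2, ... until the first missing part; return the sizes."""
    sizes: List[int] = []
    for index in range(max_parts):
        size = fetch(part_url(base, index))
        if size is None:  # first missing part: stop here
            break
        sizes.append(size)
    return sizes


# Example (performs real downloads, so it is left commented out):
# sizes = download_parts(CRAWL_BASE)
# print(len(sizes), "parts,", sum(sizes) / 1e9, "GB")
```

Passing the fetch function in as a parameter makes it possible to dry-run the loop with a fake fetcher before starting a download of tens of gigabytes.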

This leads to my (sub)questions.
Does anybody know what exactly is still hosted?
- Are the tiny and 1node sets still available?
- Are the Uservisits and Rankings sets still available?
- Why is the Crawl set bigger than expected, and how big is it?
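
One thing I may be running into: S3 bucket listings return at most 1,000 keys per response, so the page I looked at could simply be truncated. Below is a rough Python sketch, under that assumption, that pages through the listing with the "marker" parameter and totals file counts and bytes per prefix. The function names (parse_listing, list_bucket, summarize, http_get_text) are made up for illustration, and I have not run this against the actual bucket.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# XML namespace used by S3 ListBucketResult documents.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def parse_listing(xml_text: str) -> Tuple[List[Tuple[str, int]], bool]:
    """Parse one listing page into ([(key, size), ...], is_truncated)."""
    root = ET.fromstring(xml_text)
    entries = [(c.findtext(S3_NS + "Key"), int(c.findtext(S3_NS + "Size")))
               for c in root.iter(S3_NS + "Contents")]
    truncated = root.findtext(S3_NS + "IsTruncated") == "true"
    return entries, truncated


def http_get_text(url: str) -> str:
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def list_bucket(bucket_url: str,
                fetch_text: Callable[[str], str] = http_get_text
                ) -> Iterator[Tuple[str, int]]:
    """Yield (key, size) for every object, following truncated pages."""
    marker = ""
    while True:
        url = bucket_url
        if marker:
            # Continue after the last key seen on the previous page.
            url += "?" + urllib.parse.urlencode({"marker": marker})
        entries, truncated = parse_listing(fetch_text(url))
        yield from entries
        if not truncated or not entries:
            break
        marker = entries[-1][0]


def summarize(entries: Iterable[Tuple[str, int]],
              depth: int = 3) -> Dict[str, Tuple[int, int]]:
    """Group keys by their first `depth` path components: (file count, bytes)."""
    totals = defaultdict(lambda: [0, 0])
    for key, size in entries:
        prefix = "/".join(key.split("/")[:depth])
        totals[prefix][0] += 1
        totals[prefix][1] += size
    return {prefix: (count, size) for prefix, (count, size) in totals.items()}


# Example (hits the network, so it is left commented out):
# for prefix, (count, size) in sorted(
#         summarize(list_bucket("http://s3.amazonaws.com/big-data-benchmark/")).items()):
#     print(prefix, count, "files,", round(size / 1e9, 2), "GB")
```

If the listing really is paginated, the per-prefix totals should show whether the other sets and sizes are there and how large the Crawl set actually is.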
