Hi Tom,

Actually, I was mistaken, sorry about that. The keys for the datasets you mention are indeed not showing up on the website, but they are still accessible through the spark-shell, which means they are there.
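For instance, a minimal check from the spark-shell (only a sketch; it assumes the s3n:// filesystem is already configured with AWS credentials, and uses the same bucket path as the example further down):

    // quick sanity check that a key is really there, without downloading the whole set
    val crawlTiny = sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")
    crawlTiny.take(1).foreach(println)   // prints the first record if the key exists
    crawlTiny.count()                    // optional record count (can be slow on the larger sets)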
So, to answer your questions:
- Are the tiny and 1node sets still available? Yes, they are.
- Are the Uservisits and Rankings still available? Yes, they are.
- Why is the crawl set bigger than expected, and how big is it? The website says it is ~30 GB per node. Since you are downloading the 5nodes version, the total size should be about 150 GB (see also the size check sketched at the end of this message).

As for other ways to download them: I propose using the spark-shell, as that was the easiest way (at least for me it was :). Once you start the spark-shell, you can access the files as follows (example for the tiny crawl dataset; exchange with 1node, 5nodes and uservisits, rankings as desired, and mind the lowercase):

    val dataset = sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")
    dataset.saveAsTextFile("your/local/relative/path/here")

The files will be saved relative to where you run the spark-shell from.

Hope this helps!
Burak

----- Original Message -----
From: "Tom" <thubregt...@gmail.com>
To: u...@spark.incubator.apache.org
Sent: Wednesday, July 16, 2014 9:10:58 AM
Subject: Re: Retrieve dataset of Big Data Benchmark

Hi Burak,

Thank you for your pointer, it is really helping me out. I do have some follow-up questions, though.

After looking at the Big Data Benchmark page <https://amplab.cs.berkeley.edu/benchmark/> (section "Run this benchmark yourself"), I was expecting the following combinations of files:
Sets: Uservisits, Rankings, Crawl
Sizes: tiny, 1node, 5nodes
Both as text and as Sequence files.

When looking at http://s3.amazonaws.com/big-data-benchmark/, I only see:
sequence-snappy/5nodes/_distcp_logs_44js2v part 0 to 103
sequence-snappy/5nodes/_distcp_logs_nclxhd part 0 to 102
sequence-snappy/5nodes/_distcp_logs_vnuhym part 0 to 24
sequence-snappy/5nodes/crawl part 0 to 743

As "Crawl" is the name of a set I am looking for, I started to download it. Since it was the end of the day and I was going to download it overnight, I just wrote a for loop from 0 to 999 with wget, expecting it to download up to part 7-something and return 404 errors for the rest. When I looked at it this morning, I noticed that it had all completed downloading. The total Crawl set for 5 nodes should be ~30 GB; I am currently at part 1020 with a total set of 40 GB.

This leads to my (sub)questions. Does anybody know what exactly is still hosted:
- Are the tiny and 1node sets still available?
- Are the Uservisits and Rankings still available?
- Why is the crawl set bigger than expected, and how big is it?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Retrieve-dataset-of-Big-Data-Benchmark-tp9821p9938.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
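P.S. On the size question: instead of downloading everything with wget just to see how big the crawl set is, you could list the part files and sum their sizes from the spark-shell. This is only a rough sketch; it assumes the s3n:// filesystem is configured with AWS credentials, and that the 5nodes crawl lives under the sequence-snappy/5nodes/crawl prefix shown in the bucket listing above.

    // list the part files under the crawl prefix and add up their sizes
    import org.apache.hadoop.fs.{FileSystem, Path}
    import java.net.URI

    val crawlPath = "s3n://big-data-benchmark/sequence-snappy/5nodes/crawl"
    val fs = FileSystem.get(new URI(crawlPath), sc.hadoopConfiguration)
    val parts = fs.listStatus(new Path(crawlPath))             // one FileStatus per part file
    val totalBytes = parts.map(_.getLen).sum                   // total size in bytes
    println(s"parts: ${parts.length}, total: ~${totalBytes / (1024L * 1024 * 1024)} GB")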