> On 11 Apr 2017, at 11:07, Zeming Yu <zemin...@gmail.com> wrote:
> 
> Hi all,
> 
> I'm a beginner with spark, and I'm wondering if someone could provide 
> guidance on the following 2 questions I have.
> 
> Background: I have a data set growing by 6 TB p.a. I plan to use spark to 
> read in all the data, manipulate it and build a predictive model on it (say 
> GBM). I plan to store the data in S3 and use EMR to launch Spark, reading in 
> data from S3.
> 
> 1. Which option is best for storing the data on S3 for the purpose of 
> analysing it in EMR spark?
> Option A: storing the 6TB file as 173 million individual text files
> Option B: zipping up the above 173 million text files as 240,000 zip files
> Option C: appending the individual text files, so have 240,000 text files p.a.
> Option D: combining the text files even further
> 

Everything works best if your source files are a few tens to hundreds of MB or 
more; at that size, work can be partitioned up by file. If you use more 
structured formats (Avro compressed with Snappy, etc.), you can throw more than 
one executor at the work inside a single file. Structure is handy all round, 
even if it's just adding timestamp and provenance columns to each data file.
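
Purely as an illustrative sketch of what that structure can look like (bucket 
names, paths and column names are all made up here, and Parquet stands in for 
the Avro+Snappy option since both are splittable):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{current_timestamp, input_file_name, lit}

  object IngestWithProvenance {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("ingest-with-provenance")
        .getOrCreate()

      // read the raw text files (bucket/prefix are hypothetical)
      val raw = spark.read.text("s3a://my-bucket/raw/2017/")

      // timestamp + provenance columns: when it was ingested, where it came from
      val enriched = raw
        .withColumn("ingest_ts", current_timestamp())
        .withColumn("source_file", input_file_name())
        .withColumn("batch_id", lit("2017-04-11"))

      // snappy-compressed Parquet is splittable, so more than one executor
      // can work on the blocks inside a single large file
      enriched.write
        .option("compression", "snappy")
        .parquet("s3a://my-bucket/structured/2017/")

      spark.stop()
    }
  }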

There's the HAR file format from Hadoop, which can merge lots of small files 
into larger ones, allowing work to be scheduled per HAR file. It's recommended 
for HDFS, which hates small files; on S3 you still pay a penalty for lots of 
small files (including throttling of HTTP requests to shards of a bucket), but 
it's less significant.
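
If you'd rather avoid the HAR tooling, Spark itself can do the compaction: read 
the small files and coalesce before writing, so each output file ends up in the 
tens-to-hundreds-of-MB range. A minimal sketch, again with made-up paths and a 
purely illustrative partition count:

  import org.apache.spark.sql.SparkSession

  object CompactSmallFiles {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("compact-small-files")
        .getOrCreate()

      // read the many small text files under one prefix
      val raw = spark.read.text("s3a://my-bucket/raw/2017/")

      // coalesce() cuts the number of output partitions, and hence output
      // files; pick the count so each file lands around 100-500 MB
      raw.coalesce(400)
        .write
        .text("s3a://my-bucket/compacted/2017/")

      spark.stop()
    }
  }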

One thing to be aware of is that the S3 clients Spark uses are very inefficient 
at listing wide directory trees, and Spark is not always the best at 
partitioning work because of this. You can accidentally create a very 
inefficient tree structure like datasets/year=2017/month=5/day=10/hour=12/, 
with only one file per hour. Listing and partitioning suffer here; s3a on 
Hadoop 2.8 is better, but Spark hasn't yet fully adapted to those changes (use 
of specific API calls).

There's also a lot more to be done in S3A to handle wildcards in the directory 
tree much more efficiently (HADOOP-13204); it needs to address patterns like 
datasets/year=201?/month=*/day=10 without treewalking and without fetching too 
much data from wildcards near the top of the tree. We need to avoid 
implementing something which works well on *my* layouts but absolutely dies on 
other people's. As is usual in OSS, help is welcome; early testing here is as 
critical as coding, so as to ensure things will work with your file structures.
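
To make the layout point concrete: the sketch below writes a day-level (rather 
than hour-level) partition tree and reads one day back with a filter on the 
partition columns instead of a wildcard glob. The partition columns are derived 
from the hypothetical ingest_ts column in the earlier sketch; use whatever date 
column your data really has.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{col, dayofmonth, month, year}

  object DayLevelPartitions {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("day-level-partitions")
        .getOrCreate()

      val df = spark.read.parquet("s3a://my-bucket/structured/2017/")

      // partition by day rather than hour: far fewer directories to list,
      // and each leaf directory holds usefully large files
      df.withColumn("year", year(col("ingest_ts")))
        .withColumn("month", month(col("ingest_ts")))
        .withColumn("day", dayofmonth(col("ingest_ts")))
        .write
        .partitionBy("year", "month", "day")
        .parquet("s3a://my-bucket/partitioned/")

      // filtering on the partition columns lets Spark prune partitions,
      // so only the matching day's files are actually read
      val oneDay = spark.read.parquet("s3a://my-bucket/partitioned/")
        .filter(col("year") === 2017 && col("month") === 4 && col("day") === 10)
      println(oneDay.count())

      spark.stop()
    }
  }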

-Steve


> 2. Any recommendations on the EMR set up to analyse the 6TB of data all at 
> once and build a GBM, in terms of
> 1) The type of EC2 instances I need?
> 2) The number of such instances I need?
> 3) Rough estimate of cost?
> 

no opinion there

> 
> Thanks so much,
> Zeming
> 

