> On 11 Apr 2017, at 11:07, Zeming Yu <zemin...@gmail.com> wrote:
>
> Hi all,
>
> I'm a beginner with Spark, and I'm wondering if someone could provide
> guidance on the following 2 questions I have.
>
> Background: I have a data set growing by 6 TB p.a. I plan to use Spark to
> read in all the data, manipulate it and build a predictive model on it (say
> GBM). I plan to store the data in S3, and use EMR to launch Spark, reading
> in data from S3.
>
> 1. Which option is best for storing the data on S3 for the purpose of
> analysing it in EMR Spark?
> Option A: storing the 6TB file as 173 million individual text files
> Option B: zipping up the above 173 million text files as 240,000 zip files
> Option C: appending the individual text files, so have 240,000 text files p.a.
> Option D: combining the text files even further
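For scale, it may help to do back-of-envelope arithmetic on the average object size each option implies (a quick sketch; it assumes "6 TB" means 6×10^12 bytes, decimal units throughout, and the 128 MB target in option D is an illustrative choice, not a figure from the thread):

```python
# Rough sizing for the storage options above.
# Assumptions: 6 TB = 6e12 bytes; KB/MB are decimal (1e3 / 1e6 bytes).
TOTAL_BYTES = 6 * 10**12

# Option A: 173 million individual text files
avg_a = TOTAL_BYTES // 173_000_000
print(f"Option A: ~{avg_a / 1e3:.0f} KB per object")  # tiny objects

# Option C: 240,000 combined text files
avg_c = TOTAL_BYTES // 240_000
print(f"Option C: ~{avg_c / 1e6:.0f} MB per object")

# Option D: combine further, targeting e.g. 128 MB objects
# (128 MB is a hypothetical target for illustration)
target = 128 * 10**6
n_files_d = TOTAL_BYTES // target
print(f"Option D: ~{n_files_d:,} files at 128 MB each")
```

Option A works out to roughly 35 KB per object, which is deep in "small files" territory; options C and D land in the tens-to-hundreds-of-MB range recommended below.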
Everything works best if your sources are a few tens to hundreds of MB or more, so work can be partitioned up by file. If you use more structured formats (Avro compressed with Snappy, etc.), you can throw > 1 executor at work inside a file. Structure is handy all round, even if it's just adding timestamp and provenance columns to each data file.

There's the HAR file format from Hadoop, which can merge lots of small files into larger ones, allowing work to be scheduled per HAR file. It's recommended for HDFS, as HDFS hates small files; on S3 you still have limits on small files (including throttling of HTTP requests to shards of a bucket), but they are less significant.

One thing to be aware of is that the S3 clients Spark uses are very inefficient at listing wide directory trees, and Spark is not always the best at partitioning work because of this. You can accidentally create a very inefficient tree structure like datasets/year=2017/month=5/day=10/hour=12/, with only one file per hour. Listing and partitioning suffer here, and while s3a on Hadoop 2.8 is better, Spark hasn't yet fully adapted to those changes (use of specific API calls).

There's also a lot more to be done in S3A to handle wildcards in the directory tree much more efficiently (HADOOP-13204); it needs to address patterns like datasets/year=201?/month=*/day=10 without treewalking and without fetching too much data from wildcards near the top of the tree. We need to avoid implementing something which works well on *my* layouts but absolutely dies on other people's. As is usual in OSS, help is welcome; early testing here is as critical as coding, so as to ensure things will work with your file structures.

-Steve

> 2. Any recommendations on the EMR set up to analyse the 6TB of data all at
> once and build a GBM, in terms of
> 1) The type of EC2 instances I need?
> 2) The number of such instances I need?
> 3) Rough estimate of cost?
no opinion there

> Thanks so much,
> Zeming

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org