Re: optimising storage and ec2 instances

2017-04-11 Thread Sam Elamin
Hi Zeming Yu, Steve,

Just to add, we are also going down this partitioning route, but you should know that if you are in AWS land, you are most likely going to be using EMR at some point. At the moment EMR does not do recursive search on wildcards, see this

Re: optimising storage and ec2 instances

2017-04-11 Thread Zeming Yu
> everything works best if your sources are a few tens to hundreds of MB or more

Are you referring to the size of the zip file or individual unzipped files? Any issues with storing a 60 MB zipped file containing heaps of text files inside?

On 11 Apr. 2017 9:09 pm, "Steve Loughran"

Re: optimising storage and ec2 instances

2017-04-11 Thread Steve Loughran
> On 11 Apr 2017, at 11:07, Zeming Yu wrote:
>
> Hi all,
>
> I'm a beginner with Spark, and I'm wondering if someone could provide guidance on the following 2 questions I have.
>
> Background: I have a data set growing by 6 TB p.a. I plan to use Spark to read in all

optimising storage and ec2 instances

2017-04-11 Thread Zeming Yu
Hi all,

I'm a beginner with Spark, and I'm wondering if someone could provide guidance on the following 2 questions I have.

Background: I have a data set growing by 6 TB p.a. I plan to use Spark to read in all the data, manipulate it and build a predictive model on it (say GBM). I plan to store