Thanks very much, that's good to know; I'll certainly give it a look. Can you give me a hint about how you unzip your input files on the fly? I thought it wasn't possible to parallelize zipped inputs unless they were unzipped before being passed to Spark?
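(The splittability point is worth making concrete: a gzip stream can only be decompressed sequentially from the start, so Spark treats each .gz file as a single, unsplittable task and decompresses it on the fly; parallelism comes from having many gzipped files. A minimal standalone Python sketch of that per-file model - the file names and record counts here are invented for illustration:)

```python
# Gzip is not a splittable format: a .gz stream must be decompressed
# sequentially from the start, so one gzipped file can only ever be one
# task. Parallelism comes from having many gzipped files - the cluster
# hands each whole file to a separate worker. Standalone sketch:
import gzip
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def count_lines(path):
    """Decompress one gzip file on the fly and count its lines."""
    with gzip.open(path, "rt") as f:
        return sum(1 for _ in f)

def main():
    tmpdir = tempfile.mkdtemp()
    paths = []
    # Write a few small gzipped "input part" files.
    for i in range(4):
        path = os.path.join(tmpdir, "part-%04d.gz" % i)
        with gzip.open(path, "wt") as f:
            for n in range(100):
                f.write("record %d\n" % n)
        paths.append(path)

    # Each whole file is an independent unit of work, analogous to one
    # Spark task per unsplittable .gz input file.
    with ThreadPoolExecutor(max_workers=4) as pool:
        counts = list(pool.map(count_lines, paths))
    return counts

if __name__ == "__main__":
    print(main())  # [100, 100, 100, 100]
```

(So 100 gzipped part-files spread over 100 nodes parallelize fine, even though each individual file is processed by one task.)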
Joe

On 3 February 2015 at 17:48, David Rosenstrauch <dar...@darose.net> wrote:

> We use S3 as the main storage for all our input data and our generated
> (output) data. (Tens of terabytes of data daily.) We read gzipped data
> directly from S3 in our Hadoop/Spark jobs - it's not crazily slow, as
> long as you parallelize the work well by distributing the processing
> across enough machines. (About 100 nodes, in our case.)
>
> The way we generally operate re: storage is: read input directly from
> S3, write output from Hadoop/Spark jobs to HDFS, then after the job is
> complete distcp the relevant output from HDFS back to S3. Works for
> us ... YMMV. :-)
>
> HTH,
>
> DR
>
> On 02/03/2015 12:32 PM, Joe Wass wrote:
>
>> The data is coming from S3 in the first place, and the results will be
>> uploaded back there. But even in the same availability zone, fetching
>> 170 GB (that's gzipped) is slow. From what I understand of the
>> pipelines, multiple transforms on the same RDD might involve re-reading
>> the input, which very quickly adds up in comparison to having the data
>> locally. Unless I persisted the data (which I am in fact doing) - but
>> that would involve storing approximately the same amount of data in
>> HDFS, which wouldn't fit.
>>
>> Also, I understood that S3 was unsuitable for practical use? See "Why
>> you cannot use S3 as a replacement for HDFS"[0]. I'd love to be proved
>> wrong, though; that would make things a lot easier.
>>
>> [0] http://wiki.apache.org/hadoop/AmazonS3
>>
>> On 3 February 2015 at 16:45, David Rosenstrauch <dar...@darose.net>
>> wrote:
>>
>>> You could also just push the data to Amazon S3, which would un-link
>>> the size of the cluster needed to process the data from the size of
>>> the data.
>>>
>>> DR
>>>
>>> On 02/03/2015 11:43 AM, Joe Wass wrote:
>>>
>>>> I want to process about 800 GB of data on an Amazon EC2 cluster. So,
>>>> I need to store the input in HDFS somehow.
>>>>
>>>> I currently have a cluster of 5 x m3.xlarge, each of which has 80 GB
>>>> of disk. Each HDFS node reports 73 GB, and the total capacity is
>>>> ~370 GB.
>>>>
>>>> If I want to process 800 GB of data (assuming I can't split the jobs
>>>> up), I'm guessing I need to get persistent-hdfs involved.
>>>>
>>>> 1 - Does persistent-hdfs have noticeably different performance than
>>>> ephemeral-hdfs?
>>>> 2 - If so, is there a recommended configuration (like storing input
>>>> and output on persistent, but persisted RDDs on ephemeral)?
>>>>
>>>> This seems like a common use-case, so sorry if this has already been
>>>> covered.
>>>>
>>>> Joe
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org