Thanks very much, that's good to know, I'll certainly give it a look.

Can you give me a hint about how you unzip your input files on the fly? I
thought it wasn't possible to parallelize zipped inputs unless they were
unzipped before passing them to Spark?
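
For reference, my current - quite possibly wrong - understanding is roughly
the following (bucket and paths made up):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("gzip-from-s3"))

    // A single large .gz file isn't splittable, so it ends up as one
    // partition on one worker...
    val one = sc.textFile("s3n://my-bucket/input/big-dump.gz")

    // ...whereas many smaller .gz files are decompressed transparently and
    // read in parallel, one (non-splittable) partition per file.
    val many = sc.textFile("s3n://my-bucket/input/part-*.gz")

Is the trick just to have lots of input files, or is there more to it?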

Joe

On 3 February 2015 at 17:48, David Rosenstrauch <dar...@darose.net> wrote:

> We use S3 as the main storage for all our input data and our generated
> (output) data.  (10's of terabytes of data daily.)  We read gzipped data
> directly from S3 in our Hadoop/Spark jobs - it's not crazily slow, as long
> as you parallelize the work well by distributing the processing across
> enough machines.  (About 100 nodes, in our case.)
>
> The way we generally operate re: storage is: read input directly from
> S3, write output from Hadoop/Spark jobs to HDFS, then, after the job is
> complete, distcp the relevant output from HDFS back to S3.  Works for us
> ... YMMV. :-)
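>
> In sketch form it's something like this (bucket, paths and app name are
> made up - just the shape of it, not our actual code):
>
>     import org.apache.spark.{SparkConf, SparkContext}
>
>     val sc = new SparkContext(new SparkConf().setAppName("s3-in-hdfs-out"))
>
>     // Read gzipped input straight from S3.  Each .gz file becomes one
>     // partition, so the parallelism comes from the number of input files.
>     val input = sc.textFile("s3n://my-bucket/input/2015-02-03/*.gz")
>
>     // ... whatever the job actually does ...
>     val output = input.filter(_.nonEmpty)
>
>     // Write the job output to HDFS on the cluster rather than to S3.
>     output.saveAsTextFile("hdfs:///output/2015-02-03")
>
> and then, once the job completes, something along the lines of
>
>     hadoop distcp hdfs:///output/2015-02-03 s3n://my-bucket/output/2015-02-03
>
> copies the results back up to S3.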
>
> HTH,
>
> DR
>
>
> On 02/03/2015 12:32 PM, Joe Wass wrote:
>
>> The data is coming from S3 in the first place, and the results will be
>> uploaded back there. But even in the same availability zone, fetching 170
>> GB (that's gzipped) is slow. From what I understand of the pipelines,
>> multiple transforms on the same RDD might involve re-reading the input,
>> which very quickly adds up compared to having the data locally. I could
>> persist the data (which I am in fact doing), but that would involve
>> storing approximately the same amount of data in HDFS, which wouldn't fit.
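>>
>> In other words, my (possibly mistaken) understanding is that without an
>> explicit persist, something like the following goes back to S3 for every
>> action (simplified, paths made up, sc as in the spark-shell):
>>
>>     import org.apache.spark.storage.StorageLevel
>>
>>     val input = sc.textFile("s3n://my-bucket/input/*.gz")
>>       .persist(StorageLevel.MEMORY_AND_DISK)
>>
>>     // With the persist in place these actions reuse the cached
>>     // partitions; without it, each one would re-read the input from S3.
>>     val total  = input.count()
>>     val errors = input.filter(_.contains("ERROR")).count()
>>
>> but those cached partitions still need roughly a dataset's worth of space
>> on the cluster, which is where it stops fitting.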
>>
>> Also, I understood that S3 was unsuitable for practical use? See "Why you
>> cannot use S3 as a replacement for HDFS"[0]. I'd love to be proved wrong,
>> though; that would make things a lot easier.
>>
>> [0] http://wiki.apache.org/hadoop/AmazonS3
>>
>>
>>
>> On 3 February 2015 at 16:45, David Rosenstrauch <dar...@darose.net>
>> wrote:
>>
>>> You could also just push the data to Amazon S3, which would un-link the
>>> size of the cluster needed to process the data from the size of the data.
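>>>
>>> That is, read from and write to S3 directly, so the cluster only needs
>>> enough local disk for scratch space rather than for the whole dataset.
>>> A minimal sketch (bucket made up, sc as in the spark-shell, and with AWS
>>> credentials already configured for the s3n:// filesystem):
>>>
>>>     val input = sc.textFile("s3n://my-bucket/input/*.gz")
>>>     input.map(_.toUpperCase).saveAsTextFile("s3n://my-bucket/output/run-1")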
>>>
>>> DR
>>>
>>>
>>> On 02/03/2015 11:43 AM, Joe Wass wrote:
>>>
>>>> I want to process about 800 GB of data on an Amazon EC2 cluster. So, I
>>>> need to store the input in HDFS somehow.
>>>>
>>>> I currently have a cluster of 5 x m3.xlarge, each of which has 80 GB of
>>>> disk. Each HDFS node reports 73 GB, and the total capacity is ~370 GB.
>>>>
>>>> If I want to process 800 GB of data (assuming I can't split the jobs up),
>>>> I'm guessing I need to get persistent-hdfs involved.
>>>>
>>>> 1 - Does persistent-hdfs have noticeably different performance than
>>>> ephemeral-hdfs?
>>>> 2 - If so, is there a recommended configuration (like storing input and
>>>> output on persistent, but persisted RDDs on ephemeral)?
>>>>
>>>> This seems like a common use-case, so sorry if this has already been
>>>> covered.
>>>>
>>>> Joe
>>>>
>>>>
>>>>
