Not all of our input files are zipped. The ones that are, obviously, aren't parallelized - each one is just processed by a single task. Not a big issue for us, though, as those zipped files aren't too big.
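To illustrate (Scala; the bucket and paths below are made up, and the s3n:// scheme is whatever your Hadoop S3 filesystem is configured as) - gzip is not a splittable codec, so each .gz file comes in as one partition read by one task, and parallelism comes from the number of files; you can repartition after the read if the downstream work needs more:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("gz-read-sketch"))

// One .gz file -> one partition -> one read task.
val oneFile = sc.textFile("s3n://some-bucket/input/part-0000.gz")
println(oneFile.partitions.length)   // 1

// Many .gz files -> one task per file, spread across the cluster.
val manyFiles = sc.textFile("s3n://some-bucket/input/*.gz")

// If the downstream work is heavy, shuffle into more partitions after the read.
val spread = manyFiles.repartition(200)
println(spread.count())

sc.stop()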

DR

On 02/03/2015 01:08 PM, Joe Wass wrote:
Thanks very much, that's good to know, I'll certainly give it a look.

Can you give me a hint about how you unzip your input files on the fly? I
thought it wasn't possible to parallelize zipped inputs unless they were
unzipped before being passed to Spark?

Joe

On 3 February 2015 at 17:48, David Rosenstrauch <dar...@darose.net> wrote:

We use S3 as the main storage for all our input data and our generated
(output) data (tens of terabytes of data daily).  We read gzipped data
directly from S3 in our Hadoop/Spark jobs - it's not crazily slow, as long
as you parallelize the work well by distributing the processing across
enough machines.  (About 100 nodes, in our case.)

The way we generally operate re: storage is: read input directly from
S3, write output from the Hadoop/Spark jobs to HDFS, then after the job is
complete distcp the relevant output from HDFS back to S3.  Works for us ... YMMV.
:-)
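To sketch that flow in code (Scala; the bucket, paths, HDFS location and job logic are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3-in-hdfs-out-sketch"))

// Read gzipped input directly from S3 (one task per .gz file).
val input = sc.textFile("s3n://some-bucket/input/2015-02-03/*.gz")

// ... whatever the job actually computes; this is just a placeholder ...
val output = input.filter(_.nonEmpty)

// Write the job output to the cluster's HDFS first.
output.saveAsTextFile("hdfs:///jobs/my-job/output")

sc.stop()

// Then, once the job is done, copy the output back to S3 from a shell on
// the cluster, e.g.:
//   hadoop distcp hdfs:///jobs/my-job/output s3n://some-bucket/output/2015-02-03/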

HTH,

DR


On 02/03/2015 12:32 PM, Joe Wass wrote:

The data is coming from S3 in the first place, and the results will be
uploaded back there. But even in the same availability zone, fetching 170
GB (that's gzipped) is slow. From what I understand of the pipelines,
multiple transforms on the same RDD might involve re-reading the input,
which would very quickly add up compared to having the data locally - unless
I persisted the data (which I am in fact doing), but that would involve
storing approximately the same amount of data in HDFS, which wouldn't fit.
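(For concreteness, the kind of persisting I mean is something like the following - Scala, with a made-up path, sc an existing SparkContext, and MEMORY_AND_DISK_SER just one possible storage level: it keeps the RDD serialized in memory and spills to local disk when it doesn't fit.)

import org.apache.spark.storage.StorageLevel

// Cache the input after the first read so later actions don't go back to S3.
val lines = sc.textFile("s3n://some-bucket/input/*.gz")
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

val total = lines.count()                        // first action reads from S3 and fills the cache
val nonEmpty = lines.filter(_.nonEmpty).count()  // later actions reuse the cached partitions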

Also, I understood that S3 was unsuitable as a practical replacement for
HDFS? See "Why you cannot use S3 as a replacement for HDFS"[0]. I'd love to
be proved wrong, though; that would make things a lot easier.

[0] http://wiki.apache.org/hadoop/AmazonS3



On 3 February 2015 at 16:45, David Rosenstrauch <dar...@darose.net>
wrote:

You could also just push the data to Amazon S3, which would decouple the
size of the cluster needed to process the data from the size of the data.

DR


On 02/03/2015 11:43 AM, Joe Wass wrote:

I want to process about 800 GB of data on an Amazon EC2 cluster, so I need
to store the input in HDFS somehow.

I currently have a cluster of 5 x m3.xlarge, each of which has an 80 GB disk.
Each HDFS node reports 73 GB, and the total capacity is ~370 GB.

If I want to process 800 GB of data (assuming I can't split the jobs up),
I'm guessing I need to get persistent-hdfs involved.
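(Back-of-the-envelope: 5 x 73 GB ≈ 365 GB of raw HDFS capacity, while 800 GB
at the default HDFS replication factor of 3 would need ~2.4 TB raw - and even
at replication 1 it wouldn't fit.)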

1 - Does persistent-hdfs have noticeably different performance from
ephemeral-hdfs?
2 - If so, is there a recommended configuration (like storing input and
output on persistent-hdfs, but persisted RDDs on ephemeral-hdfs)?

This seems like a common use-case, so sorry if this has already been
covered.

Joe

