Hi, if I have a 10 GB file on S3 and set 10 partitions, would the whole file be downloaded to the master first and broadcast, or would each worker just read its own range of the file?
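
For concreteness, the call I have in mind is something like this (bucket and file names are made up):

    val rdd = sc.textFile("s3n://my-bucket/big-file.txt", 10)  // ask for at least 10 input splits
    println(rdd.partitions.size)  // check how many partitions were actually created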

Thanks,
Peter

On 2015-02-03 23:30, Sven Krasser wrote:
Hey Joe,

With the ephemeral HDFS, you get the instance store of your worker nodes. For m3.xlarge that will be two 40 GB SSDs local to each instance, which are very fast.

For the persistent HDFS, you get whatever EBS volumes the launch script configured. EBS volumes are always network drives, so the usual limitations apply. To optimize throughput, you can use EBS volumes with provisioned IOPS and you can use EBS optimized instances. I don't have hard numbers at hand, but I'd expect this to be noticeably slower than using local SSDs.

As far as only using S3 goes, it depends on your use case (i.e. what you plan on doing with the data while it is there). If you store it there in between running different applications, you can likely work around consistency issues.

Also, if you use Amazon's EMRFS to access data in S3, you can use their new consistency feature (https://aws.amazon.com/blogs/aws/emr-consistent-file-system/).

Hope this helps!
-Sven


On Tue, Feb 3, 2015 at 9:32 AM, Joe Wass <jw...@crossref.org> wrote:

    The data is coming from S3 in the first place, and the results
    will be uploaded back there. But even in the same availability
    zone, fetching 170 GB (that's gzipped) is slow. From what I
    understand of the pipelines, multiple transforms on the same RDD
    might involve re-reading the input, which very quickly adds up
    compared to having the data locally. I could avoid that by
    persisting the data (which I am in fact doing), but that would
    involve storing approximately the same amount of data in HDFS,
    which wouldn't fit.
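
    To be concrete, the persist pattern I mean is roughly the
    following (the path and the filters are just placeholders):

        import org.apache.spark.storage.StorageLevel

        val input = sc.textFile("s3n://my-bucket/input/*.gz")   // read gzipped input from S3
        input.persist(StorageLevel.MEMORY_AND_DISK)             // keep it local after the first read

        val countA = input.filter(_.contains("a")).count()      // first action: reads from S3 once
        val countB = input.filter(_.contains("b")).count()      // second action: served from the cache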

    Also, I understood that S3 was unsuitable for practical use? See
    "Why you cannot use S3 as a replacement for HDFS"[0]. I'd love to
    be proved wrong, though; that would make things a lot easier.

    [0] http://wiki.apache.org/hadoop/AmazonS3



    On 3 February 2015 at 16:45, David Rosenstrauch <dar...@darose.net> wrote:

        You could also just push the data to Amazon S3, which would
        decouple the size of the cluster needed to process the data
        from the size of the data itself.
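
        A rough sketch of that pattern (bucket names are
        hypothetical):

            val data = sc.textFile("s3n://my-bucket/input")      // read input straight from S3
            val results = data.filter(_.nonEmpty)                // stand-in for the real processing
            results.saveAsTextFile("s3n://my-bucket/output")     // write results back to S3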

        DR


        On 02/03/2015 11:43 AM, Joe Wass wrote:

            I want to process about 800 GB of data on an Amazon EC2
            cluster, so I need to store the input in HDFS somehow.

            I currently have a cluster of 5 x m3.xlarge, each of
            which has 80 GB of disk. Each HDFS node reports 73 GB,
            and the total capacity is ~370 GB.
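
            (Back-of-the-envelope check, assuming HDFS's default
            replication factor of 3: 5 nodes x 73 GB ≈ 365 GB raw,
            which is only ~120 GB of usable replicated space, so
            800 GB won't fit either way.)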

            If I want to process 800 GB of data (assuming I can't
            split the jobs up), I'm guessing I need to get
            persistent-hdfs involved.

            1 - Does persistent-hdfs have noticeably different
            performance than ephemeral-hdfs?
            2 - If so, is there a recommended configuration (like
            storing input and output on persistent, but persisted
            RDDs on ephemeral)?

            This seems like a common use case, so sorry if this has
            already been covered.

            Joe



--
http://sites.google.com/site/krasser/?utm_source=sig
