Re: ephemeral-hdfs vs persistent-hdfs - performance
Joe, I also use S3 and gzip. So far the I/O is not a problem. In my case the operation is SQLContext.jsonFile(), and I can see from Ganglia that the whole cluster is CPU bound (99% saturated). I have 160 cores, and the network sustains about 150 Mbit/s.

Kelvin

On Wed, Feb 4, 2015 at 10:20 AM, Aaron Davidson ilike...@gmail.com wrote:
The latter would be faster. With S3, you want to maximize the number of concurrent readers until you hit your network throughput limits.

On Wed, Feb 4, 2015 at 6:20 AM, Peter Rudenko petro.rude...@gmail.com wrote:
Hi, if I have a 10 GB file on S3 and set 10 partitions, would it download the whole file to the master first and broadcast it, or would each worker just read its own range from the file?

Thanks,
Peter
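Peter's question (does each worker read its own byte range, or does the master download and broadcast the file?) can be illustrated with a small sketch of the split arithmetic. This is not Spark's actual planner, just the spirit of how Hadoop-style FileInputFormat divides a file; note it does not apply to a single gzipped file, since gzip is not splittable:

```python
def compute_splits(file_size, num_splits):
    """Divide a file into contiguous byte ranges, one per partition.

    Mirrors (in spirit) how Hadoop-style input formats plan splits;
    the real planner also respects block boundaries and min/max
    split sizes.
    """
    base = file_size // num_splits
    splits = []
    start = 0
    for i in range(num_splits):
        # The last split absorbs any remainder.
        end = file_size if i == num_splits - 1 else start + base
        splits.append((start, end))
        start = end
    return splits

# A 10 GB file split 10 ways: ten (start, end) byte ranges.
splits = compute_splits(10 * 1024**3, 10)
```

Each task then issues its own ranged GET against S3 for its (start, end) pair, so nothing flows through the master.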
Re: ephemeral-hdfs vs persistent-hdfs - performance
The latter would be faster. With S3, you want to maximize the number of concurrent readers until you hit your network throughput limits.

On Wed, Feb 4, 2015 at 6:20 AM, Peter Rudenko petro.rude...@gmail.com wrote:
Hi, if I have a 10 GB file on S3 and set 10 partitions, would it download the whole file to the master first and broadcast it, or would each worker just read its own range from the file?

Thanks,
Peter
Re: ephemeral-hdfs vs persistent-hdfs - performance
You could also just push the data to Amazon S3, which would decouple the size of the cluster needed to process the data from the size of the data.

DR
ephemeral-hdfs vs persistent-hdfs - performance
I want to process about 800 GB of data on an Amazon EC2 cluster, so I need to store the input in HDFS somehow.

I currently have a cluster of 5 x m3.xlarge, each of which has 80 GB of disk. Each HDFS node reports 73 GB, and the total capacity is ~370 GB. If I want to process 800 GB of data (assuming I can't split the jobs up), I'm guessing I need to get persistent-hdfs involved.

1 - Does persistent-hdfs have noticeably different performance than ephemeral-hdfs?
2 - If so, is there a recommended configuration (like storing input and output on persistent, but persisted RDDs on ephemeral)?

This seems like a common use case, so sorry if this has already been covered.

Joe
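One caveat on the capacity numbers above: raw HDFS capacity is not usable capacity once block replication is factored in. A back-of-the-envelope check, assuming HDFS's default replication factor of 3 (spark-ec2 clusters may be configured differently):

```python
nodes = 5
reported_per_node_gb = 73   # capacity each HDFS datanode reports
replication = 3             # HDFS default dfs.replication (assumed here)

raw_gb = nodes * reported_per_node_gb   # ~365 GB raw, matching "~370 GB"
usable_gb = raw_gb / replication        # ~122 GB of actual data
```

So with default replication, even the ~370 GB figure overstates how much of the 800 GB input would actually fit.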
Re: ephemeral-hdfs vs persistent-hdfs - performance
The data is coming from S3 in the first place, and the results will be uploaded back there. But even in the same availability zone, fetching 170 GB (that's gzipped) is slow.

From what I understand of the pipelines, multiple transforms on the same RDD might involve re-reading the input, which very quickly adds up compared to having the data locally. Unless I persist the data (which I am in fact doing), but that would involve storing approximately the same amount of data in HDFS, which wouldn't fit.

Also, I understood that S3 was unsuitable for practical use? See "Why you cannot use S3 as a replacement for HDFS" [0]. I'd love to be proved wrong, though; that would make things a lot easier.

[0] http://wiki.apache.org/hadoop/AmazonS3
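Joe's re-reading concern comes from RDD laziness: every action re-evaluates the lineage back to the source unless something is persisted. The toy model below is plain Python, not Spark, but it captures the trade-off he describes (repeated reads without caching, versus storage cost with it):

```python
class LazySource:
    """Toy model of a lazy input: counts how many times it is re-read."""
    def __init__(self, data):
        self.data = data
        self.reads = 0

    def iterate(self):
        self.reads += 1          # one full pass over the "input"
        return iter(self.data)

source = LazySource(range(5))

# Two separate "actions" over the same lazy source: without caching,
# each one re-reads the input from scratch -- Joe's concern.
total = sum(source.iterate())
count = len(list(source.iterate()))
assert source.reads == 2

# "Persisting" (materializing once) avoids further re-reads, at the
# cost of storing the data -- which is Joe's space problem.
cached = list(source.iterate())      # one more read to build the cache
total2, count2 = sum(cached), len(cached)
assert source.reads == 3             # subsequent actions reuse `cached`
```

In Spark terms, the "cache" step corresponds to rdd.persist(); a serialized, disk-spillable storage level can soften (though not remove) the space cost.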
Re: ephemeral-hdfs vs persistent-hdfs - performance
We use S3 as the main storage for all our input data and our generated (output) data (tens of terabytes daily). We read gzipped data directly from S3 in our Hadoop/Spark jobs; it's not crazily slow, as long as you parallelize the work well by distributing the processing across enough machines (about 100 nodes, in our case).

The way we generally operate, storage-wise, is: read input directly from S3, write output from Hadoop/Spark jobs to HDFS, then after the job is complete distcp the relevant output from HDFS back to S3. Works for us ... YMMV. :-)

HTH,
DR
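David's read-from-S3 / write-to-HDFS / copy-back workflow can be sketched as shell steps. The bucket names, paths, and spark-submit arguments here are placeholders, not commands from the thread:

```shell
# 1. The job reads input straight off S3 and writes output to HDFS.
#    (s3n:// was the usual scheme in early 2015; s3a:// supersedes it.)
spark-submit my_job.py \
  --input  s3n://my-bucket/input/ \
  --output hdfs:///jobs/output/

# 2. After the job completes, bulk-copy the results back to S3.
hadoop distcp hdfs:///jobs/output/ s3n://my-bucket/output/
```

distcp runs the copy as a distributed job itself, which is why it beats funneling the upload through a single machine.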
Re: ephemeral-hdfs vs persistent-hdfs - performance
Using the s3a protocol (introduced in Hadoop 2.6.0) would be faster compared to s3. The upcoming Hadoop 2.7.0 contains some bug fixes for s3a. FYI
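Switching to s3a is mostly a matter of the URL scheme plus Hadoop configuration. A minimal core-site.xml fragment for Hadoop 2.6.x might look like the following (property names from the hadoop-aws module; the key values are obviously placeholders):

```xml
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```

Then read with paths like s3a://bucket/path instead of s3n://bucket/path.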
Re: ephemeral-hdfs vs persistent-hdfs - performance
Thanks very much, that's good to know; I'll certainly give it a look. Can you give me a hint about how you unzip your input files on the fly? I thought that it wasn't possible to parallelize zipped inputs unless they were unzipped before passing to Spark?

Joe
Re: ephemeral-hdfs vs persistent-hdfs - performance
Not all of our input files are zipped. The ones that are obviously are not parallelized; they're just processed by a single task. Not a big issue for us, though, as those zipped files aren't too big.

DR
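The single-task behaviour David mentions is a property of the gzip format itself, not of Spark: decompression state depends on all preceding bytes, so a reader cannot start in the middle of the stream. A small pure-Python illustration (not Spark code):

```python
import gzip
import zlib

payload = b"line\n" * 1000
blob = gzip.compress(payload)

# Only the start of the stream carries the gzip header (magic 1f 8b);
# decompressing from the beginning works fine.
assert blob[:2] == b"\x1f\x8b"
assert gzip.decompress(blob) == payload

# A decompressor pointed at the middle of the stream has neither the
# header nor the preceding history, so a byte range is meaningless.
# This is why one gzip file = one task/partition in Hadoop and Spark.
d = zlib.decompressobj(zlib.MAX_WBITS | 16)  # 16 -> expect gzip wrapper
try:
    d.decompress(blob[len(blob) // 2:])
    mid_stream_ok = True
except zlib.error:
    mid_stream_ok = False
assert not mid_stream_ok
```

A common workaround is to call repartition() right after reading a gzipped file, so at least the downstream stages run in parallel even though the initial read is a single task.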
Re: ephemeral-hdfs vs persistent-hdfs - performance
Hey Joe,

With the ephemeral HDFS, you get the instance store of your worker nodes. For m3.xlarge that will be two 40 GB SSDs local to each instance, which are very fast.

For the persistent HDFS, you get whatever EBS volumes the launch script configured. EBS volumes are always network drives, so the usual limitations apply. To optimize throughput, you can use EBS volumes with provisioned IOPS and you can use EBS-optimized instances. I don't have hard numbers at hand, but I'd expect this to be noticeably slower than using local SSDs.

As far as only using S3 goes, it depends on your use case (i.e. what you plan on doing with the data while it is there). If you store it there in between running different applications, you can likely work around consistency issues. Also, if you use Amazon's EMRFS to access data in S3, you can use their new consistency feature (https://aws.amazon.com/blogs/aws/emr-consistent-file-system/).

Hope this helps!
-Sven