Re: No disk single pass RDD aggregation

2014-12-18 Thread Jim Carroll
Hi,

This was all my fault. It turned out I had a line of code buried in a
library that did a repartition. I used this library to wrap an RDD to
present it to legacy code as a different interface. That's what was causing
the data to spill to disk.
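
In case it saves anyone else a day: repartition() always introduces a shuffle,
and Spark writes shuffle output to local disk (spark.local.dir, /tmp by
default) regardless of the RDD's storage level. A sketch of the shape of the
bug, with the wrapper library elided and the file name made up:

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(
    new SparkConf().setAppName("repartition-spill").setMaster("local[2]"))

  // Intended: a single pass, nothing persisted.
  val rdd = sc.textFile("data.txt")

  // Hidden inside the wrapper was the equivalent of this line. It forces a
  // shuffle, and the shuffle files land in spark.local.dir (/tmp by default)
  // -- which is exactly the "spilling" I was seeing.
  val reshuffled = rdd.repartition(8)

  val total = reshuffled.aggregate(0L)((acc, line) => acc + line.length, _ + _)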

The really stupid thing is that it took me the better part of a day to find,
and prompted several misguided emails to this list (including the one that
started this thread).

Sorry about that.

Jim







Re: No disk single pass RDD aggregation

2014-12-17 Thread thanhtien522
Jim Carroll wrote
 Okay,
 
 I have an rdd that I want to run an aggregate over but it insists on
 spilling to disk even though I structured the processing to only require a
 single pass.
 
 In other words, I can do all of my processing one entry in the rdd at a
 time without persisting anything.
 
 I set rdd.persist(StorageLevel.NONE) and it had no effect. When I run
 locally, my /tmp directory fills with transient RDD data even though
 I never need the data again after a row has been processed. Is there a way
 to turn this off?
 
 Thanks
 Jim

hi,
Do you have many input files?
If so, try setting:

conf.set("spark.shuffle.consolidateFiles", "true");
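
For reference, a fuller sketch of where that goes, before the SparkContext is
created (the app name here is made up):

  import org.apache.spark.{SparkConf, SparkContext}

  // Merges the many small files each map task writes during a shuffle
  // into fewer, larger files.
  val conf = new SparkConf()
    .setAppName("my-app") // example name only
    .set("spark.shuffle.consolidateFiles", "true")

  val sc = new SparkContext(conf)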

Hope this helps.






Re: No disk single pass RDD aggregation

2014-12-16 Thread Jim Carroll
In case a little more information is helpful:

the RDD is constructed using sc.textFile(fileUri), where the fileUri points to
a .gz file (one that's too big to fit on my disk).

I do an rdd.persist(StorageLevel.NONE) and it seems to have no effect.

This RDD is what I'm calling aggregate on, and I expect to traverse it only
once. No row in the RDD ever has to be revisited. The aggregate seqOp modifies
a current state and returns it, so there's no need to store the seqOp's result
on a row-by-row basis, and given that there's only one partition, the combOp
doesn't even need to do anything (there's nothing to combine across
partitions).
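
Roughly, the shape of what I'm doing; the state class and its fields are made
up for illustration, only the aggregate call matters:

  import org.apache.spark.{SparkConf, SparkContext}

  // Hypothetical running state, mutated in place one row at a time.
  class Stats(var count: Long = 0L, var bytes: Long = 0L) extends Serializable {
    def add(line: String): Stats = { count += 1; bytes += line.length; this }
    def merge(o: Stats): Stats = { count += o.count; bytes += o.bytes; this }
  }

  val sc = new SparkContext(
    new SparkConf().setAppName("single-pass").setMaster("local[2]"))

  // A .gz file isn't splittable, so this RDD has exactly one partition.
  val rdd = sc.textFile("file:///data/big.csv.gz") // path made up

  val stats = rdd.aggregate(new Stats())(
    (acc, line) => acc.add(line), // seqOp: mutate and return the running state
    (a, b) => a.merge(b)          // combOp: nothing to combine with one partition
  )
  println(s"${stats.count} rows, ${stats.bytes} bytes")

Nothing there should need to be kept row by row, yet /tmp still fills up.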

Thanks for any help.
Jim







Re: No disk single pass RDD aggregation

2014-12-16 Thread Jim Carroll
Nvm. I'm going to post another question, since this has to do with the way
Spark handles sc.textFile with a file:// URL to a .gz.


