Re: No disk single pass RDD aggregation
Hi,

This was all my fault. It turned out I had a line of code buried in a library that did a repartition. I used this library to wrap an RDD to present it to legacy code as a different interface. That repartition was what was causing the data to spill to disk.

The really stupid thing is it took me the better part of a day to find, and several misguided emails to this list (including the one that started this thread). Sorry about that.

Jim

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-disk-single-pass-RDD-aggregation-tp20723p20763.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: No disk single pass RDD aggregation
Jim Carroll wrote:
> Okay, I have an RDD that I want to run an aggregate over, but it insists
> on spilling to disk even though I structured the processing to only
> require a single pass. In other words, I can do all of my processing one
> entry in the RDD at a time without persisting anything.
>
> I set rdd.persist(StorageLevel.NONE) and it had no effect. When I run
> locally, my /tmp directory fills with transient RDD data even though I
> never need the data again after the row has been processed.
>
> Is there a way to turn this off?
>
> Thanks
> Jim

Hi,

Do you have many input files? If so, try setting:

conf.set("spark.shuffle.consolidateFiles", "true")

Hope this helps.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-disk-single-pass-RDD-aggregation-tp20723p20753.html
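For anyone finding this in the archives: the property name and value are plain strings, and the same option can also be set in spark-defaults.conf rather than in code. A sketch, assuming a Spark version whose hash-based shuffle still supports this flag:

```
# spark-defaults.conf (sketch): with the hash-based shuffle, this merges the
# per-map shuffle output files into fewer, larger files on local disk.
spark.shuffle.consolidateFiles  true
```

Note this only reduces the number of shuffle files; it does not prevent shuffle data from being written to disk in the first place.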
Re: No disk single pass RDD aggregation
In case a little more information is helpful: the RDD is constructed using sc.textFile(fileUri), where fileUri points to a .gz file (one that's too big to fit on my disk). I do an rdd.persist(StorageLevel.NONE) and it seems to have no effect.

This RDD is what I'm calling aggregate on, and I expect to use it only once; each row in the RDD never has to be revisited. The aggregate's seqOp modifies a current state and returns it, so there's no need to store the results of the seqOp on a row-by-row basis, and given the fact that there's one partition, the combOp doesn't even need to be called (since there would be nothing to combine across partitions).

Thanks for any help.

Jim

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-disk-single-pass-RDD-aggregation-tp20723p20724.html
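To illustrate the pattern described above, here is a sketch in plain Scala (not Spark; the Stats type, its fields, and the sample rows are made up for illustration). With a single partition, rdd.aggregate(zero)(seqOp, combOp) amounts to a left fold of seqOp over the rows, so the running state can be mutated in place and combOp never fires:

```scala
// Hypothetical running state; any mutable accumulator works the same way.
final case class Stats(var count: Long = 0L, var sum: Double = 0.0)

// seqOp: fold one row into the state and return the same (mutated) state,
// so nothing per-row ever needs to be retained.
def seqOp(acc: Stats, row: Double): Stats = {
  acc.count += 1
  acc.sum   += row
  acc
}

// combOp: merges states across partitions; with one partition it is never called.
def combOp(a: Stats, b: Stats): Stats =
  Stats(a.count + b.count, a.sum + b.sum)

// Stand-in for the rows of the single-partition, gz-backed RDD.
val rows   = Iterator(1.0, 2.0, 3.0)
val result = rows.foldLeft(Stats())(seqOp) // single pass, constant memory
```

The point of returning the mutated accumulator from seqOp is that the aggregation itself needs no per-row storage; any disk usage has to come from somewhere else in the pipeline (as it turned out to, later in this thread).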
Re: No disk single pass RDD aggregation
Nvm. I'm going to post another question, since this has to do with the way Spark handles sc.textFile with a file:// URI to a .gz file.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-disk-single-pass-RDD-aggregation-tp20723p20725.html