General question on persist
I have a general question on when persisting will be beneficial and when it won't: I have a task that runs as follow keyedRecordPieces = records.flatMap( record = Seq(key, recordPieces)) partitoned = keyedRecordPieces.partitionBy(KeyPartitioner) partitoned.mapPartitions(doComputation).save() Is there value in having a persist somewhere here? For example if the flatMap step is particularly expensive, will it ever be computed twice when there are no failures? Thanks Arun
Re: General question on persist
Hi Arun, The intermediate results like keyedRecordPieces will not be materialized. This indicates that if you run partitoned = keyedRecordPieces.partitionBy(KeyPartitioner) partitoned.mapPartitions(doComputation).save() again, the keyedRecordPieces will be re-computed . In this case, cache or persist keyedRecordPieces is a good idea to eliminate unnecessary expensive computation. What you can probably do is keyedRecordPieces = records.flatMap( record = Seq(key, recordPieces)).cache() Which will cache the RDD referenced by keyedRecordPieces in memory. For more options on cache and persist, take a look at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD. There are two APIs you can use to persist RDDs and one allows you to specify storage level. Thanks, Liquan On Tue, Sep 23, 2014 at 2:08 PM, Arun Ahuja aahuj...@gmail.com wrote: I have a general question on when persisting will be beneficial and when it won't: I have a task that runs as follow keyedRecordPieces = records.flatMap( record = Seq(key, recordPieces)) partitoned = keyedRecordPieces.partitionBy(KeyPartitioner) partitoned.mapPartitions(doComputation).save() Is there value in having a persist somewhere here? For example if the flatMap step is particularly expensive, will it ever be computed twice when there are no failures? Thanks Arun -- Liquan Pei Department of Physics University of Massachusetts Amherst
Re: General question on persist
Thanks Liquan, that makes sense, but if I am only doin the computation once, there will essentially be no difference, correct? I had second question related to mapPartitions 1) All of the records of the Iterator[T] that a single function call in mapPartitions process must fit into memory, correct? 2) Is there someway to process that iterator in sorted order? Thanks! Arun On Tue, Sep 23, 2014 at 5:21 PM, Liquan Pei liquan...@gmail.com wrote: Hi Arun, The intermediate results like keyedRecordPieces will not be materialized. This indicates that if you run partitoned = keyedRecordPieces.partitionBy(KeyPartitioner) partitoned.mapPartitions(doComputation).save() again, the keyedRecordPieces will be re-computed . In this case, cache or persist keyedRecordPieces is a good idea to eliminate unnecessary expensive computation. What you can probably do is keyedRecordPieces = records.flatMap( record = Seq(key, recordPieces)).cache() Which will cache the RDD referenced by keyedRecordPieces in memory. For more options on cache and persist, take a look at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD. There are two APIs you can use to persist RDDs and one allows you to specify storage level. Thanks, Liquan On Tue, Sep 23, 2014 at 2:08 PM, Arun Ahuja aahuj...@gmail.com wrote: I have a general question on when persisting will be beneficial and when it won't: I have a task that runs as follow keyedRecordPieces = records.flatMap( record = Seq(key, recordPieces)) partitoned = keyedRecordPieces.partitionBy(KeyPartitioner) partitoned.mapPartitions(doComputation).save() Is there value in having a persist somewhere here? For example if the flatMap step is particularly expensive, will it ever be computed twice when there are no failures? Thanks Arun -- Liquan Pei Department of Physics University of Massachusetts Amherst