Re: RDD Persistence synchronization
Thanks again, Sean. The thing is that we persist and count that RDD in the hope that later actions on it won't trigger the earlier recomputations. It's not really about performance here: the recomputation includes UUID generation, and the UUIDs need to stay the same for all further actions. I understand that the RDD concept is based on lineage, which somewhat contradicts our goal, but is there any way to guarantee that the RDD is persisted, or to make the job fail when persisting fails?

On 29 March 2015 at 12:51, Sean Owen so...@cloudera.com wrote:

--
RGRDZ Harut
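To make the failure mode concrete, here is a plain-Python sketch (not Spark code) of why recomputation breaks UUID stability: re-evaluating a lazy pipeline that calls uuid.uuid4() yields different IDs each time, which is exactly what happens when an unpersisted RDD's lineage is recomputed by a later action.

```python
import uuid

# A stand-in for a "lineage": each evaluation recomputes from scratch,
# generating fresh UUIDs -- analogous to recomputing an RDD.
def recompute_ids(records):
    return [(r, str(uuid.uuid4())) for r in records]

records = ["a", "b", "c"]
first_action = recompute_ids(records)   # e.g. triggered by count()
second_action = recompute_ids(records)  # triggered by a later action

# The same record gets a different UUID on each recomputation:
assert first_action[0][0] == second_action[0][0]  # same record...
assert first_action[0][1] != second_action[0][1]  # ...different UUID
```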
Re: RDD Persistence synchronization
persist() completes immediately since it only marks the RDD for persistence. count() triggers computation of the RDD, and as the RDD is computed it is persisted. The following transform should therefore only start after count(), and therefore after the persistence completes. I think there might be corner cases where you still see some of the RDD recomputed, for example if a persisted block is lost or otherwise unavailable later.

On Sun, Mar 29, 2015 at 9:07 AM, Harut Martirosyan harut.martiros...@gmail.com wrote:

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
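A plain-Python analogy of the contract described above (a toy illustration, not Spark's implementation): persist() only sets a flag, the first action both computes and fills the cache, and later actions read the cached values.

```python
import uuid

class LazyDataset:
    """Toy analogue of an RDD: lazy, recomputable, optionally cached."""
    def __init__(self, records):
        self.records = records
        self.persist_requested = False  # persist() only sets this flag
        self.cache = None

    def persist(self):
        self.persist_requested = True   # returns immediately; computes nothing
        return self

    def _compute(self):
        # Recomputation generates fresh UUIDs each time.
        data = [(r, str(uuid.uuid4())) for r in self.records]
        if self.persist_requested:
            self.cache = data           # computed data gets cached
        return data

    def collect(self):
        return self.cache if self.cache is not None else self._compute()

    def count(self):
        return len(self.collect())      # an action: forces computation

ds = LazyDataset(["a", "b"]).persist()
assert ds.cache is None                 # persist() alone computed nothing
ds.count()                              # first action computes and caches
assert ds.collect() == ds.collect()     # later actions see identical UUIDs
```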
RDD Persistence synchronization
Hi.

rdd.persist()
rdd.count()
rdd.transform()...

Is there a chance transform() runs before persist() is complete?

--
RGRDZ Harut
Re: RDD Persistence synchronization
I don't think you can guarantee that there is no recomputation. Even if you persist(), you might lose a block and have to recompute it.

You can persist your UUIDs to storage like HDFS; they won't change then, of course. I suppose you still face a much narrower problem: the act of computing the UUIDs in order to immediately save them may fail and restart. Downstream processes would only ever observe one set of UUIDs, though, even if inside the process some UUIDs were created and lost. I suppose that would only matter if you need sequential IDs or something, but then you're getting into territory where you need a different model of computation from Spark.

On Sun, Mar 29, 2015 at 10:02 AM, Harut Martirosyan harut.martiros...@gmail.com wrote:
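A minimal sketch of the "write once, read back" approach described above, using a local file in place of HDFS (in a real job this would be something like saveAsTextFile() followed by re-reading the saved output; the file path and helper names here are illustrative):

```python
import os
import uuid

def assign_and_save_ids(records, path):
    """Generate UUIDs once and write them out; downstream jobs read the file."""
    lines = [f"{r}\t{uuid.uuid4()}" for r in records]
    with open(path, "w") as f:
        f.write("\n".join(lines))

def load_ids(path):
    """Every later read observes the same, now-immutable UUIDs."""
    with open(path) as f:
        return dict(line.split("\t") for line in f.read().splitlines())

path = "record_ids.tsv"
assign_and_save_ids(["a", "b", "c"], path)
first_read = load_ids(path)
second_read = load_ids(path)
assert first_read == second_read        # stable across all downstream reads
os.remove(path)
```

If the job writing the IDs fails and restarts, the rewritten file may contain different UUIDs, but as noted above, downstream readers only ever observe one completed set.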