Re: RDD Persistence synchronization

2015-03-29 Thread Harut Martirosyan
Thanks to you again, Sean.

The thing is that we persist and count that RDD in the hope that later
actions on it won't trigger recomputation of the earlier stages. It's not
really about performance here; the recomputation includes UUID generation,
and the UUIDs must stay the same for all further actions.

I understand that the RDD concept is based on lineage, which kind of
contradicts our goal, but is there any way to guarantee that it's persisted,
or to make it fail when persisting fails?

On 29 March 2015 at 12:51, Sean Owen so...@cloudera.com wrote:

 persist() completes immediately since it only marks the RDD for
 persistence. count() triggers computation of rdd, and as rdd is
 computed it will be persisted. The following transform should
 therefore only start after count() and therefore after the persistence
 completes. I think there might be corner cases where you still see
 some of rdd computed, like, if a persisted block is lost or otherwise
 unavailable later.

 On Sun, Mar 29, 2015 at 9:07 AM, Harut Martirosyan
 harut.martiros...@gmail.com wrote:
  Hi.
 
  rdd.persist()
  rdd.count()
 
  rdd.transform()...
 
  is there a chance transform() runs before persist() is complete?
 
  --
  RGRDZ Harut




-- 
RGRDZ Harut


Re: RDD Persistence synchronization

2015-03-29 Thread Sean Owen
persist() completes immediately since it only marks the RDD for
persistence. count() triggers computation of rdd, and as rdd is
computed it will be persisted. The following transform should
therefore only start after count() returns, and so after the
persistence completes. I think there might be corner cases where you
still see some of rdd recomputed, for example if a persisted block is
lost or otherwise unavailable later.
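
To make the ordering above concrete, here is a minimal sketch in plain
Python. It is a toy model, not Spark: the ToyRDD class, its collect() and
lose_cache() methods, and the computations counter are all hypothetical
names invented for illustration. It only mimics the behaviour described:
persist() just sets a flag, the first action fills the cache, and a lost
cache forces the lineage to run again (producing different UUIDs).

```python
import uuid


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute   # "lineage": how to rebuild the data
        self._persist = False     # set by persist(); nothing is computed yet
        self._cache = None        # filled by the first action after persist()
        self.computations = 0     # how many times the lineage actually ran

    def persist(self):
        # Only marks the dataset for caching and returns immediately.
        self._persist = True
        return self

    def collect(self):
        # An action: serve from the cache if present, else run the lineage.
        if self._cache is not None:
            return list(self._cache)
        self.computations += 1
        data = self._compute()
        if self._persist:
            self._cache = list(data)
        return data

    def count(self):
        # Another action, built on collect() for simplicity.
        return len(self.collect())

    def lose_cache(self):
        # Simulates a persisted block being evicted or lost.
        self._cache = None


rdd = ToyRDD(lambda: [str(uuid.uuid4()) for _ in range(3)])
rdd.persist()
computed_after_persist = rdd.computations  # persist() alone computes nothing
rdd.count()                                # first action: compute and cache
first = rdd.collect()                      # served from the cache
second = rdd.collect()                     # same cached UUIDs again
rdd.lose_cache()
third = rdd.collect()                      # recomputed: fresh UUIDs
```

In this toy, first and second are identical, while third differs because
the UUIDs were regenerated after the cache was dropped. That is the corner
case mentioned above.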

On Sun, Mar 29, 2015 at 9:07 AM, Harut Martirosyan
harut.martiros...@gmail.com wrote:
 Hi.

 rdd.persist()
 rdd.count()

 rdd.transform()...

 is there a chance transform() runs before persist() is complete?

 --
 RGRDZ Harut

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RDD Persistence synchronization

2015-03-29 Thread Harut Martirosyan
Hi.

rdd.persist()
rdd.count()

rdd.transform()...

is there a chance transform() runs before persist() is complete?

-- 
RGRDZ Harut


Re: RDD Persistence synchronization

2015-03-29 Thread Sean Owen
I don't think you can guarantee that there is no recomputation. Even
if you persist(), you might lose a block and have to recompute it.

You can save your UUIDs to storage like HDFS. They won't change
then, of course. I suppose you still face a much narrower problem: the
job that computes the UUIDs in order to immediately save them may
fail and restart. Downstream processes would only ever observe one
set of UUIDs though, even if, inside that job, some UUIDs were
created and lost. I suppose that might only matter if you need
sequential IDs or something, but then you're getting into territory
where you need a different model of computation from Spark.
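
As a rough illustration of the save-then-read idea, here is a sketch in
plain Python that uses a local temp file as a stand-in for HDFS. The
helper names (generate_ids, materialize, load_ids) and the file name are
assumptions made up for this example. In Spark the corresponding step
would presumably be something like saving the RDD with saveAsTextFile and
reading it back with textFile, but that is not shown here.

```python
import os
import tempfile
import uuid


def generate_ids(n):
    # Non-deterministic: every call yields different UUIDs.
    return [str(uuid.uuid4()) for _ in range(n)]


def materialize(ids, path):
    # Write to a temporary name and rename afterwards, so a reader never
    # sees a half-written set of IDs if the write fails and is retried.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write("\n".join(ids) + "\n")
    os.replace(tmp_path, path)


def load_ids(path):
    # Downstream steps read the stored IDs instead of regenerating them.
    with open(path) as f:
        return f.read().splitlines()


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "uuids.txt")  # hypothetical location

materialize(generate_ids(3), path)  # generate once, write once
first_read = load_ids(path)
second_read = load_ids(path)        # identical to first_read
```

Every reader of the stored file sees the same UUIDs, no matter how many
times the generation step had to be retried before the write succeeded.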
