Re: Is RDD thread safe?

2019-11-25 Thread Chang Chen
Thank you, Imran. I will check whether there is memory waste or not. On Tue, Nov 26, 2019 at 1:30 AM, Imran Rashid wrote: > I think Chang is right, but I also think this only comes up in limited > scenarios. I initially thought it wasn't a bug, but after some more > thought I have some concerns in light of the

Re: Is RDD thread safe?

2019-11-25 Thread Mridul Muralidharan
Very well put, Imran. This is a variant of executor failure after an RDD has been computed (including caching). In general, nondeterminism in Spark is going to lead to inconsistency. The only reasonable solution for us, at that time, was to make pseudo-randomness repeatable and checkpoint after so

Re: Is RDD thread safe?

2019-11-25 Thread Imran Rashid
I think Chang is right, but I also think this only comes up in limited scenarios. I initially thought it wasn't a bug, but after some more thought I have some concerns in light of the issues we've had with nondeterministic RDDs, e.g. repartition(). Say I have code like this: val cachedRDD =

Re: Is RDD thread safe?

2019-11-25 Thread Weichen Xu
Hmm, I haven't checked the code, but I think if an RDD is referenced in several places, the correct behavior should be: when this RDD's data is needed, it is computed and then cached only once; otherwise it should be treated as a bug. If you suspect there's a race condition, you could create

Re: Is RDD thread safe?

2019-11-24 Thread Chang Chen
Sorry, I didn't describe it clearly. The RDD id itself is thread-safe, but how about the cached data? See the code from BlockManager: def getOrElseUpdate(...) = { get[T](blockId)(classTag) match { case ... case _ => // 1. no data is cached. // Need to compute the
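
Editor's sketch of the concern raised in this message: when a cache lookup and the following insert are two separate steps, two threads can both observe a miss and both compute the value. The code below is a hypothetical, plain-Scala model and is not Spark's BlockManager; the object `CacheRaceSketch`, its methods, the value `42`, and the key `"rdd_0_0"` are assumptions made for illustration. A latch forces the unlucky interleaving so that the outcome is repeatable. Whether Spark itself can hit this interleaving is exactly what the thread is discussing.

```scala
import java.util.concurrent.{ConcurrentHashMap, CountDownLatch}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical model of a block cache; this is NOT Spark's BlockManager.
object CacheRaceSketch {

  // Runs `body` on two threads and waits for both to finish.
  private def onTwoThreads(body: => Unit): Unit = {
    val threads = Seq.fill(2)(new Thread(new Runnable { def run(): Unit = body }))
    threads.foreach(_.start())
    threads.foreach(_.join())
  }

  // Check-then-act: the lookup and the insert are separate steps.
  // Returns how many times the value was computed.
  def naiveComputes(): Int = {
    val store = new ConcurrentHashMap[String, Int]()
    val computes = new AtomicInteger(0)
    val bothMissed = new CountDownLatch(2)
    onTwoThreads {
      if (!store.containsKey("rdd_0_0")) {
        // Force the interleaving: neither thread inserts until both saw a miss.
        bothMissed.countDown()
        bothMissed.await()
        computes.incrementAndGet() // stands in for computing the partition
        store.put("rdd_0_0", 42)
      }
    }
    computes.get()
  }

  // Atomic alternative: computeIfAbsent applies the function at most once per key.
  def atomicComputes(): Int = {
    val store = new ConcurrentHashMap[String, Int]()
    val computes = new AtomicInteger(0)
    val compute = new java.util.function.Function[String, Int] {
      override def apply(key: String): Int = {
        computes.incrementAndGet() // stands in for computing the partition
        42
      }
    }
    onTwoThreads { store.computeIfAbsent("rdd_0_0", compute) }
    computes.get()
  }

  def main(args: Array[String]): Unit = {
    println(s"check-then-act computes: ${naiveComputes()}")
    println(s"computeIfAbsent computes: ${atomicComputes()}")
  }
}
```

In this model the check-then-act variant computes the value twice by construction, while the `computeIfAbsent` variant computes it once. For a deterministic computation the duplicate work is only wasted effort; the replies above point out that it matters more when the computation is nondeterministic.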

Re: Is RDD thread safe?

2019-11-24 Thread Weichen Xu
The RDD id is immutable, and it is generated when the RDD object is created. So why would there be a race condition in the "RDD id"? On Mon, Nov 25, 2019 at 11:31 AM Chang Chen wrote: > I am wondering about the concurrency semantics, in order to reason about correctness. If > two queries simultaneously run DAGs which

Re: Is RDD thread safe?

2019-11-24 Thread Chang Chen
I am wondering about the concurrency semantics, in order to reason about correctness. If two queries simultaneously run DAGs that use the same cached DF/RDD, but before the data has actually been cached, what will happen? Looking into the code a little, I suspect they have different BlockIDs for the same Dataset which

Re: Is RDD thread safe?

2019-11-11 Thread Weichen Xu
Hi Chang, RDDs/DataFrames are immutable and lazily computed. They are thread safe. Thanks! On Tue, Nov 12, 2019 at 12:31 PM Chang Chen wrote: > Hi all > > I have a case where I need to cache a source RDD, and then create different > DataFrames from it in different threads to accelerate queries. > > I

Is RDD thread safe?

2019-11-11 Thread Chang Chen
Hi all, I have a case where I need to cache a source RDD, and then create different DataFrames from it in different threads to accelerate queries. I know that SparkSession is thread safe (https://issues.apache.org/jira/browse/SPARK-15135), but I am not sure whether RDD is thread safe or not. Thanks