Hmm, I haven't checked the code, but I think that if an RDD is referenced in several places, the correct behavior should be: when the RDD's data is needed, it is computed and then cached only once. Anything else should be treated as a bug. If you suspect there is a race condition, you could create a JIRA ticket.
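
To make the "computed and then cached only once" expectation concrete, here is a minimal sketch in plain Scala. It is not taken from Spark's BlockManager: ComputeOnceCache, ComputeOnceDemo and the key "rdd_0_0" are hypothetical names made up for illustration, and the sketch assumes the JDK's ConcurrentHashMap.computeIfAbsent is an acceptable stand-in for whatever locking the real code does.

```scala
import java.util.concurrent.{ConcurrentHashMap, CountDownLatch, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger
import java.util.function.{Function => JFunction}

// Hypothetical stand-in for a block cache; this is not Spark's BlockManager.
final class ComputeOnceCache[K, V] {
  private val blocks = new ConcurrentHashMap[K, V]()

  // computeIfAbsent applies the mapping function at most once per key; a
  // concurrent caller for the same key waits until the value is stored.
  def getOrElseUpdate(key: K)(compute: => V): V =
    blocks.computeIfAbsent(key, new JFunction[K, V] {
      override def apply(k: K): V = compute
    })
}

object ComputeOnceDemo {
  // Returns how many times the block was computed by two racing callers.
  def computeCount(): Int = {
    val cache = new ComputeOnceCache[String, Vector[Int]]
    val computations = new AtomicInteger(0)
    val start = new CountDownLatch(1)
    val pool = Executors.newFixedThreadPool(2)

    val task = new Runnable {
      override def run(): Unit = {
        start.await() // hold both threads until the latch is released
        cache.getOrElseUpdate("rdd_0_0") {
          computations.incrementAndGet()
          Thread.sleep(50) // simulate an expensive partition computation
          Vector(1, 2, 3)
        }
      }
    }

    pool.execute(task)
    pool.execute(task)
    start.countDown()
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
    computations.get()
  }
}
```

Calling ComputeOnceDemo.computeCount() runs two threads that ask for the same key at the same time. With the lookup and the insert done as one atomic step, the counter should report a single computation; a separate "get, then put" would leave a window in which both threads see a miss and both compute.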

On Mon, Nov 25, 2019 at 12:21 PM Chang Chen <baibaic...@gmail.com> wrote:

> Sorry, I didn't describe it clearly. The RDD id itself is thread-safe, but
> what about the cached data?
>
> See this code from BlockManager:
>
> def getOrElseUpdate(...) = {
>   get[T](blockId)(classTag) match {
>     case ...
>     case _ => // 1. no data is cached.
>       // Need to compute the block
>   }
>   // Initially we hold no locks on this block
>   doPutIterator(...) match{..}
> }
>
> Consider two DAGs (containing the same cached RDD) running simultaneously.
> If both get None when they look up the same block in BlockManager (i.e. #1
> above), then I guess the same data would be cached twice.
>
> If the later cache write overrides the previous data and no memory is
> wasted, then this is OK.
>
> Thanks
> Chang
>
>
> On Mon, Nov 25, 2019 at 11:52 AM Weichen Xu <weichen...@databricks.com> wrote:
>
>> The RDD id is immutable, and it is generated when the RDD object is created.
>> So why would there be a race condition on the "rdd id"?
>>
>> On Mon, Nov 25, 2019 at 11:31 AM Chang Chen <baibaic...@gmail.com> wrote:
>>
>>> I am wondering about the concurrency semantics, in order to reason about
>>> correctness. If two queries simultaneously run DAGs that use the same
>>> cached DF/RDD, but before the data has actually been cached, what will
>>> happen?
>>>
>>> After looking into the code a little, I suspect they would have different
>>> BlockIds for the same Dataset, which is unexpected behavior, but there is
>>> no race condition.
>>>
>>> However, the RDD id is not lazy, so there is a race condition.
>>>
>>> Thanks
>>> Chang
>>>
>>>
>>> On Tue, Nov 12, 2019 at 1:22 PM Weichen Xu <weichen...@databricks.com> wrote:
>>>
>>>> Hi Chang,
>>>>
>>>> RDDs/DataFrames are immutable and lazily computed. They are thread safe.
>>>>
>>>> Thanks!
>>>>
>>>> On Tue, Nov 12, 2019 at 12:31 PM Chang Chen <baibaic...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all
>>>>>
>>>>> I have a case where I need to cache a source RDD and then create
>>>>> different DataFrames from it in different threads to accelerate queries.
>>>>>
>>>>> I know that SparkSession is thread safe (
>>>>> https://issues.apache.org/jira/browse/SPARK-15135), but I am not sure
>>>>> whether an RDD is thread safe or not.
>>>>>
>>>>> Thanks