Hmm, I haven't checked the code, but I think that if an RDD is referenced in
several places, the correct behavior should be: when the RDD's data is
needed, it is computed and then cached only once. Anything else should be
treated as a bug. If you suspect there is a race condition, you could create
a JIRA ticket.
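
To make "computed and then cached only once" concrete, here is a minimal
sketch in plain Scala. It is not Spark's actual BlockManager code:
ToyBlockStore, both method names, and the key "rdd_0_0" are hypothetical
names chosen for illustration. The racy variant mirrors the check-then-compute
shape quoted below, while the other variant relies on
ConcurrentHashMap.computeIfAbsent, which applies the function at most once
per key.

```scala
import java.util.concurrent.{ConcurrentHashMap, CountDownLatch, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object ComputeOnceSketch {
  // Hypothetical stand-in for a block store; NOT Spark's BlockManager.
  final class ToyBlockStore[V <: AnyRef] {
    private val blocks = new ConcurrentHashMap[String, V]()

    // Check-then-act: two threads can both see a miss and both compute.
    def getOrElseUpdateRacy(key: String)(compute: => V): V =
      Option(blocks.get(key)) match {
        case Some(cached) => cached
        case None =>
          val value = compute
          blocks.put(key, value) // a later put simply overrides an earlier one
          value
      }

    // Atomic per key: the function is applied at most once for a given key.
    def getOrElseUpdateOnce(key: String)(compute: => V): V =
      blocks.computeIfAbsent(key, new java.util.function.Function[String, V] {
        def apply(k: String): V = compute
      })
  }

  // Starts `threads` concurrent lookups of one key and returns how many
  // times the compute body actually ran.
  def computationsWithOnce(threads: Int): Int = {
    val store = new ToyBlockStore[Vector[Int]]
    val computeCount = new AtomicInteger(0)
    val start = new CountDownLatch(1)
    val pool = Executors.newFixedThreadPool(threads)
    (1 to threads).foreach { _ =>
      pool.execute(new Runnable {
        def run(): Unit = {
          start.await() // release all threads at the same moment
          store.getOrElseUpdateOnce("rdd_0_0") {
            computeCount.incrementAndGet()
            Vector(1, 2, 3)
          }
        }
      })
    }
    start.countDown()
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
    computeCount.get()
  }
}
```

With getOrElseUpdateOnce the counter should end at 1 no matter how many
threads race. With getOrElseUpdateRacy it could be higher, which is the
double-compute scenario being asked about.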

On Mon, Nov 25, 2019 at 12:21 PM Chang Chen <baibaic...@gmail.com> wrote:

> Sorry, I didn't describe it clearly. The RDD id itself is thread-safe, but
> how about the cached data?
>
> See this code from BlockManager:
>
> def getOrElseUpdate(...)   = {
>   get[T](blockId)(classTag) match {
>    case ...
>    case _ =>                                      // 1. no data is cached.
>     // Need to compute the block
>  }
>  // Initially we hold no locks on this block
>  doPutIterator(...) match{..}
> }
>
> Consider two DAGs (containing the same cached RDD) running simultaneously:
> if both get None when they look up the same block in the BlockManager
> (i.e. #1 above), then I guess the same data would be cached twice.
>
> If the later cache write overrides the previous data and no memory is
> wasted, then this is OK.
>
> Thanks
> Chang
>
>
> On Mon, Nov 25, 2019 at 11:52 AM Weichen Xu <weichen...@databricks.com> wrote:
>
>> The RDD id is immutable; it is generated when the RDD object is created.
>> So why would there be a race condition on the "rdd id"?
>>
>> On Mon, Nov 25, 2019 at 11:31 AM Chang Chen <baibaic...@gmail.com> wrote:
>>
>>> I am wondering about the concurrency semantics, so that I can reason
>>> about correctness. If two queries simultaneously run DAGs that use the
>>> same cached DF/RDD before the data has actually been cached, what happens?
>>>
>>> From looking into the code a little, I suspect they get different BlockIDs
>>> for the same Dataset, which is unexpected behavior but not a race condition.
>>>
>>> However, the RDD id is not lazy, so there is a race condition.
>>>
>>> Thanks
>>> Chang
>>>
>>>
>>> On Tue, Nov 12, 2019 at 1:22 PM Weichen Xu <weichen...@databricks.com> wrote:
>>>
>>>> Hi Chang,
>>>>
>>>> RDDs/DataFrames are immutable and lazily computed. They are thread-safe.
>>>>
>>>> Thanks!
>>>>
>>>> On Tue, Nov 12, 2019 at 12:31 PM Chang Chen <baibaic...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all
>>>>>
>>>>> I have a case where I need to cache a source RDD and then create
>>>>> different DataFrames from it in different threads to speed up queries.
>>>>>
>>>>> I know that SparkSession is thread-safe (
>>>>> https://issues.apache.org/jira/browse/SPARK-15135), but I am not sure
>>>>> whether an RDD is thread-safe or not.
>>>>>
>>>>> Thanks
>>>>>
>>>>
