I am wondering about the concurrency semantics, so that I can reason about
correctness. If two queries simultaneously run DAGs that use the same cached
DF/RDD, but before the data has actually been cached, what will happen?

After looking into the code a little, I suspect they get different BlockIDs
for the same Dataset, which is unexpected behavior, but there is no race
condition.

However, the RDD id is not lazy, so there is a race condition.
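
To make the scenario concrete, here is a minimal sketch in plain Python of two threads requesting the same lazily filled cache entry before it has been populated. This is a toy model, not Spark code: `LazyCache`, `get_or_compute`, and `expensive` are hypothetical names, and the sketch makes no claim about what Spark's block manager actually does. It only shows one possible answer, where a per-key lock makes the second thread wait instead of recomputing.

```python
import threading
import time


class LazyCache:
    """Toy lazily filled cache (hypothetical; not Spark's implementation)."""

    def __init__(self):
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()  # protects the two dicts above
        self.compute_count = 0          # how many times a value was computed

    def get_or_compute(self, key, compute):
        # Fetch (or create) the lock for this key under the global guard.
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        # Only one thread per key computes; the others block here and
        # then find the value already stored.
        with lock:
            if key not in self._values:
                self.compute_count += 1
                self._values[key] = compute()
            return self._values[key]


def expensive():
    time.sleep(0.1)  # stand-in for evaluating the not-yet-cached DAG
    return [x * x for x in range(5)]


cache = LazyCache()
results = []


def query():
    # Each "query" asks for the same cached source data.
    results.append(cache.get_or_compute("source", expensive))


threads = [threading.Thread(target=query) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the per-key lock, both threads could find the key missing and compute the value twice, which is the kind of duplicate work my question is about.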

Thanks
Chang


On Tue, Nov 12, 2019 at 1:22 PM Weichen Xu <weichen...@databricks.com> wrote:

> Hi Chang,
>
> RDDs/DataFrames are immutable and lazily computed. They are thread safe.
>
> Thanks!
>
> On Tue, Nov 12, 2019 at 12:31 PM Chang Chen <baibaic...@gmail.com> wrote:
>
>> Hi all
>>
>> I have a case where I need to cache a source RDD and then create different
>> DataFrames from it in different threads to accelerate queries.
>>
>> I know that SparkSession is thread safe
>> (https://issues.apache.org/jira/browse/SPARK-15135), but I am not sure
>> whether RDD is thread safe or not.
>>
>> Thanks
>>
>
