If you update the data, then you don't have the same DataFrame anymore. If
you don't do what Assaf did, caching the DataFrame and forcing its
evaluation before using it concurrently, then you'll still get
consistent and correct results, but not necessarily efficient ones. If
the fully materialized, cached results are not yet available when multiple
concurrent Jobs try to use the DataFrame, then you can end up with more
than one Job doing the same work to generate what needs to go in the cache.
To avoid that kind of work duplication you need some mechanism to ensure
that only one action/Job is run to populate the cache before multiple
actions/Jobs can then use the cached results efficiently.
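
To make the "only one action populates the cache" idea concrete, here is
a minimal sketch in plain Python, with no Spark involved. The names
MaterializeOnce, populate_cache and job are made up for illustration;
populate_cache stands in for whatever forces evaluation of the cached
DataFrame (for example, calling cache() and then count() on it). Treat it
as one possible mechanism under those assumptions, not as something Spark
provides for you.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class MaterializeOnce:
    """Hypothetical helper: runs an expensive populate step at most once,
    even when several threads ask for the result at the same time."""

    def __init__(self, populate):
        self._populate = populate
        self._lock = threading.Lock()
        self._done = False
        self._value = None

    def get(self):
        # Every caller takes the lock; the first one does the work, and
        # the others block until it finishes, then reuse its result.
        with self._lock:
            if not self._done:
                self._value = self._populate()
                self._done = True
        return self._value


populate_calls = []  # records how many times the expensive step ran


def populate_cache():
    # Stand-in for the single action that fills the cache
    # (assumption: in Spark this would be something like df.cache(); df.count()).
    populate_calls.append(threading.get_ident())
    time.sleep(0.05)  # simulate expensive work
    return list(range(5))


cached = MaterializeOnce(populate_cache)


def job(multiplier):
    # Stand-in for a concurrent Job that reads the cached results.
    rows = cached.get()
    return sum(r * multiplier for r in rows)


with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(job, [1, 2, 3, 4]))

print(results)
print(len(populate_calls))
```

All four worker threads get the same materialized rows, and the populate
step runs only once. Without the lock, several threads could each decide
the cache is empty and repeat the same work, which is the duplication
described above.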

On Mon, Feb 13, 2017 at 9:15 AM, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> How about having a thread that updates and caches a dataframe in-memory
> next to other threads requesting this dataframe, is it thread safe?
>
> 2017-02-13 9:02 GMT+01:00 Reynold Xin <r...@databricks.com>:
>
>> Yes your use case should be fine. Multiple threads can transform the same
>> data frame in parallel since they create different data frames.
>>
>>
>> On Sun, Feb 12, 2017 at 9:07 AM Mendelson, Assaf <assaf.mendel...@rsa.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I was wondering if dataframe is considered thread safe. I know the spark
>>> session and spark context are thread safe (and actually have tools to
>>> manage jobs from different threads), but the question is: can I use the
>>> same dataframe in both threads?
>>>
>>> The idea would be to create a dataframe in the main thread and then in
>>> two sub threads do different transformations and actions on it.
>>>
>>> I understand that some things might not be thread safe (e.g. if I
>>> unpersist in one thread it would affect the other. Checkpointing would
>>> cause similar issues), however, I can’t find any documentation as to what
>>> operations (if any) are thread safe.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>                 Assaf.
>>>
>>
>
