I would like to have your opinion about an idea I had...

I am thinking of addressing interactive queries on small/medium datasets (max 500 GB or 1 TB) with a solution based on the Thrift server and Spark cache management. Currently, the problem with caching a dataset in Spark is that you cannot get high data freshness, and the cache isn't resilient. If a dataframe is thread safe, would it be possible to implement a cache management strategy that periodically refreshes the cached dataset from the backends?
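
To make the idea more concrete, here is a minimal sketch of the refresh loop I have in mind. It is plain Python rather than real Spark code, and every name in it (RefreshingCache, load, release, interval_s) is hypothetical. The assumption is that readers always grab the current snapshot, while a background thread builds a fresh one and swaps the reference.

```python
import threading


class RefreshingCache:
    """Holds the current snapshot and replaces it on a timer (illustrative only)."""

    def __init__(self, load, release, interval_s):
        self._load = load              # hypothetical: builds and materializes a fresh snapshot
        self._release = release        # hypothetical: frees a snapshot that was replaced
        self._interval_s = interval_s
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._current = load()         # the first snapshot is loaded synchronously
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        """Start the background refresh thread."""
        self._thread.start()

    def get(self):
        """Return the snapshot that is current right now."""
        with self._lock:
            return self._current

    def refresh(self):
        """Load a new snapshot, swap it in, then release the old one."""
        fresh = self._load()           # the slow part runs outside the lock
        with self._lock:
            old, self._current = self._current, fresh
        self._release(old)

    def _run(self):
        # Event.wait returns False on timeout and True once stop() was called.
        while not self._stop.wait(self._interval_s):
            self.refresh()

    def stop(self):
        """Ask the refresh thread to exit and wait for it if it is running."""
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()
```

In Spark terms, I assume load would read from the backend, call cache() and then run an action such as count() so the data is really materialized, and release would call unpersist() on the old dataframe. Releasing the old snapshot immediately could affect queries still running on it, so a real version would probably need some grace period before the release.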

Another question, regarding persist with MEMORY_AND_DISK: what promotion/eviction strategy is implemented? Is it FIFO, LIFO, or heat-based?

Note: I already know Alluxio, and it could potentially solve this issue too. My question is about Spark only, since I would like to benefit from Project Tungsten and the no-serialization options...

2017-02-15 9:05 GMT+01:00 萝卜丝炒饭 <1427357...@qq.com>:

> updating dataframe returns NEW dataframe like RDD please?
>
> ---Original---
> *From:* "vincent gromakowski"<vincent.gromakow...@gmail.com>
> *Date:* 2017/2/14 01:15:35
> *To:* "Reynold Xin"<r...@databricks.com>;
> *Cc:* "user"<user@spark.apache.org>;"Mendelson, Assaf"<assaf.mendel...@rsa.com>;
> *Subject:* Re: is dataframe thread safe?
>
> How about having a thread that update and cache a dataframe in-memory next
> to other threads requesting this dataframe, is it thread safe ?
>
> 2017-02-13 9:02 GMT+01:00 Reynold Xin <r...@databricks.com>:
>
>> Yes your use case should be fine. Multiple threads can transform the same
>> data frame in parallel since they create different data frames.
>>
>>
>> On Sun, Feb 12, 2017 at 9:07 AM Mendelson, Assaf <assaf.mendel...@rsa.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I was wondering if dataframe is considered thread safe. I know the spark
>>> session and spark context are thread safe (and actually have tools to
>>> manage jobs from different threads) but the question is, can I use the same
>>> dataframe in both threads.
>>>
>>> The idea would be to create a dataframe in the main thread and then in
>>> two sub threads do different transformations and actions on it.
>>>
>>> I understand that some things might not be thread safe (e.g. if I
>>> unpersist in one thread it would affect the other. Checkpointing would
>>> cause similar issues), however, I can’t find any documentation as to what
>>> operations (if any) are thread safe.
>>>
>>> Thanks,
>>>
>>> Assaf.