I know spark takes care of executing everything in a distributed manner,
however, spark also supports having multiple threads on the same spark
session/context and knows (Through fair scheduler) to distribute the tasks from
them in a round robin.
The question is, can those two actions (with a different set of
transformations) be applied to the SAME dataframe.
Let’s say I want to do something like:
Val df = ???
df.cache()
df.count()
def f1(df: DataFrame): Unit = {
val df1 = df.groupby(something).agg(some aggs)
df1.write.parquet(“some path”)
}
def f2(df: DataFrame): Unit = {
val df2 = df.groupby(something else).agg(some different aggs)
df2.write.parquet(“some path 2”)
}
f1(df)
f2(df)
df.unpersist()
if the aggregations do not use the full cluster (e.g. because of data skewness,
because there aren’t enough partitions or any other reason) then this would
leave the cluster under utilized.
However, if I would call f1 and f2 on different threads, then df2 can use free
resources f1 has not consumed and the overall utilization would improve.
Of course, I can do this only if the operations on the dataframe are thread
safe. For example, if I would do a cache in f1 and an unpersist in f2 I would
get an inconsistent result. So my question is, what, if any are the legal
operations to use on a dataframe so I could do the above.
Thanks,
Assaf.
From: Jörn Franke [mailto:[email protected]]
Sent: Sunday, February 12, 2017 10:39 AM
To: Mendelson, Assaf
Cc: user
Subject: Re: is dataframe thread safe?
I am not sure what you are trying to achieve here. Spark is taking care of
executing the transformations in a distributed fashion. This means you must not
use threads - it does not make sense. Hence, you do not find documentation
about it.
On 12 Feb 2017, at 09:06, Mendelson, Assaf
<[email protected]<mailto:[email protected]>> wrote:
Hi,
I was wondering if dataframe is considered thread safe. I know the spark
session and spark context are thread safe (and actually have tools to manage
jobs from different threads) but the question is, can I use the same dataframe
in both threads.
The idea would be to create a dataframe in the main thread and then in two sub
threads do different transformations and actions on it.
I understand that some things might not be thread safe (e.g. if I unpersist in
one thread it would affect the other. Checkpointing would cause similar
issues), however, I can’t find any documentation as to what operations (if any)
are thread safe.
Thanks,
Assaf.