The RDD or DataFrame is distributed and partitioned by Spark so that it can use all your workers (CPUs) effectively. Every DataFrame operation therefore already runs in parallel, with each task working on one partition of the data. Why do you want to use threading here?

Thanks,
Sonal
Nube Technologies <http://www.nubetech.co>
<http://in.linkedin.com/in/sonalgoyal>

On Tue, Nov 12, 2019 at 7:18 AM Chang Chen <baibaic...@gmail.com> wrote:

> Hi all
>
> I have a case where I need to cache a source RDD and then create different
> DataFrames from it in different threads to accelerate queries.
>
> I know that SparkSession is thread safe
> (https://issues.apache.org/jira/browse/SPARK-15135), but I am not sure
> whether an RDD is thread safe or not.
>
> Thanks
> Chang