Hi all,

I am new to the Spark community. Please ignore if this question doesn't make 
sense.


My PySpark DataFrame takes only a fraction of the time (milliseconds) in the 
sort itself, but moving the data is much more expensive (> 14 seconds).


Explanation:

I have a huge collection of Arrow RecordBatches that is evenly distributed 
across my worker nodes' memories (in plasma_store). Currently, I collect all 
those RecordBatches back on my master node, merge them, and convert them to a 
single Spark DataFrame. Then I apply a sort to that DataFrame.
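
For concreteness, here is a minimal sketch of what I am doing now. The names 
object_ids, the /tmp/plasma socket path, and the column name are placeholders 
for my actual setup, and I am assuming the batches were written to plasma in 
the Arrow stream format:

import pyarrow as pa
import pyarrow.plasma as plasma
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Connect to the plasma_store and fetch every RecordBatch buffer.
# 'object_ids' and the socket path are placeholders for my real setup.
client = plasma.connect("/tmp/plasma")
buffers = client.get_buffers(object_ids)

# Deserialize each buffer (assuming Arrow stream format) and merge.
batches = []
for buf in buffers:
    reader = pa.RecordBatchStreamReader(pa.BufferReader(buf))
    batches.extend(reader.read_all().to_batches())
table = pa.Table.from_batches(batches)

# Convert the merged Table into a Spark DataFrame and sort it.
# The collect/merge above is the > 14 s step; the sort is milliseconds.
df = spark.createDataFrame(table.to_pandas())
sorted_df = df.sort("some_column")   # placeholder column name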


A Spark DataFrame is itself a cluster-distributed data collection.


So my question is:

Is it possible to create a Spark DataFrame directly from the Arrow RecordBatch 
collections that are already distributed across the worker nodes' memories, so 
that the data stays in the respective workers' memories (instead of being 
brought to the master node, merged, and then redistributed as a DataFrame)?
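
Roughly, this is what I am hoping is possible. This is only a sketch of the 
idea, not working code: it ships the plasma ObjectIDs (not the data) to the 
executors and reads each batch from the worker-local plasma_store. It ignores 
data locality (Spark would have to schedule each partition on the node that 
actually holds that ObjectID), which is exactly the part I do not know how to 
solve:

from pyspark.sql import Row

def read_local_batches(id_bytes_iter):
    # Runs on each executor: read RecordBatches from the *local* plasma_store.
    import pyarrow as pa
    import pyarrow.plasma as plasma
    client = plasma.connect("/tmp/plasma")   # worker-local socket (placeholder)
    for id_bytes in id_bytes_iter:
        buf = client.get_buffers([plasma.ObjectID(id_bytes)])[0]
        reader = pa.RecordBatchStreamReader(pa.BufferReader(buf))
        for rec in reader.read_all().to_pandas().to_dict("records"):
            yield Row(**rec)

# Ship only the 20-byte ObjectIDs to the executors, not the data itself,
# and build the DataFrame where the RecordBatches already live.
ids_binary = [oid.binary() for oid in object_ids]
rdd = spark.sparkContext.parallelize(ids_binary, numSlices=len(ids_binary))
df = rdd.mapPartitions(read_local_batches).toDF()
sorted_df = df.sort("some_column")   # placeholder column name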


Thanks!


Regards,
Tanveer Ahmad
