[SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

Linar Savion Sun, 08 Jul 2018 04:13:14 -0700

We've created a snippet that builds a Spark DataFrame from an RDD of many
pandas DataFrames in a distributed manner, so the driver does not need to
collect the entire dataset.


Early tests show a 6x-10x performance improvement over the
pandasDF->Rows->sparkDF path.

I've seen that there are some open pull requests that change the way Arrow
serialization works. Should I open a pull request to add this functionality
to SparkSession (`createFromPandasDataframesRDD`)?

https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5

Thanks,
Linar
