We've written a snippet that creates a Spark DataFrame from an RDD of many
pandas DataFrames in a distributed manner, without requiring the driver to
collect the entire dataset.

Early tests show a 6x-10x performance improvement over the
pandasDF->Rows->sparkDF path.
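
For context, the row-based baseline mentioned above can be sketched roughly as
follows. This is only an illustration of the slow path, not the code in the
gist. The function names `pandas_to_tuples` and
`create_from_pandas_rdd_via_rows` are hypothetical, and I'm assuming the RDD
elements are pandas DataFrames that all share the columns described by `schema`.

```python
from typing import TYPE_CHECKING, Any, Iterator, Tuple

if TYPE_CHECKING:
    # Imported for type hints only; the functions work on whatever is passed in.
    import pandas as pd
    from pyspark import RDD
    from pyspark.sql import DataFrame, SparkSession


def pandas_to_tuples(pdf: "pd.DataFrame") -> Iterator[Tuple[Any, ...]]:
    """Yield one plain tuple per row of a pandas DataFrame (index dropped)."""
    return pdf.itertuples(index=False, name=None)


def create_from_pandas_rdd_via_rows(
    spark: "SparkSession", pandas_rdd: "RDD", schema: Any
) -> "DataFrame":
    """Baseline path (hypothetical helper): pandasDF -> rows -> sparkDF.

    Each pandas DataFrame is exploded into row tuples on the executors,
    so nothing is collected to the driver, but every value is handled
    individually in Python, which is what makes this path slow.
    """
    rows = pandas_rdd.flatMap(pandas_to_tuples)
    return spark.createDataFrame(rows, schema=schema)
```

Usage would look something like
`create_from_pandas_rdd_via_rows(spark, pandas_rdd, "id long, name string")`,
where the schema string is just an example.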

I've seen that there are some open pull requests that change the way Arrow
serialization works. Should I open a pull request to add this functionality
to SparkSession (as `createFromPandasDataframesRDD`)?

https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5

Thanks,
Linar
