We've written a snippet that builds a Spark DataFrame from an RDD of many pandas DataFrames in a distributed manner, so the driver never has to collect the entire dataset.
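
To make the setup concrete: the input is an RDD whose elements are pandas DataFrames. Below is a minimal sketch of the straightforward row-based conversion of such an RDD (pandas DF -> rows -> Spark DF). It is not the code from the gist; the helper names `pandas_to_tuples` and `rows_baseline` are hypothetical, and the sketch assumes the caller supplies an explicit schema.

```python
import pandas as pd


def pandas_to_tuples(pdf):
    """Yield one plain tuple per row of a pandas DataFrame (index dropped)."""
    return pdf.itertuples(index=False, name=None)


def rows_baseline(spark, pandas_rdd, schema):
    """Row-based conversion (hypothetical helper, not part of Spark).

    spark      -- an existing SparkSession
    pandas_rdd -- an RDD whose elements are pandas DataFrames
    schema     -- a StructType or DDL string describing the columns
    """
    # Each pandas DataFrame is exploded into Python tuples on the executors,
    # so the driver never collects the data, but every value is still
    # converted row by row.
    row_rdd = pandas_rdd.flatMap(pandas_to_tuples)
    return spark.createDataFrame(row_rdd, schema)
```

With a live session this would be called as `rows_baseline(spark, pandas_rdd, "id long, name string")`, where the schema string is only an example.
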
Early tests show a 6x-10x performance improvement over the pandasDF -> Rows -> sparkDF path.

I've seen that there are some open pull requests that change the way Arrow serialization works. Should I open a pull request to add this functionality to SparkSession (`createFromPandasDataframesRDD`)?

https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5

Thanks,
Linar