Hi Linar,

This seems useful, but perhaps reusing the same function name would be better?
http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame

Currently createDataFrame takes an RDD of any kind of SQL data representation (e.g. row, tuple, int, boolean, etc.), or a list, or a pandas.DataFrame. Perhaps we can support taking an RDD of pandas.DataFrame as the "data" arg too? What do other people think?

Li

On Sun, Jul 8, 2018 at 1:13 PM, Linar Savion <li...@jether-energy.com> wrote:

> We've created a snippet that creates a Spark DF from an RDD of many pandas
> DFs in a distributed manner that does not require the driver to collect the
> entire dataset.
>
> Early tests show a performance improvement of x6-x10 over using
> pandasDF->Rows->sparkDF.
>
> I've seen that there are some open pull requests that change the way arrow
> serialization works. Should I open a pull request to add this functionality
> to SparkSession? (`createFromPandasDataframesRDD`)
>
> https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5
>
> Thanks,
> Linar
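For readers following along, a rough sketch of the baseline path the quoted message compares against (pandasDF -> Rows -> sparkDF) might look like the following. This is an illustrative sketch, not the gist's Arrow-based implementation; the Spark calls are shown in comments because they need a live SparkSession, and only the pure-pandas row conversion is run here. The helper name `pandas_df_to_rows` is hypothetical.

```python
import pandas as pd

def pandas_df_to_rows(pdf):
    # Convert one pandas DataFrame into plain Python tuples -- one of the
    # row formats SparkSession.createDataFrame already accepts.
    # name=None makes itertuples yield ordinary tuples instead of namedtuples.
    return [tuple(row) for row in pdf.itertuples(index=False, name=None)]

# On a cluster, each element of the RDD would be a pandas DataFrame, and one
# would flatten it into rows before building the Spark DataFrame, e.g.:
#
#   rows_rdd = pandas_dfs_rdd.flatMap(pandas_df_to_rows)
#   spark_df = spark.createDataFrame(rows_rdd)
#
# This per-row conversion is the overhead the Arrow-based snippet avoids.

pdf = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
rows = pandas_df_to_rows(pdf)
```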