Hi Linar,

This seems useful, but perhaps reusing the same function name would be better?
http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame

Currently createDataFrame takes an RDD of any kind of SQL data representation (e.g. row, tuple, int, boolean, etc.), or a list, or a pandas.DataFrame. Perhaps we can support taking an RDD of pandas.DataFrame as the "data" arg too? What do other people think?

Li

On Sun, Jul 8, 2018 at 1:13 PM, Linar Savion <li...@jether-energy.com> wrote:

> We've created a snippet that creates a Spark DF from an RDD of many pandas
> DFs in a distributed manner that does not require the driver to collect the
> entire dataset.
>
> Early tests show a performance improvement of x6-x10 over using
> pandasDF->Rows->sparkDF.
>
> I've seen that there are some open pull requests that change the way arrow
> serialization works. Should I open a pull request to add this functionality
> to SparkSession? (`createFromPandasDataframesRDD`)
>
> https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5
>
> Thanks,
> Linar
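For readers following along, a rough sketch of the baseline path the quoted message compares against (pandasDF -> Rows -> sparkDF) might look like the following. This is an illustrative sketch, not the gist's Arrow-based implementation; the Spark calls are shown in comments because they need a live SparkSession, and only the pure-pandas row conversion is run here. The helper name `pandas_df_to_rows` is hypothetical.

```python
import pandas as pd

def pandas_df_to_rows(pdf):
    # Convert one pandas DataFrame into plain Python tuples -- one of the
    # row formats SparkSession.createDataFrame already accepts.
    # name=None makes itertuples yield ordinary tuples instead of namedtuples.
    return [tuple(row) for row in pdf.itertuples(index=False, name=None)]

# On a cluster, each element of the RDD would be a pandas DataFrame, and one
# would flatten it into rows before building the Spark DataFrame, e.g.:
#
#   rows_rdd = pandas_dfs_rdd.flatMap(pandas_df_to_rows)
#   spark_df = spark.createDataFrame(rows_rdd)
#
# This per-row conversion is the overhead the Arrow-based snippet avoids.

pdf = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
rows = pandas_df_to_rows(pdf)
```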