Yes, I would just reuse the same function.
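For illustration, a sketch of the call shape that reuse would give. The overload is hypothetical -- createDataFrame does not accept an RDD of pandas.DataFrame today, and the toy data is made up:

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # An RDD whose elements are pandas DataFrames, e.g. one built per
    # partition by an upstream mapPartitions step (illustrative only).
    pdf_rdd = spark.sparkContext.parallelize([
        pd.DataFrame({"id": [1, 2], "value": [0.1, 0.2]}),
        pd.DataFrame({"id": [3, 4], "value": [0.3, 0.4]}),
    ])

    # Hypothetical: same entry point, new accepted input type.
    # df = spark.createDataFrame(pdf_rdd)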
On Sun, Jul 8, 2018 at 5:01 AM Li Jin <ice.xell...@gmail.com> wrote:

> Hi Linar,
>
> This seems useful. But perhaps reusing the same function name is better?
>
> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame
>
> Currently createDataFrame takes an RDD of any kind of SQL data
> representation (e.g. row, tuple, int, boolean, etc.), or a list, or a
> pandas.DataFrame.
>
> Perhaps we can support taking an RDD of *pandas.DataFrame* as the "data"
> arg too?
>
> What do other people think?
>
> Li
>
> On Sun, Jul 8, 2018 at 1:13 PM, Linar Savion <li...@jether-energy.com>
> wrote:
>
>> We've created a snippet that creates a Spark DF from an RDD of many
>> pandas DFs in a distributed manner that does not require the driver to
>> collect the entire dataset.
>>
>> Early tests show a performance improvement of 6-10x over using
>> pandasDF -> Rows -> sparkDF.
>>
>> I've seen that there are some open pull requests that change the way
>> Arrow serialization works. Should I open a pull request to add this
>> functionality to SparkSession? (`createFromPandasDataframesRDD`)
>>
>> https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5
>>
>> Thanks,
>> Linar
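For reference, a minimal sketch of the baseline path Linar benchmarks against (pandasDF -> Rows -> sparkDF): flatten each pandas DataFrame into plain rows on the executors, then rebuild a Spark DataFrame with an explicit schema. The schema and data are assumed for illustration; the gist's actual Arrow-based code is at the link above:

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, LongType, DoubleType

    spark = SparkSession.builder.getOrCreate()

    # Toy RDD of pandas DataFrames (illustrative only).
    pdf_rdd = spark.sparkContext.parallelize([
        pd.DataFrame({"id": [1, 2], "value": [0.1, 0.2]}),
        pd.DataFrame({"id": [3, 4], "value": [0.3, 0.4]}),
    ])

    schema = StructType([
        StructField("id", LongType()),
        StructField("value", DoubleType()),
    ])

    # Explode each pandas DataFrame into tuples of native Python scalars
    # on the executors (to_records(...).tolist() converts numpy types),
    # then build the Spark DataFrame. This is the row-by-row path that the
    # gist's Arrow-based approach reportedly beats by 6-10x.
    row_rdd = pdf_rdd.flatMap(lambda pdf: pdf.to_records(index=False).tolist())
    df = spark.createDataFrame(row_rdd, schema=schema)
    df.show()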