Re: [SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

2018-07-08 Thread Reynold Xin
Yes I would just reuse the same function.

On Sun, Jul 8, 2018 at 5:01 AM Li Jin wrote:
> Hi Linar,
>
> This seems useful. But perhaps reusing the same function name is better?
>
> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame
>
> Curren

Re: [SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

2018-07-08 Thread Li Jin
Hi Linar,

This seems useful. But perhaps reusing the same function name is better?

http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame

Currently createDataFrame takes an RDD of any kind of SQL data representation (e.g. row, tuple, int, boolean,

[SPARK][SQL] Distributed createDataframe from many pandas DFs using Arrow

2018-07-08 Thread Linar Savion
We've created a snippet that creates a Spark DF from an RDD of many pandas DFs in a distributed manner, without requiring the driver to collect the entire dataset. Early tests show a 6x-10x performance improvement over the pandasDF -> Rows -> sparkDF path. I've seen that there are some open pull req