Yes, I would just reuse the same function.
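For illustration, a sketch of the call shape that reuse would give. The overload is hypothetical -- createDataFrame does not accept an RDD of pandas.DataFrame today, and the toy data is made up:

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # An RDD whose elements are pandas DataFrames, e.g. one built per
    # partition by an upstream mapPartitions step (illustrative only).
    pdf_rdd = spark.sparkContext.parallelize([
        pd.DataFrame({"id": [1, 2], "value": [0.1, 0.2]}),
        pd.DataFrame({"id": [3, 4], "value": [0.3, 0.4]}),
    ])

    # Hypothetical: same entry point, new accepted input type.
    # df = spark.createDataFrame(pdf_rdd)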
On Sun, Jul 8, 2018 at 5:01 AM Li Jin <ice.xell...@gmail.com> wrote:

> Hi Linar,
>
> This seems useful. But perhaps reusing the same function name is better?
>
> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame
>
> Currently createDataFrame takes an RDD of any kind of SQL data
> representation (e.g. row, tuple, int, boolean, etc.), or a list, or a
> pandas.DataFrame.
>
> Perhaps we can support taking an RDD of *pandas.DataFrame* as the "data"
> arg too?
>
> What do other people think?
>
> Li
>
> On Sun, Jul 8, 2018 at 1:13 PM, Linar Savion <li...@jether-energy.com>
> wrote:
>
>> We've created a snippet that creates a Spark DF from an RDD of many
>> pandas DFs in a distributed manner that does not require the driver to
>> collect the entire dataset.
>>
>> Early tests show a performance improvement of 6-10x over using
>> pandasDF -> Rows -> sparkDF.
>>
>> I've seen that there are some open pull requests that change the way
>> Arrow serialization works. Should I open a pull request to add this
>> functionality to SparkSession? (`createFromPandasDataframesRDD`)
>>
>> https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5
>>
>> Thanks,
>> Linar
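For reference, a minimal sketch of the baseline path Linar benchmarks against (pandasDF -> Rows -> sparkDF): flatten each pandas DataFrame into plain rows on the executors, then rebuild a Spark DataFrame with an explicit schema. The schema and data are assumed for illustration; the gist's actual Arrow-based code is at the link above:

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, LongType, DoubleType

    spark = SparkSession.builder.getOrCreate()

    # Toy RDD of pandas DataFrames (illustrative only).
    pdf_rdd = spark.sparkContext.parallelize([
        pd.DataFrame({"id": [1, 2], "value": [0.1, 0.2]}),
        pd.DataFrame({"id": [3, 4], "value": [0.3, 0.4]}),
    ])

    schema = StructType([
        StructField("id", LongType()),
        StructField("value", DoubleType()),
    ])

    # Explode each pandas DataFrame into tuples of native Python scalars
    # on the executors (to_records(...).tolist() converts numpy types),
    # then build the Spark DataFrame. This is the row-by-row path that the
    # gist's Arrow-based approach reportedly beats by 6-10x.
    row_rdd = pdf_rdd.flatMap(lambda pdf: pdf.to_records(index=False).tolist())
    df = spark.createDataFrame(row_rdd, schema=schema)
    df.show()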