[ https://issues.apache.org/jira/browse/SPARK-20791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250029#comment-16250029 ]
Apache Spark commented on SPARK-20791: -------------------------------------- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/19738 > Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame > ----------------------------------------------------------------------- > > Key: SPARK-20791 > URL: https://issues.apache.org/jira/browse/SPARK-20791 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL > Affects Versions: 2.1.1 > Reporter: Bryan Cutler > Assignee: Bryan Cutler > Fix For: 2.3.0 > > > The current code for creating a Spark DataFrame from a Pandas DataFrame uses > `to_records` to convert the DataFrame to a list of records and then converts > each record to a list. Following this, there are a number of calls to > serialize and transfer this data to the JVM. This process is very > inefficient and also discards all schema metadata, requiring another pass > over the data to infer types. > Using Apache Arrow, the Pandas DataFrame could be efficiently converted to > Arrow data and directly transferred to the JVM to create the Spark DataFrame. > The performance will be better and the Pandas schema will also be used so > that the correct types will be used. > Issues with the poor type inference have come up before, causing confusion > and frustration with users because it is not clear why it fails or doesn't > use the same type from Pandas. Fixing this with Apache Arrow will solve > another pain point for Python users and the following JIRAs could be closed: > * SPARK-17804 > * SPARK-18178 -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org