[ https://issues.apache.org/jira/browse/SPARK-20791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Li Jin updated SPARK-20791:
---------------------------
    Issue Type: Sub-task  (was: New Feature)
        Parent: SPARK-22216

> Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame
> -----------------------------------------------------------------------
>
>                 Key: SPARK-20791
>                 URL: https://issues.apache.org/jira/browse/SPARK-20791
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 2.1.1
>            Reporter: Bryan Cutler
>
> The current code for creating a Spark DataFrame from a Pandas DataFrame uses
> `to_records` to convert the DataFrame to a list of records and then converts
> each record to a list. Following this, there are a number of calls to
> serialize and transfer this data to the JVM. This process is very
> inefficient and also discards all schema metadata, requiring another pass
> over the data to infer types.
> Using Apache Arrow, the Pandas DataFrame could be efficiently converted to
> Arrow data and directly transferred to the JVM to create the Spark DataFrame.
> The performance will be better and the Pandas schema will also be used so
> that the correct types will be used.
> Issues with the poor type inference have come up before, causing confusion
> and frustration with users because it is not clear why it fails or doesn't
> use the same type from Pandas. Fixing this with Apache Arrow will solve
> another pain point for Python users and the following JIRAs could be closed:
> * SPARK-17804
> * SPARK-18178

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)