spark git commit: [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas
Repository: spark
Updated Branches: refs/heads/master 2bfd5accd -> d03aebbe6

[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas

## What changes were proposed in this pull request?

Integrate Apache Arrow with Spark to increase the performance of `DataFrame.toPandas`. This is done by using Arrow to convert data partitions on the executor JVM into Arrow payload byte arrays, which are then served to the Python process. The Python DataFrame collects the Arrow payloads, combines them, and converts them to a pandas DataFrame. All data types except complex, date, timestamp, and decimal types are currently supported; for unsupported types an `UnsupportedOperation` exception is thrown.

Additions to Spark include a Scala package-private method `Dataset.toArrowPayload` that converts data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served, and a package-private class/object `ArrowConverters` that provides data type mappings and conversion routines. In Python, a private method `DataFrame._collectAsArrow` is added to collect Arrow payloads, and the SQLConf "spark.sql.execution.arrow.enable" can be used to enable Arrow in `toPandas()` (the old conversion is used by default).

## How was this patch tested?

Added a new test suite `ArrowConvertersSuite` that runs tests on conversion of Datasets to Arrow payloads for supported types. The suite generates a Dataset and matching Arrow JSON data; the Dataset is then converted to an Arrow payload and validated against the JSON data, ensuring that the schema and data have been converted correctly. Added PySpark tests to verify that the `toPandas` method produces equal DataFrames with and without pyarrow, and a roundtrip test to ensure the pandas DataFrame produced by PySpark is equal to one made directly with pandas.

Authors: Bryan Cutler, Li Jin, Wes McKinney

Closes #18459 from BryanCutler/toPandas_with_arrow-SPARK-13534.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d03aebbe
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d03aebbe
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d03aebbe
Branch: refs/heads/master
Commit: d03aebbe6508ba441dc87f9546f27aeb27553d77
Parents: 2bfd5ac
Author: Bryan Cutler
Authored: Mon Jul 10 15:21:03 2017 -0700
Committer: Holden Karau
Committed: Mon Jul 10 15:21:03 2017 -0700

--
 bin/pyspark                                      |    2 +-
 dev/deps/spark-deps-hadoop-2.6                   |    5 +
 dev/deps/spark-deps-hadoop-2.7                   |    5 +
 pom.xml                                          |   20 +
 python/pyspark/serializers.py                    |   17 +
 python/pyspark/sql/dataframe.py                  |   48 +-
 python/pyspark/sql/tests.py                      |   78 +-
 .../org/apache/spark/sql/internal/SQLConf.scala  |   22 +
 sql/core/pom.xml                                 |    4 +
 .../scala/org/apache/spark/sql/Dataset.scala     |   20 +
 .../sql/execution/arrow/ArrowConverters.scala    |  429 ++
 .../execution/arrow/ArrowConvertersSuite.scala   | 1222 ++
 12 files changed, 1859 insertions(+), 13 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/d03aebbe/bin/pyspark
--
diff --git a/bin/pyspark b/bin/pyspark
index d3b512e..dd28627 100755
--- a/bin/pyspark
+++ b/bin/pyspark
@@ -68,7 +68,7 @@ if [[ -n "$SPARK_TESTING" ]]; then
   unset YARN_CONF_DIR
   unset HADOOP_CONF_DIR
   export PYTHONHASHSEED=0
-  exec "$PYSPARK_DRIVER_PYTHON" -m "$1"
+  exec "$PYSPARK_DRIVER_PYTHON" -m "$@"
   exit
 fi

http://git-wip-us.apache.org/repos/asf/spark/blob/d03aebbe/dev/deps/spark-deps-hadoop-2.6
--
diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6
index c132531..1a6515b 100644
--- a/dev/deps/spark-deps-hadoop-2.6
+++ b/dev/deps/spark-deps-hadoop-2.6
@@ -13,6 +13,9 @@ apacheds-kerberos-codec-2.0.0-M15.jar
 api-asn1-api-1.0.0-M20.jar
 api-util-1.0.0-M20.jar
 arpack_combined_all-0.1.jar
+arrow-format-0.4.0.jar
+arrow-memory-0.4.0.jar
+arrow-vector-0.4.0.jar
 avro-1.7.7.jar
 avro-ipc-1.7.7.jar
 avro-mapred-1.7.7-hadoop2.jar
@@ -55,6 +58,7 @@ datanucleus-core-3.2.10.jar
 datanucleus-rdbms-3.2.9.jar
 derby-10.12.1.1.jar
 eigenbase-properties-1.1.5.jar
+flatbuffers-1.2.0-3f79e055.jar
 gson-2.2.4.jar
 guava-14.0.1.jar
 guice-3.0.jar
@@ -77,6 +81,7 @@ hadoop-yarn-server-web-proxy-2.6.5.jar
 hk2-api-2.4.0-b34.jar
 hk2-locator-2.4.0-b34.jar
spark git commit: [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas
Repository: spark
Updated Branches: refs/heads/master 58434acdd -> e44697606

[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas

## What changes were proposed in this pull request?

Integrate Apache Arrow with Spark to increase the performance of `DataFrame.toPandas`. This is done by using Arrow to convert data partitions on the executor JVM into Arrow payload byte arrays, which are then served to the Python process. The Python DataFrame collects the Arrow payloads, combines them, and converts them to a pandas DataFrame. All non-complex data types are currently supported; for unsupported types an `UnsupportedOperation` exception is thrown.

Additions to Spark include a Scala package-private method `Dataset.toArrowPayloadBytes` that converts data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served, and a package-private class/object `ArrowConverters` that provides data type mappings and conversion routines. In Python, a public method `DataFrame.collectAsArrow` is added to collect Arrow payloads, and an optional flag in `toPandas(useArrow=False)` enables using Arrow (the old conversion is used by default).

## How was this patch tested?

Added a new test suite `ArrowConvertersSuite` that runs tests on conversion of Datasets to Arrow payloads for supported types. The suite generates a Dataset and matching Arrow JSON data; the Dataset is then converted to an Arrow payload and validated against the JSON data, ensuring that the schema and data have been converted correctly. Added PySpark tests to verify that the `toPandas` method produces equal DataFrames with and without pyarrow, and a roundtrip test to ensure the pandas DataFrame produced by PySpark is equal to one made directly with pandas.

Authors: Bryan Cutler, Li Jin, Wes McKinney

Closes #15821 from BryanCutler/wip-toPandas_with_arrow-SPARK-13534.
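The roundtrip testing strategy described above can be sketched with plain pandas, assuming only pandas is installed. This is an illustration of the test pattern, not the actual PySpark test suite: the same data is materialized through two paths and the resulting DataFrames are asserted equal. In the real tests one side would be `toPandas()` with Arrow enabled and the other the default conversion (or a DataFrame built directly with pandas).

```python
import pandas as pd
from pandas.testing import assert_frame_equal

data = [(1, "a"), (2, "b"), (3, "c")]

# Path 1: DataFrame built directly with pandas.
expected = pd.DataFrame(data, columns=["id", "label"])

# Path 2: stand-in for the Arrow-backed toPandas() result; here we just
# build the frame through a second pandas constructor.
via_records = pd.DataFrame.from_records(data, columns=["id", "label"])

# Raises AssertionError with a detailed diff on any mismatch in values,
# dtypes, index, or column names.
assert_frame_equal(expected, via_records)
print("roundtrip DataFrames are equal")
```

`assert_frame_equal` is stricter than `==` comparison because it also checks dtypes and index equality, which is exactly what a serialization roundtrip test needs to catch silent type coercions.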
Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e4469760
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e4469760
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e4469760
Branch: refs/heads/master
Commit: e44697606f429b01808c1a22cb44cb5b89585c5c
Parents: 58434ac
Author: Bryan Cutler
Authored: Fri Jun 23 09:01:13 2017 +0800
Committer: Wenchen Fan
Committed: Fri Jun 23 09:01:13 2017 +0800

--
 bin/pyspark                                      |    2 +-
 dev/deps/spark-deps-hadoop-2.6                   |    5 +
 dev/deps/spark-deps-hadoop-2.7                   |    5 +
 dev/run-pip-tests                                |    6 +
 pom.xml                                          |   20 +
 python/pyspark/serializers.py                    |   17 +
 python/pyspark/sql/dataframe.py                  |   48 +-
 python/pyspark/sql/tests.py                      |   79 +-
 .../org/apache/spark/sql/internal/SQLConf.scala  |   22 +
 sql/core/pom.xml                                 |    4 +
 .../scala/org/apache/spark/sql/Dataset.scala     |   20 +
 .../sql/execution/arrow/ArrowConverters.scala    |  429 ++
 .../execution/arrow/ArrowConvertersSuite.scala   | 1222 ++
 13 files changed, 1866 insertions(+), 13 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/e4469760/bin/pyspark
--
diff --git a/bin/pyspark b/bin/pyspark
index 98387c2..8eeea77 100755
--- a/bin/pyspark
+++ b/bin/pyspark
@@ -68,7 +68,7 @@ if [[ -n "$SPARK_TESTING" ]]; then
   unset YARN_CONF_DIR
   unset HADOOP_CONF_DIR
   export PYTHONHASHSEED=0
-  exec "$PYSPARK_DRIVER_PYTHON" -m "$1"
+  exec "$PYSPARK_DRIVER_PYTHON" -m "$@"
   exit
 fi

http://git-wip-us.apache.org/repos/asf/spark/blob/e4469760/dev/deps/spark-deps-hadoop-2.6
--
diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6
index 9287bd4..9868c1a 100644
--- a/dev/deps/spark-deps-hadoop-2.6
+++ b/dev/deps/spark-deps-hadoop-2.6
@@ -13,6 +13,9 @@ apacheds-kerberos-codec-2.0.0-M15.jar
 api-asn1-api-1.0.0-M20.jar
 api-util-1.0.0-M20.jar
 arpack_combined_all-0.1.jar
+arrow-format-0.4.0.jar
+arrow-memory-0.4.0.jar
+arrow-vector-0.4.0.jar
 avro-1.7.7.jar
 avro-ipc-1.7.7.jar
 avro-mapred-1.7.7-hadoop2.jar
@@ -55,6 +58,7 @@ datanucleus-core-3.2.10.jar
 datanucleus-rdbms-3.2.9.jar
 derby-10.12.1.1.jar
 eigenbase-properties-1.1.5.jar
+flatbuffers-1.2.0-3f79e055.jar
 gson-2.2.4.jar
 guava-14.0.1.jar
 guice-3.0.jar
@@ -77,6 +81,7 @@ hadoop-yarn-server-web-proxy-2.6.5.jar
 hk2-api-2.4.0-b34.jar
 hk2-locator-2.4.0-b34.jar