spark git commit: [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas

2017-07-10 Thread holden
Repository: spark
Updated Branches:
  refs/heads/master 2bfd5accd -> d03aebbe6


[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of 
DataFrame.toPandas

## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase the performance of
`DataFrame.toPandas`.  This is done by using Arrow to convert data
partitions on the executor JVM to Arrow payload byte arrays, which are then
served to the Python process.  The Python DataFrame can then collect the
Arrow payloads, combine them, and convert them to a Pandas DataFrame.  All
data types except complex, date, timestamp, and decimal are currently
supported; otherwise an `UnsupportedOperation` exception is thrown.
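
For illustration, here is a minimal sketch of the Python-side combination
step, using pyarrow's public API rather than the exact internal helpers the
patch uses (the function name is hypothetical):

```python
# Illustrative sketch: combine collected Arrow record batches into a single
# pandas DataFrame, roughly the Python-side step described above.
import pyarrow as pa

def batches_to_pandas(batches):
    """batches: a list of pyarrow.RecordBatch objects sharing one schema."""
    table = pa.Table.from_batches(batches)  # concatenate without copying data
    return table.to_pandas()                # materialize as a pandas DataFrame
```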

Additions to Spark include a Scala package-private method
`Dataset.toArrowPayload` that converts data partitions in the executor JVM
to `ArrowPayload`s as byte arrays so they can be easily served, and a
package-private class/object `ArrowConverters` that provides data type
mappings and conversion routines.  In Python, a private method
`DataFrame._collectAsArrow` is added to collect Arrow payloads, and the
SQLConf "spark.sql.execution.arrow.enable" can be set to make `toPandas()`
use Arrow (the old conversion is used by default).
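
As a hedged usage sketch (the example DataFrame is illustrative; only the
conf name and `toPandas()` come from this patch):

```python
# Usage sketch: opt in to the Arrow-based conversion via the SQLConf above.
spark.conf.set("spark.sql.execution.arrow.enable", "true")

df = spark.range(1 << 20).selectExpr("id", "rand() AS value")  # example data
pdf = df.toPandas()  # now collected as Arrow payloads, not the old row path
```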

## How was this patch tested?
Added a new test suite, `ArrowConvertersSuite`, that runs tests on the
conversion of Datasets to Arrow payloads for supported types.  The suite
generates a Dataset and matching Arrow JSON data, converts the Dataset to an
Arrow payload, and finally validates the payload against the JSON data.
This ensures that the schema and data have been converted correctly.

Added PySpark tests to verify that the `toPandas` method produces equal
DataFrames with and without pyarrow.  A roundtrip test ensures that the
pandas DataFrame produced by pyspark is equal to one made directly with
pandas.
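
A minimal sketch of that equality check, assuming a running `spark` session
and an example DataFrame `df` (pandas' public testing helper stands in for
the suite's actual assertions):

```python
# Sketch of the with/without-Arrow comparison described above.
from pandas.testing import assert_frame_equal

spark.conf.set("spark.sql.execution.arrow.enable", "false")
pdf_plain = df.toPandas()   # old conversion path

spark.conf.set("spark.sql.execution.arrow.enable", "true")
pdf_arrow = df.toPandas()   # Arrow-backed path

assert_frame_equal(pdf_plain, pdf_arrow)  # results must match either way
```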

Author: Bryan Cutler 
Author: Li Jin 
Author: Li Jin 
Author: Wes McKinney 

Closes #18459 from BryanCutler/toPandas_with_arrow-SPARK-13534.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d03aebbe
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d03aebbe
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d03aebbe

Branch: refs/heads/master
Commit: d03aebbe6508ba441dc87f9546f27aeb27553d77
Parents: 2bfd5ac
Author: Bryan Cutler 
Authored: Mon Jul 10 15:21:03 2017 -0700
Committer: Holden Karau 
Committed: Mon Jul 10 15:21:03 2017 -0700

--
 bin/pyspark |2 +-
 dev/deps/spark-deps-hadoop-2.6  |5 +
 dev/deps/spark-deps-hadoop-2.7  |5 +
 pom.xml |   20 +
 python/pyspark/serializers.py   |   17 +
 python/pyspark/sql/dataframe.py |   48 +-
 python/pyspark/sql/tests.py |   78 +-
 .../org/apache/spark/sql/internal/SQLConf.scala |   22 +
 sql/core/pom.xml|4 +
 .../scala/org/apache/spark/sql/Dataset.scala|   20 +
 .../sql/execution/arrow/ArrowConverters.scala   |  429 ++
 .../execution/arrow/ArrowConvertersSuite.scala  | 1222 ++
 12 files changed, 1859 insertions(+), 13 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d03aebbe/bin/pyspark
--
diff --git a/bin/pyspark b/bin/pyspark
index d3b512e..dd28627 100755
--- a/bin/pyspark
+++ b/bin/pyspark
@@ -68,7 +68,7 @@ if [[ -n "$SPARK_TESTING" ]]; then
   unset YARN_CONF_DIR
   unset HADOOP_CONF_DIR
   export PYTHONHASHSEED=0
-  exec "$PYSPARK_DRIVER_PYTHON" -m "$1"
+  exec "$PYSPARK_DRIVER_PYTHON" -m "$@"
   exit
 fi
 

http://git-wip-us.apache.org/repos/asf/spark/blob/d03aebbe/dev/deps/spark-deps-hadoop-2.6
--
diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6
index c132531..1a6515b 100644
--- a/dev/deps/spark-deps-hadoop-2.6
+++ b/dev/deps/spark-deps-hadoop-2.6
@@ -13,6 +13,9 @@ apacheds-kerberos-codec-2.0.0-M15.jar
 api-asn1-api-1.0.0-M20.jar
 api-util-1.0.0-M20.jar
 arpack_combined_all-0.1.jar
+arrow-format-0.4.0.jar
+arrow-memory-0.4.0.jar
+arrow-vector-0.4.0.jar
 avro-1.7.7.jar
 avro-ipc-1.7.7.jar
 avro-mapred-1.7.7-hadoop2.jar
@@ -55,6 +58,7 @@ datanucleus-core-3.2.10.jar
 datanucleus-rdbms-3.2.9.jar
 derby-10.12.1.1.jar
 eigenbase-properties-1.1.5.jar
+flatbuffers-1.2.0-3f79e055.jar
 gson-2.2.4.jar
 guava-14.0.1.jar
 guice-3.0.jar
@@ -77,6 +81,7 @@ hadoop-yarn-server-web-proxy-2.6.5.jar
 hk2-api-2.4.0-b34.jar
 hk2-locator-2.4.0-b34.jar
 

spark git commit: [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas

2017-06-22 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 58434acdd -> e44697606


[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of 
DataFrame.toPandas

## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase the performance of
`DataFrame.toPandas`.  This is done by using Arrow to convert data
partitions on the executor JVM to Arrow payload byte arrays, which are then
served to the Python process.  The Python DataFrame can then collect the
Arrow payloads, combine them, and convert them to a Pandas DataFrame.  All
non-complex data types are currently supported; otherwise an
`UnsupportedOperation` exception is thrown.

Additions to Spark include a Scala package-private method
`Dataset.toArrowPayloadBytes` that converts data partitions in the executor
JVM to `ArrowPayload`s as byte arrays so they can be easily served, and a
package-private class/object `ArrowConverters` that provides data type
mappings and conversion routines.  In Python, a public method
`DataFrame.collectAsArrow` is added to collect Arrow payloads, and an
optional flag in `toPandas(useArrow=False)` enables using Arrow (the old
conversion is used by default).
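
Hedged sketch of the per-call opt-in added by this commit, using only the
names given in the text above (the follow-up commit earlier in this thread
replaces the flag with a SQLConf):

```python
# Sketch of this commit's per-call API, as described above.
pdf = df.toPandas(useArrow=True)  # Arrow path; the default useArrow=False
                                  # keeps the old conversion
batches = df.collectAsArrow()     # public method returning Arrow payloads
```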

## How was this patch tested?
Added a new test suite, `ArrowConvertersSuite`, that runs tests on the
conversion of Datasets to Arrow payloads for supported types.  The suite
generates a Dataset and matching Arrow JSON data, converts the Dataset to an
Arrow payload, and finally validates the payload against the JSON data.
This ensures that the schema and data have been converted correctly.

Added PySpark tests to verify that the `toPandas` method produces equal
DataFrames with and without pyarrow.  A roundtrip test ensures that the
pandas DataFrame produced by pyspark is equal to one made directly with
pandas.

Author: Bryan Cutler 
Author: Li Jin 
Author: Li Jin 
Author: Wes McKinney 

Closes #15821 from BryanCutler/wip-toPandas_with_arrow-SPARK-13534.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e4469760
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e4469760
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e4469760

Branch: refs/heads/master
Commit: e44697606f429b01808c1a22cb44cb5b89585c5c
Parents: 58434ac
Author: Bryan Cutler 
Authored: Fri Jun 23 09:01:13 2017 +0800
Committer: Wenchen Fan 
Committed: Fri Jun 23 09:01:13 2017 +0800

--
 bin/pyspark |2 +-
 dev/deps/spark-deps-hadoop-2.6  |5 +
 dev/deps/spark-deps-hadoop-2.7  |5 +
 dev/run-pip-tests   |6 +
 pom.xml |   20 +
 python/pyspark/serializers.py   |   17 +
 python/pyspark/sql/dataframe.py |   48 +-
 python/pyspark/sql/tests.py |   79 +-
 .../org/apache/spark/sql/internal/SQLConf.scala |   22 +
 sql/core/pom.xml|4 +
 .../scala/org/apache/spark/sql/Dataset.scala|   20 +
 .../sql/execution/arrow/ArrowConverters.scala   |  429 ++
 .../execution/arrow/ArrowConvertersSuite.scala  | 1222 ++
 13 files changed, 1866 insertions(+), 13 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e4469760/bin/pyspark
--
diff --git a/bin/pyspark b/bin/pyspark
index 98387c2..8eeea77 100755
--- a/bin/pyspark
+++ b/bin/pyspark
@@ -68,7 +68,7 @@ if [[ -n "$SPARK_TESTING" ]]; then
   unset YARN_CONF_DIR
   unset HADOOP_CONF_DIR
   export PYTHONHASHSEED=0
-  exec "$PYSPARK_DRIVER_PYTHON" -m "$1"
+  exec "$PYSPARK_DRIVER_PYTHON" -m "$@"
   exit
 fi
 

http://git-wip-us.apache.org/repos/asf/spark/blob/e4469760/dev/deps/spark-deps-hadoop-2.6
--
diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6
index 9287bd4..9868c1a 100644
--- a/dev/deps/spark-deps-hadoop-2.6
+++ b/dev/deps/spark-deps-hadoop-2.6
@@ -13,6 +13,9 @@ apacheds-kerberos-codec-2.0.0-M15.jar
 api-asn1-api-1.0.0-M20.jar
 api-util-1.0.0-M20.jar
 arpack_combined_all-0.1.jar
+arrow-format-0.4.0.jar
+arrow-memory-0.4.0.jar
+arrow-vector-0.4.0.jar
 avro-1.7.7.jar
 avro-ipc-1.7.7.jar
 avro-mapred-1.7.7-hadoop2.jar
@@ -55,6 +58,7 @@ datanucleus-core-3.2.10.jar
 datanucleus-rdbms-3.2.9.jar
 derby-10.12.1.1.jar
 eigenbase-properties-1.1.5.jar
+flatbuffers-1.2.0-3f79e055.jar
 gson-2.2.4.jar
 guava-14.0.1.jar
 guice-3.0.jar
@@ -77,6 +81,7 @@ hadoop-yarn-server-web-proxy-2.6.5.jar
 hk2-api-2.4.0-b34.jar
 hk2-locator-2.4.0-b34.jar