[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20114 Thanks @HyukjinKwon and @ueshin ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20114 Merged to master.
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20114 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85554/ Test PASSed.
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20114 Merged build finished. Test PASSed.
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20114 **[Test build #85554 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85554/testReport)** for PR 20114 at commit [`281ffdc`](https://github.com/apache/spark/commit/281ffdc9132829617af28dcb1668e2fa5eddc599). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20114 **[Test build #85554 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85554/testReport)** for PR 20114 at commit [`281ffdc`](https://github.com/apache/spark/commit/281ffdc9132829617af28dcb1668e2fa5eddc599).
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20114 retest this please
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20114 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85552/ Test FAILed.
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20114 Merged build finished. Test FAILed.
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20114 **[Test build #85552 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85552/testReport)** for PR 20114 at commit [`281ffdc`](https://github.com/apache/spark/commit/281ffdc9132829617af28dcb1668e2fa5eddc599). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20114 **[Test build #85552 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85552/testReport)** for PR 20114 at commit [`281ffdc`](https://github.com/apache/spark/commit/281ffdc9132829617af28dcb1668e2fa5eddc599).
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20114 Merged build finished. Test PASSed.
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20114 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85533/ Test PASSed.
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20114 **[Test build #85533 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85533/testReport)** for PR 20114 at commit [`25cf41c`](https://github.com/apache/spark/commit/25cf41c8ba804a7a6e8fbf9ebaf9498ce03fb063). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20114 The new workaround seems to be fine and I also added another test with array null values to test that along with all non-null values.
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20114 **[Test build #85533 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85533/testReport)** for PR 20114 at commit [`25cf41c`](https://github.com/apache/spark/commit/25cf41c8ba804a7a6e8fbf9ebaf9498ce03fb063).
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20114
> How about simply returning false from ArrowVectorAccessor.isNullAt(int rowId) when accessor.getValueCount() > 0 && accessor.getValidityBuffer().capacity() == 0
Good idea @ueshin, I think this should be fine as we are only querying the validity buffer in the call to `isNullAt`. I'll give it a try!
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20114 Merged build finished. Test PASSed.
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20114 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85505/ Test PASSed.
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20114 **[Test build #85505 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85505/testReport)** for PR 20114 at commit [`d2c5c2b`](https://github.com/apache/spark/commit/d2c5c2b4ea803ac8d1f08a5f79af1076f9e5bd2b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20114 **[Test build #85505 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85505/testReport)** for PR 20114 at commit [`d2c5c2b`](https://github.com/apache/spark/commit/d2c5c2b4ea803ac8d1f08a5f79af1076f9e5bd2b).
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20114 Jenkins, retest this please.
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20114 Merged build finished. Test FAILed.
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20114 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85499/ Test FAILed.
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20114 **[Test build #85499 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85499/testReport)** for PR 20114 at commit [`d2c5c2b`](https://github.com/apache/spark/commit/d2c5c2b4ea803ac8d1f08a5f79af1076f9e5bd2b). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20114 How about simply returning `false` from `ArrowVectorAccessor.isNullAt(int rowId)` when `accessor.getValueCount() > 0 && accessor.getValidityBuffer().capacity() == 0` without modifying the buffer?
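The suggestion above can be sketched in plain Python. This is only an illustration of the null-check logic, not the Spark or Arrow Java code: `ListVectorStub` and `is_null_at` are hypothetical names, and the validity buffer is modeled as a `bytes` object read as an Arrow-style bitmap (least-significant bit first).

```python
from dataclasses import dataclass


@dataclass
class ListVectorStub:
    """Hypothetical stand-in for a vector: a value count plus a validity bitmap."""
    value_count: int
    validity: bytes


def is_null_at(vector: ListVectorStub, row_id: int) -> bool:
    """Report whether row `row_id` is null, without modifying the buffer."""
    # Values are present but the validity buffer is empty:
    # treat every value as non-null.
    if vector.value_count > 0 and len(vector.validity) == 0:
        return False
    # Otherwise read the bit for this row; a cleared bit means null.
    bit = (vector.validity[row_id // 8] >> (row_id % 8)) & 1
    return bit == 0
```

Because the check happens on every call, it would also cover the case where new buffers are loaded for each batch, at the cost of one extra comparison per lookup.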
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20114

ping @ueshin @HyukjinKwon

Unfortunately, there was a bug in the Arrow 0.8.0 release on the Java side (https://issues.apache.org/jira/browse/ARROW-1948) that caused a problem here. I was able to find a workaround, but it required me to make a change to the `ArrowVectorAccessor` class. I'm not sure if this is something you would be ok putting in, or if you would prefer to wait until the next minor release to add the ArrayType support.

The issue was that the Arrow spec states that if the validity buffer is empty, then all the values are non-null. In Arrow 0.8.0, the C++/Python side started sending buffers this way, and the Arrow ListVector was not handling it properly, thinking instead that there were no valid values.

The workaround I added here checks whether the ListVector has a value count > 0 and an empty validity buffer. In that case all the values are non-null, so it allocates a new validity buffer with all bits set. For Arrow with non-UDFs (toPandas and createDataFrame) this only needs to be done once, but for UDFs each batch read will load new buffers into the Arrow VectorSchemaRoot, so it needs to be checked after each read. The simplest place to put the workaround to cover these cases was to allow `ArrowVectorAccessor.isNullAt(int rowId)` to be overridden.

Let me know what you guys think, thanks!
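As a rough illustration of the workaround described in this message, here is a plain-Python sketch of the bitmap handling. It does not use the Arrow or Spark APIs: `is_valid`, `all_set_bitmap`, and `ensure_validity` are hypothetical helper names, and the validity buffer is modeled as `bytes` with Arrow's least-significant-bit-first bit order.

```python
def is_valid(validity: bytes, index: int) -> bool:
    """Return True if bit `index` is set in an LSB-first validity bitmap."""
    return (validity[index // 8] >> (index % 8)) & 1 == 1


def all_set_bitmap(value_count: int) -> bytes:
    """Build a bitmap whose first `value_count` bits are set (all non-null)."""
    full_bytes, remainder = divmod(value_count, 8)
    bitmap = bytearray([0xFF] * full_bytes)
    if remainder:
        # Set only the low `remainder` bits of the final byte.
        bitmap.append((1 << remainder) - 1)
    return bytes(bitmap)


def ensure_validity(validity: bytes, value_count: int) -> bytes:
    """Replace an empty validity buffer with an all-set one when values exist."""
    if value_count > 0 and len(validity) == 0:
        return all_set_bitmap(value_count)
    return validity
```

In the UDF case described above, a check like `ensure_validity` would have to run after every batch read, since each read loads fresh buffers.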
[GitHub] spark issue #20114: [SPARK-22530][PYTHON][SQL] Adding Arrow support for Arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20114 **[Test build #85499 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85499/testReport)** for PR 20114 at commit [`d2c5c2b`](https://github.com/apache/spark/commit/d2c5c2b4ea803ac8d1f08a5f79af1076f9e5bd2b).