[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18659 LGTM, merging to master! We can address remaining minor comments in follow-up, and have new PRs to remove the 0-parameter UDF and use arrow streaming protocol. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82053/ Test PASSed.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Merged build finished. Test PASSed.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #82053 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82053/testReport)** for PR 18659 at commit [`b8ffa50`](https://github.com/apache/spark/commit/b8ffa50132d0290c0796fb99eb37fe010f56a182). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #82053 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82053/testReport)** for PR 18659 at commit [`b8ffa50`](https://github.com/apache/spark/commit/b8ffa50132d0290c0796fb99eb37fe010f56a182).
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Merged build finished. Test PASSed.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82042/ Test PASSed.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #82042 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82042/testReport)** for PR 18659 at commit [`53926cc`](https://github.com/apache/spark/commit/53926cc56a6b4bceea2a02441e4d00a3a69b5289). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #82042 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82042/testReport)** for PR 18659 at commit [`53926cc`](https://github.com/apache/spark/commit/53926cc56a6b4bceea2a02441e4d00a3a69b5289).
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18659 Thanks @ueshin, that works to allow the tests to pass. I do worry that it might cause some other issues, and I would much prefer that we upgrade Arrow to handle this, but I'll push this and we can discuss.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/18659 @BryanCutler Hmm, I'm not exactly sure why it doesn't work for you (or why mine works), but we can use `fillna(0)` before casting, like:
```
pa.Array.from_pandas(s.fillna(0).astype(t.to_pandas_dtype(), copy=False), mask=s.isnull(), type=t)
```
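The `fillna(0)` workaround above can be tried in isolation. What follows is a minimal sketch, not the code from the PR: it assumes pandas and NumPy are installed, uses a made-up three-element series, and fixes the target type to 64-bit integers. The pyarrow step is guarded, since the behaviour of `Array.from_pandas` with `mask=` and `type=` may differ between pyarrow versions.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption): integer-valued floats with one null.
s = pd.Series([1.0, 2.0, np.nan])

# Casting directly fails because NaN has no integer representation.
try:
    s.astype(np.int64)
    direct_cast_ok = True
except ValueError:
    direct_cast_ok = False

# Workaround: record where the nulls are, replace them with a placeholder,
# then cast. The placeholder is meant to be hidden by the mask later on.
mask = s.isnull()
filled = s.fillna(0).astype(np.int64)

try:
    import pyarrow as pa

    # Masked positions become nulls in the resulting Arrow array.
    arr = pa.Array.from_pandas(filled, mask=mask.values, type=pa.int64())
    print(arr)
except ImportError:
    # pyarrow is optional for this sketch; the pandas part stands alone.
    pass
```

The placeholder `0` is arbitrary: any value that survives the integer cast would do, because the mask marks those slots as null.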
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18659 @ueshin I haven't had much luck with the casting workaround:
```
pa.Array.from_pandas(s.astype(t.to_pandas_dtype(), copy=False), mask=s.isnull(), type=t)
```
It appears that it forces a copy for floating point -> integer and then checks for any NaNs, so I get the error `ValueError: Cannot convert non-finite values (NA or inf) to integer`. I'm using Pandas 0.20.1, but I also tried 0.19.4 with the same result. Any ideas?
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18659 OK, let's work around the type casting issue and discuss upgrading Arrow later.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Merged build finished. Test FAILed.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81955/ Test FAILed.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81955 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81955/testReport)** for PR 18659 at commit [`f451d65`](https://github.com/apache/spark/commit/f451d652a2656113cce1f0763e17c73ed2d03c44). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18659
> what if users installed an older version of pyarrow? Shall we throw exception and ask them to upgrade, or work around type casting issue?

@cloud-fan, regarding problems that might come up when using different versions of Arrow, I think we should first decide on a minimum supported version; then maybe we could make that version of pyarrow a requirement for PySpark. If we decide to use 0.4.1, which we currently use, then we should probably work around the type casting issue and make sure this PR works with that version.
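The "minimum supported version" idea above could look roughly like the following. This is only a sketch under stated assumptions: `parse_version` and `require_minimum_version` are hypothetical helper names, not PySpark functions, and the default of `0.4.1` is simply the version mentioned in the comment. The installed version is passed in as a string so the sketch runs without pyarrow.

```python
def parse_version(version):
    """Turn a string such as '0.4.1' into a tuple such as (0, 4, 1).

    Parsing stops at the first dot-separated piece that does not start with
    a digit, so '0.7.0.dev1' becomes (0, 7, 0).
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


# Assumption: 0.4.1 is used only because the thread names it as the
# version in use at the time.
MINIMUM_PYARROW_VERSION = "0.4.1"


def require_minimum_version(installed, minimum=MINIMUM_PYARROW_VERSION):
    """Raise ImportError when `installed` is older than `minimum`."""
    if parse_version(installed) < parse_version(minimum):
        raise ImportError(
            "pyarrow >= %s must be installed; found %s" % (minimum, installed)
        )
```

In practice the `installed` argument would typically come from `pyarrow.__version__`; failing early with a clear message is one way to "ask them to upgrade" instead of surfacing a casting error later.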
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18659 Regarding the upgrade of Arrow, the concerns of #18974 are still valid: it carries some risk, and upgrading the Python side is a good amount of work that only a couple of people have the access to do. Would it be better to discuss the upgrade strategy in another JIRA?
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81955 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81955/testReport)** for PR 18659 at commit [`f451d65`](https://github.com/apache/spark/commit/f451d652a2656113cce1f0763e17c73ed2d03c44).
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18659 Thanks for the reviews @ueshin, @viirya and @HyukjinKwon! I updated the PR to address your comments.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81945/ Test FAILed.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Merged build finished. Test FAILed.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81945 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81945/testReport)** for PR 18659 at commit [`69112a5`](https://github.com/apache/spark/commit/69112a5a771bc3c98a7cd0b21ffda883d86c41a4). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `// enable memo iff we serialize the row with schema (schema and class should be memorized)` * `abstract class EvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], child: SparkPlan)`
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81945 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81945/testReport)** for PR 18659 at commit [`69112a5`](https://github.com/apache/spark/commit/69112a5a771bc3c98a7cd0b21ffda883d86c41a4).
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18659 what if users installed an older version of pyarrow? Shall we throw exception and ask them to upgrade, or work around type casting issue?
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/18659 @BryanCutler I'm OK with upgrading pyarrow to 0.7, apart from the same concerns as in #18974. I guess we need to discuss the upgrade policy and strategy for pyarrow.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Merged build finished. Test FAILed.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81899/ Test FAILed.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81899 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81899/testReport)** for PR 18659 at commit [`d49a3db`](https://github.com/apache/spark/commit/d49a3dbbe4e5952e73259309424c201537971bb9). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18659 @ueshin, the tests are all passing now when using pyarrow 0.7 (just released). That release added better support for type coercion in `Array.from_pandas`, which makes handling null values a little easier. What are your thoughts on upgrading? We will probably want to do so anyway to get some improvements for Decimal types.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81899/testReport)** for PR 18659 at commit [`d49a3db`](https://github.com/apache/spark/commit/d49a3dbbe4e5952e73259309424c201537971bb9).
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/18659 @BryanCutler I think it's okay to rename `size` to `length` (or to a longer name such as `_length_`, to avoid name conflicts?).
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Merged build finished. Test FAILed.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81834/ Test FAILed.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81834 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81834/testReport)** for PR 18659 at commit [`25e3a71`](https://github.com/apache/spark/commit/25e3a715e990709043daaaf23e5de088418a83ee). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18659 @ueshin, I merged your tests and added support for `**kwargs` so that 0-parameter UDFs can use "size". Do you think it might be a little better to call this "length" or "output_length"? I still need to fix the tests with null values; that shouldn't be a problem with the fix you provided, and I'll add it soon. Otherwise I believe the other tests are working, but we probably need to check chained UDFs as well.
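To illustrate the `**kwargs` idea for 0-parameter UDFs described above, here is a rough, dependency-free sketch. It is not the PR's implementation: `call_udf` is a hypothetical dispatcher, plain Python lists stand in for the `pandas.Series` batches a vectorized UDF would receive, and the keyword name `size` is taken from the comment (a rename was still under discussion).

```python
def call_udf(func, columns, size):
    """Hypothetical dispatcher for one batch of `size` rows.

    With input columns, they are passed positionally. With none, the UDF
    cannot infer the batch length from its inputs, so the length is passed
    through the `size` keyword instead.
    """
    if columns:
        result = func(*columns)
    else:
        result = func(size=size)
    # A vectorized UDF must return exactly one value per input row.
    if len(result) != size:
        raise ValueError(
            "UDF returned %d values for a batch of %d rows" % (len(result), size)
        )
    return result


def constant_one(**kwargs):
    # 0-parameter UDF: builds its output from the batch length alone.
    return [1] * kwargs["size"]


def plus_one(col):
    # Ordinary 1-parameter UDF: output length follows from the input.
    return [v + 1 for v in col]
```

For example, `call_udf(constant_one, [], 3)` yields a batch of three ones, while `call_udf(plus_one, [[1, 2, 3]], 3)` never touches `size` in the UDF itself.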
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81834 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81834/testReport)** for PR 18659 at commit [`25e3a71`](https://github.com/apache/spark/commit/25e3a715e990709043daaaf23e5de088418a83ee).