[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-22 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18659 LGTM, merging to master! We can address remaining minor comments in follow-up, and have new PRs to remove the 0-parameter UDF and use arrow streaming protocol. ---

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82053/ Test PASSed. ---

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #82053 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82053/testReport)** for PR 18659 at commit

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #82053 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82053/testReport)** for PR 18659 at commit

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82042/ Test PASSed. ---

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #82042 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82042/testReport)** for PR 18659 at commit

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #82042 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82042/testReport)** for PR 18659 at commit

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18659 Thanks @ueshin , that works to allow the tests to pass. I do worry that it might cause some other issues and I would much prefer we upgrade Arrow to handle this, but I'll push this and we can

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-20 Thread ueshin
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/18659 @BryanCutler Hmm, I'm not exactly sure the reason why it doesn't work (or mine works) but we can use `fillna(0)` before casting like: ```

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-20 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18659 @ueshin I haven't had much luck with the casting workaround: ``` pa.Array.from_pandas(s.astype(t.to_pandas_dtype(), copy=False), mask=s.isnull(), type=t) ``` It appears that it

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18659 ok let's work around the type casting issue and discuss arrow upgrading later. --- - To unsubscribe, e-mail:

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81955/ Test FAILed. ---

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81955 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81955/testReport)** for PR 18659 at commit

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18659 > what if users installed an older version of pyarrow? Shall we throw exception and ask them to upgrade, or work around type casting issue? @cloud-fan , in regards to handling of

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18659 Regarding the upgrade of Arrow, the concerns of #18974 are still valid - namely it has some risk and upgrading the Python side is a good amount of work that only a couple of people have the

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81955 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81955/testReport)** for PR 18659 at commit

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18659 Thanks for the reviews @ueshin @viirya and @HyukjinKwon ! I updated with your comments --- - To unsubscribe, e-mail:

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81945/ Test FAILed. ---

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81945 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81945/testReport)** for PR 18659 at commit

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81945 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81945/testReport)** for PR 18659 at commit

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-18 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18659 what if users installed an older version of pyarrow? Shall we throw exception and ask them to upgrade, or work around type casting issue? ---

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-18 Thread ueshin
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/18659 @BryanCutler I'm ok to upgrade pyarrow to 0.7 except for the same concerns as #18974. I guess we need to discuss upgrade policy and strategy of pyarrow. ---

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81899/ Test FAILed. ---

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81899 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81899/testReport)** for PR 18659 at commit

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-18 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18659 @ueshin , the tests are all passing now when using pyarrow 0.7 (just released). This added better support for type coercion in `Array.from_pandas` which makes handling null values a little

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81899/testReport)** for PR 18659 at commit

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-18 Thread ueshin
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/18659 @BryanCutler I think it's okay to rename `size` to `length` (or longer name to avoid name-conflict like `_length_`?). --- - To

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18659 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81834/ Test FAILed. ---

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-15 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81834 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81834/testReport)** for PR 18659 at commit

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-15 Thread BryanCutler
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18659 @ueshin , I merged your tests and added support for `**kwargs` to use "size" for 0-parameter UDFs. Do you think this might be a little better to be called "length" or "output_length"?

[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-15 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18659 **[Test build #81834 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81834/testReport)** for PR 18659 at commit