[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-22 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18659
  
LGTM, merging to master!

We can address the remaining minor comments in a follow-up, and open new PRs to 
remove the 0-parameter UDF and use the Arrow streaming protocol.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18659
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82053/
Test PASSed.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18659
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18659
  
**[Test build #82053 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82053/testReport)**
 for PR 18659 at commit 
[`b8ffa50`](https://github.com/apache/spark/commit/b8ffa50132d0290c0796fb99eb37fe010f56a182).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18659
  
**[Test build #82053 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82053/testReport)**
 for PR 18659 at commit 
[`b8ffa50`](https://github.com/apache/spark/commit/b8ffa50132d0290c0796fb99eb37fe010f56a182).





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18659
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18659
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82042/
Test PASSed.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18659
  
**[Test build #82042 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82042/testReport)**
 for PR 18659 at commit 
[`53926cc`](https://github.com/apache/spark/commit/53926cc56a6b4bceea2a02441e4d00a3a69b5289).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18659
  
**[Test build #82042 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82042/testReport)**
 for PR 18659 at commit 
[`53926cc`](https://github.com/apache/spark/commit/53926cc56a6b4bceea2a02441e4d00a3a69b5289).





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-21 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/18659
  
Thanks @ueshin , that works to allow the tests to pass. I do worry that it 
might cause some other issues and I would much prefer we upgrade Arrow to 
handle this, but I'll push this and we can discuss.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-20 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/18659
  
@BryanCutler Hmm, I'm not exactly sure why it doesn't work for you (or why 
mine works), but we can use `fillna(0)` before casting, like:

```
pa.Array.from_pandas(s.fillna(0).astype(t.to_pandas_dtype(), copy=False), 
mask=s.isnull(), type=t)
```
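
The one-liner above can be unpacked into two steps. This is only a minimal sketch, assuming a float Series with one missing value and a target type of int32; the helper names `fill_and_cast` and `to_arrow_array` are hypothetical, and pyarrow is imported lazily so that the pandas half runs without it.

```python
import pandas as pd


def fill_and_cast(s, pandas_dtype):
    """Return (values, mask): nulls replaced by 0 and cast, plus the null mask."""
    mask = s.isnull()                          # True where the value is missing
    values = s.fillna(0).astype(pandas_dtype)  # no NaN left, so the int cast succeeds
    return values, mask


def to_arrow_array(s, arrow_type):
    """Hypothetical wrapper mirroring the expression in the comment above."""
    import pyarrow as pa  # lazy import: only needed for the Arrow conversion
    values, mask = fill_and_cast(s, arrow_type.to_pandas_dtype())
    # The mask tells Arrow which slots are null, so the placeholder 0 is ignored.
    return pa.Array.from_pandas(values, mask=mask.to_numpy(), type=arrow_type)


s = pd.Series([1.0, None, 3.0])
values, mask = fill_and_cast(s, "int32")
print(values.tolist())  # [1, 0, 3]
print(mask.tolist())    # [False, True, False]
```

The placeholder `0` never reaches the reader of the Arrow array as long as the mask is taken from the original Series, before `fillna` is applied.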






[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-20 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/18659
  
@ueshin I haven't had much luck with the casting workaround:
```
pa.Array.from_pandas(s.astype(t.to_pandas_dtype(), copy=False), 
mask=s.isnull(), type=t)
```
It appears that it forces a copy for floating point -> integer and then 
checks for NaNs, so I get the error `ValueError: Cannot convert non-finite 
values (NA or inf) to integer`.  I'm using Pandas 0.20.1, but also tried 0.19.4 
with the same result. Any ideas?
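
The failure described here can be reproduced with pandas alone, without Arrow. This is a minimal sketch, assuming a float Series that contains a missing value; the exact message text may differ between pandas versions.

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])  # float64, the None is stored as NaN

try:
    s.astype("int64")  # NaN has no integer representation
    cast_failed = False
except ValueError as exc:
    cast_failed = True
    print("cast failed:", exc)
```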





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18659
  
OK, let's work around the type casting issue and discuss upgrading Arrow 
later.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18659
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18659
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81955/
Test FAILed.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18659
  
**[Test build #81955 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81955/testReport)**
 for PR 18659 at commit 
[`f451d65`](https://github.com/apache/spark/commit/f451d652a2656113cce1f0763e17c73ed2d03c44).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/18659
  
> what if users installed an older version of pyarrow? Shall we throw 
exception and ask them to upgrade, or work around type casting issue?

@cloud-fan , regarding problems that might come up when using different 
versions of Arrow, I think we should first decide on a minimum supported 
version; then maybe we could make that version of pyarrow a requirement for 
PySpark.  If we decide on 0.4.1, which we currently use, then we should 
probably work around the type casting issue and make sure this PR works with 
that version.
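
One way a minimum-version requirement could be enforced at import time is sketched below. It is an illustration only: `parse_version` and `require_minimum_version` are hypothetical helpers, not part of this PR, and in real use the installed version string would come from `pyarrow.__version__`.

```python
import re


def parse_version(version):
    """Turn '0.4.1' into (0, 4, 1); stop at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad so that '0.7' and '0.7.0' compare equal.
    return tuple(parts + [0] * (3 - len(parts)))


def require_minimum_version(installed, minimum):
    """Raise ImportError if the installed version is older than the minimum."""
    if parse_version(installed) < parse_version(minimum):
        raise ImportError(
            "pyarrow >= %s is required, but found %s" % (minimum, installed)
        )


require_minimum_version("0.7.0", "0.4.1")  # passes silently
try:
    require_minimum_version("0.4.1", "0.7.0")
except ImportError as exc:
    print(exc)  # pyarrow >= 0.7.0 is required, but found 0.4.1
```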





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/18659
  
Regarding the upgrade of Arrow, the concerns of #18974 are still valid - 
namely it has some risk and upgrading the Python side is a good amount of work 
that only a couple of people have the access to do.  Would it be better to 
discuss the upgrade strategy in another JIRA?






[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18659
  
**[Test build #81955 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81955/testReport)**
 for PR 18659 at commit 
[`f451d65`](https://github.com/apache/spark/commit/f451d652a2656113cce1f0763e17c73ed2d03c44).





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/18659
  
Thanks for the reviews @ueshin @viirya and @HyukjinKwon !  I updated with 
your comments





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18659
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81945/
Test FAILed.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18659
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18659
  
**[Test build #81945 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81945/testReport)**
 for PR 18659 at commit 
[`69112a5`](https://github.com/apache/spark/commit/69112a5a771bc3c98a7cd0b21ffda883d86c41a4).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `// enable memo iff we serialize the row with schema (schema and 
class should be memorized)`
  * `abstract class EvalPythonExec(udfs: Seq[PythonUDF], output: 
Seq[Attribute], child: SparkPlan)`





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18659
  
**[Test build #81945 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81945/testReport)**
 for PR 18659 at commit 
[`69112a5`](https://github.com/apache/spark/commit/69112a5a771bc3c98a7cd0b21ffda883d86c41a4).





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-18 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18659
  
what if users installed an older version of pyarrow? Shall we throw 
exception and ask them to upgrade, or work around type casting issue?





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-18 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/18659
  
@BryanCutler I'm OK with upgrading pyarrow to 0.7, apart from the same 
concerns as in #18974.
I guess we need to discuss the upgrade policy and strategy for pyarrow.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18659
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18659
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81899/
Test FAILed.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18659
  
**[Test build #81899 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81899/testReport)**
 for PR 18659 at commit 
[`d49a3db`](https://github.com/apache/spark/commit/d49a3dbbe4e5952e73259309424c201537971bb9).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-18 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/18659
  
@ueshin , the tests are all passing now when using pyarrow 0.7 (just 
released).  This added better support for type coercion in `Array.from_pandas` 
which makes handling null values a little easier.  What are your thoughts on 
upgrading?  We will probably want to do so to get some improvements for Decimal 
types anyway.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18659
  
**[Test build #81899 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81899/testReport)**
 for PR 18659 at commit 
[`d49a3db`](https://github.com/apache/spark/commit/d49a3dbbe4e5952e73259309424c201537971bb9).





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-18 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/18659
  
@BryanCutler I think it's okay to rename `size` to `length` (or to a longer 
name like `_length_`, to avoid name conflicts?).





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18659
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18659
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81834/
Test FAILed.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18659
  
**[Test build #81834 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81834/testReport)**
 for PR 18659 at commit 
[`25e3a71`](https://github.com/apache/spark/commit/25e3a715e990709043daaaf23e5de088418a83ee).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-15 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/18659
  
@ueshin , I merged your tests and added support for `**kwargs` to use 
"size" for 0-parameter UDFs.  Do you think it would be a little better to call 
it "length" or "output_length"?

I still need to fix the tests with null values; that shouldn't be a problem 
with the fix you provided.  I'll add that soon. Otherwise I believe the other 
tests are working, but we probably need to check chained UDFs as well.
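
As a rough illustration of the `**kwargs` idea: a function with no input columns presumably has nothing from which to read the batch length, so the caller hands it over as a keyword. This is a sketch only; `constant_one` and `call_zero_arg_udf` are hypothetical names rather than the PR's code, and the keyword is spelled `size` as in the comment above.

```python
import pandas as pd


def constant_one(**kwargs):
    """Hypothetical 0-parameter vectorized function."""
    size = kwargs["size"]  # batch length supplied by the caller
    return pd.Series([1] * size)


def call_zero_arg_udf(func, batch_length):
    """Stand-in for the worker: pass the batch length as the `size` keyword."""
    result = func(size=batch_length)
    if len(result) != batch_length:
        raise ValueError("expected %d rows, got %d" % (batch_length, len(result)))
    return result


out = call_zero_arg_udf(constant_one, 3)
print(out.tolist())  # [1, 1, 1]
```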





[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Python Vectorized UDFs

2017-09-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18659
  
**[Test build #81834 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81834/testReport)**
 for PR 18659 at commit 
[`25e3a71`](https://github.com/apache/spark/commit/25e3a715e990709043daaaf23e5de088418a83ee).

