[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22610
  
**[Test build #96935 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96935/testReport)**
 for PR 22610 at commit 
[`a756c0b`](https://github.com/apache/spark/commit/a756c0b40f74f35027d65a5c143bfa4b9f5f89fb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22610
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96935/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22610
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22610
  
**[Test build #96935 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96935/testReport)**
 for PR 22610 at commit 
[`a756c0b`](https://github.com/apache/spark/commit/a756c0b40f74f35027d65a5c143bfa4b9f5f89fb).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22610
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3675/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22610
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-04 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/22610
  
So I've added a bit document for this. @HyukjinKwon @BryanCutler please 
check it when you have time.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22610
  
One clear thing looks adding some documentation .. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-03 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/22610
  
Btw, I checked our `_minimum_pyarrow_version` is 0.8.0, so seems like even 
there is next upgrade available, for users with pyarrow versions before 0.11.0, 
this is still an potential issue. Isn't?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-03 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/22610
  
I think it is more reasonable to use the option when converting from Pandas 
to raise an error for unsafe casts. It should be better than to display warning 
message.

Not sure how long before next upgrade, do you think we should add some 
words into document to explain this pitfalls especially? Or we just leave it 
until next upgrade? @HyukjinKwon @BryanCutler 




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-03 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/22610
  
> Yeah, it's part of pyarrow.Array now, but will only be in the 0.11.0 
release so we would have to do it after the next upgrade.

Then I think we can wait for next upgrade to use this feature of 
pyarrow.Array and raise exception on unsafe cast.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-03 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/22610
  
> It is pretty new one, is it said we need to upgrade to latest PyArrow in 
order to use it? Since it is an option at Table.from_pandas, is it possible to 
extend it to pyarrow.Array?

Yeah, it's part of pyarrow.Array now, but will only be in the 0.11.0 
release so we would have to do it after the next upgrade.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-03 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/22610
  
> Thanks, @BryanCutler. WDYT about documenting the type map thing?

I think that would help in the cases of dates/times because those can get a 
little confusing. For primitives, I think it's pretty straightforward, so I 
don't know how much that would help. Maybe it we just highlight some potential 
pitfalls?

The problem here was that when a null value was introduced, Pandas 
automatically converted the data to float to insert a NaN value, then the Arrow 
conversion from float to bool is broken. When the data just had ints, the 
conversion seems ok, so it ended up giving inconsistent confusing results.  Not 
sure what might have helped here, it's just a nasty bug :)



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-03 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/22610
  
Thanks @BryanCutler! Looks like an useful option. It is pretty new one, is 
it said we need to upgrade to latest PyArrow in order to use it? Since it is an 
option at `Table.from_pandas`, is it possible to extend it to `pyarrow.Array`?  


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22610
  
Thanks, @BryanCutler. WDYT about documenting the type map thing?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-03 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/22610
  
So pyarrow just added an option when converting from Pandas to raise an 
error for unsafe casts. I'd have to try it out to see if it would prevent this 
case though.  It's a common option when working with Pandas, so users might be 
familiar with it and might be more useful to expose this as a Spark conf rather 
than checking the types.

Btw, I'm working on fixing the float to boolean conversion here 
https://github.com/apache/arrow/pull/2698


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-03 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/22610
  
@HyukjinKwon Thanks!

I agree that having document of this is definitely useful. I will try to 
add it and let's see if it is ok for you. I think it is good to mention that 
users are responsible for ensuring return type of Pandas UDF matches defined 
return type. The mapping is good reference to show in the document too.








---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22610
  
> he only other idea I have is to provide an option to raise an error if 
the type needs to be cast.

Actually sounds good to me.

I think the problem is we are not quite clear about when the type is 
mismatched in UDFs (see also https://github.com/apache/spark/pull/20163 for a 
reminder). IIRC, we rather roughly agreed upon documenting it (and allowing 
exact type match (?)).

@viirya and @BryanCutler, how about we document that return types should be 
matched (we can leave a chart or map referring (`types.to_arrow_type`)?

One additional improvement might be .. we describe that type casting 
behaviour is .. say .. not guaranteed but I am not sure how we can nicely 
document this. Probably only mentioning the type mapping is fine ..? 



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-02 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/22610
  
Thanks @BryanCutler! Yes, this should not be a bug but is used as a warning 
to users that there might be some type conversion they are not noticed at first 
glance on the Pandas UDFs. For now the conversion is silently done behind the 
scene and as the case in the JIRA shows it might not be easily noticed that 
Pandas.Series from UDFs isn't matched with defined UDFs' return types.





---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-02 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/22610
  
Thanks for looking into this @viirya !  I agree that there are lots of 
cases where casting to another type is intentional and works fine, so this 
isn't a bug. The only other idea I have is to provide an option to raise an 
error if the type needs to be cast. That might be possible with pyarrow right 
now, but I'm not sure how useful it would be.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22610
  
The idea sounds good to me from a cursory look for now.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22610
  
Yea, I will do this week. Sorry I missed the cc in the JIRA.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22610
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96853/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22610
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22610
  
**[Test build #96853 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96853/testReport)**
 for PR 22610 at commit 
[`c084e74`](https://github.com/apache/spark/commit/c084e745007d455a6ea99e10cc403b55ead6278d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-02 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/22610
  
cc @HyukjinKwon Can you take a look at this when you have time? Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22610
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3626/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22610
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22610
  
**[Test build #96853 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96853/testReport)**
 for PR 22610 at commit 
[`c084e74`](https://github.com/apache/spark/commit/c084e745007d455a6ea99e10cc403b55ead6278d).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22610
  
**[Test build #96849 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96849/testReport)**
 for PR 22610 at commit 
[`d206b7c`](https://github.com/apache/spark/commit/d206b7cf78f898e622f539a15e45515fcbd9e54a).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22610
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96849/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22610
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22610
  
**[Test build #96849 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96849/testReport)**
 for PR 22610 at commit 
[`d206b7c`](https://github.com/apache/spark/commit/d206b7cf78f898e622f539a15e45515fcbd9e54a).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22610
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22610
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3622/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org