[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-11 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/20163
  
+1


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-11 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20163
  
+1


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20163
  
One more SGTM


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-11 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20163
  
SGTM


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-10 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20163
  
Given the above discussion, do we have consensus on all of the following:
- Update the documentation for PySpark UDFs to warn about the behavior of a 
mismatch between the declared `returnType` and the actual runtime return values
- Make Python UDFs that declare `returnType` as `StringType` recognize 
`java.util.Calendar` and convert the value into `null` (as in the example 
from @HyukjinKwon), essentially marking it as unconvertible.

I believe we all agree on the first point. The second point is in 
line with @icexelloss's opinion, which I tend to agree with in terms of API 
semantic consistency. It might not be as user-friendly as Option 2 from 
@HyukjinKwon, but it's less magic and more consistent. I find that more 
consistency leads to fewer surprises.

If we have consensus then I'll update the JIRA ticket and this PR to 
reflect that.
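
For concreteness, here is what the second point would look like from the 
user's side. This is a hypothetical reproducer assembled from the discussion 
above, not code from the PR; it assumes an active `SparkSession` named `spark`:

```python
import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, lit

spark = SparkSession.builder.getOrCreate()

# returnType defaults to StringType, but the function actually returns
# datetime.date -- exactly the mismatch under discussion.
make_date = udf(lambda y, m, d: datetime.date(y, m, d))

# Under the second point above, the mismatched value would surface as null
# instead of an unpickled java.util.Calendar leaking through.
spark.range(1).select(make_date(lit(2017), lit(10), lit(30))).show()
```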


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20163
  
Probably we should consider catching errors and setting nulls in `pandas_udf`, 
if possible, to match the behaviour with `udf` ... 
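
One shape that idea could take, as a pure sketch (this is not Spark's code, 
and `to_arrow_or_nulls` is a hypothetical helper): convert each batch with 
Arrow as usual, and if the batch conversion fails, retry element-wise so that 
only the unconvertible values become nulls.

```python
import pyarrow as pa

def to_arrow_or_nulls(values, arrow_type):
    # Fast path: convert the whole batch at once.
    try:
        return pa.array(values, type=arrow_type)
    except Exception:
        # Slow path: convert element-wise so that only the unconvertible
        # values become nulls, mimicking how `udf` degrades.
        coerced = []
        for v in values:
            try:
                coerced.append(pa.array([v], type=arrow_type)[0].as_py())
            except Exception:
                coerced.append(None)
        return pa.array(coerced, type=arrow_type)
```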


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-09 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/20163
  
I investigated the behavior differences between `udf` and `pandas_udf` for 
wrong return types and found that there are actually many differences.
Basically, `udf`s return `null` as @HyukjinKwon mentioned, whereas 
`pandas_udf`s throw some `ArrowException`. There seem to be some exceptions, though.
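
To illustrate the difference being described here (a hypothetical session, 
not the actual tests from the investigation):

```python
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import IntegerType

# Both declare IntegerType but actually return strings.
wrong_udf = udf(lambda x: "not a number", IntegerType())

@pandas_udf(IntegerType())
def wrong_pandas_udf(s):
    return s.map(lambda _: "not a number")

# spark.range(1).select(wrong_udf("id")).show()         # typically yields null
# spark.range(1).select(wrong_pandas_udf("id")).show()  # typically fails with an ArrowException
```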


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-08 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20163
  
The current behavior looks weird; we should either throw an exception and ask 
users to give a correct return type, or fix it via proposal 2.


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20163
  
@cloud-fan, actually I have a similar question too - 
https://github.com/apache/spark/pull/20163#discussion_r160017637. I tend to 
agree with it, and I think we should disallow this and document it.

Just want to check if you feel strongly about this. If we need to support 
this, I believe the options are 2. or 3. in 
https://github.com/apache/spark/pull/20163#issuecomment-355717642.


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20163
  
@ueshin @icexelloss @cloud-fan @rednaxelafx, which one would you prefer?

To me, I like 1. the most. If the perf diff is trivial, 2. is also fine. If 
3. works fine, I think I am fine with that as well.


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20163
  
Hey @rednaxelafx, that's fine. We all make mistakes, and I think trying is 
always better than not trying. I also made a mistake my first time. Your 
comments and the details in the PR description made this easier to debug. 
Thank you.

> I'd like to wait for more discussions / suggestions on whether or not we 
> want a behavior change that makes this reproducer work, or a simple document 
> change that'll just say PySpark doesn't support mismatching returnType.

So, a few options might be ...

1. Simply document this

2. Add `str` logic in `types.StringType` - in this case, I think we should do a 
small benchmark (see the timing sketch after this list). It shouldn't be too 
hard, and I think you could reuse the commands I used here - 
https://github.com/apache/spark/pull/19246#discussion_r139874732

3. Investigate a way to register a custom Pyrolite unpickler that 
converts `datetime.date*` to `Timestamp` or `Date`. I believe we already have 
some custom fixes there.
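
For option 2, a minimal timing sketch of the kind of microbenchmark meant 
here (an assumed workload, not the commands from the linked PR):

```python
import timeit

values = ["x"] * 1_000_000

# Baseline: pass the values through untouched.
baseline = timeit.timeit(lambda: [v for v in values], number=10)

# Option 2 adds an extra str() call on every value.
with_str = timeit.timeit(lambda: [str(v) for v in values], number=10)

print("baseline: %.3fs, with str(): %.3fs" % (baseline, with_str))
```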


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-05 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20163
  
Thanks for all of your comments, @HyukjinKwon and @icexelloss!
I'd like to wait for more discussions / suggestions on whether or not we 
want a behavior change that makes this reproducer work, or a simple documentation 
change that'll just say PySpark doesn't support a mismatching `returnType`.

Everything you both mentioned is correct. Sorry for the mess; I actually 
got myself confused...
It's been a while since I first noticed the problem and came up with a 
patch, and when I picked it back up I had obviously lost some context as to what 
exactly failed in the first place...

Both @HyukjinKwon and @icexelloss correctly pointed out that the bug only 
happens when the `udf()` creation declares a type mismatching what the function 
actually returns. In the reproducer, the declared UDF return type is the 
default `string` type but the actual return types were `datetime.date` / 
`datetime.datetime`. Those values therefore skipped the conversion path in 
`pyspark/sql/types.py` and went straight through to Pyrolite, getting unpickled 
as `java.util.Calendar`.

A note on how I got here: the reason my current PR (incorrectly) 
contained the cases for `case (Calendar, DateType)` and friends is that, 
initially, I only had a reproducer for a Python UDF actually returning 
`datetime.date` while declaring the default return type (`string`), and I had 
fixed it by introducing a case in `EvaluatePython.scala` for converting from 
`java.util.Calendar` to the appropriate type. At that time, that case was 
certainly executed, and it did give me correct results for that single 
reproducer. I did notice that the cases where the UDF return type was correctly 
declared were working correctly.

But then I realized I also needed to tell apart `datetime.date` and 
`datetime.datetime`, and I can't do that in a single 
`case (Calendar, StringType)` in `EvaluatePython.scala` because I'm lacking the 
type information from the Python side at that point. So I went back to the 
Python side and thought I needed to handle `datetime.date` / `datetime.datetime` 
separately, but eventually both ended up being handled by just a `str(value)` 
coercion in a band-aid fix in `udf.py`. The version that @HyukjinKwon suggested 
above is indeed a proper version of that.
At this point, the new code in `EvaluatePython.scala` and friends is dead code.

To address a point from @icexelloss:

> The change that the PR proposes seem to be coercing python 
> datetime.datetime and datetime.date to the python datetime string 
> representation rather the java one.

The reason I used `str()` there is that both Python and Spark SQL 
follow the same default string formats for date (`yyyy-MM-dd`) and datetime 
(`yyyy-MM-dd HH:mm:ss`), e.g.
```c
static PyObject *
date_isoformat(PyDateTime_Date *self)
{
    return PyUnicode_FromFormat("%04d-%02d-%02d",
                                GET_YEAR(self), GET_MONTH(self),
                                GET_DAY(self));
}
```
and
```scala
// `SimpleDateFormat` is not thread-safe.
private val threadLocalDateFormat = new ThreadLocal[DateFormat] {
  override def initialValue(): SimpleDateFormat = {
    new SimpleDateFormat("yyyy-MM-dd", Locale.US)
  }
}
```
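
A quick check that the two defaults really do line up (this exercises 
standard Python behavior only; it is not code from the PR):

```python
import datetime

# Python's default str() output matches Spark SQL's yyyy-MM-dd and
# yyyy-MM-dd HH:mm:ss default formats:
print(str(datetime.date(2017, 10, 30)))      # 2017-10-30
print(str(datetime.datetime(2017, 10, 30)))  # 2017-10-30 00:00:00
```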


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-05 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/20163
  
I ran some experiments:
```
py_date = udf(datetime.date, DateType())
py_timestamp = udf(datetime.datetime, TimestampType())
```
This works correctly:
```
spark.range(1).select(py_date(lit(2017), lit(10), lit(30))).show()
spark.range(1).select(py_timestamp(lit(2017), lit(10), lit(30))).show()
```
Result:
```
+------------------+
|date(2017, 10, 30)|
+------------------+
|        2017-10-30|
+------------------+

+----------------------+
|datetime(2017, 10, 30)|
+----------------------+
|   2017-10-30 00:00:00|
+----------------------+
```

The change that the PR proposes seems to be coercing Python 
`datetime.datetime` and `datetime.date` to the Python datetime string 
representation rather than the Java one. We could call the `str` function on the 
return value of the Python udf if the declared type is a string type to get the 
Python string representation, but this probably needs some microbenchmarking to 
see the performance implication.
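
A sketch of what that `str` call could look like (an illustrative wrapper; 
`udf.py`'s actual structure differs, and `wrap_string_result` is a 
hypothetical name):

```python
def wrap_string_result(f):
    # Coerce the UDF's return value to str when the declared type is a
    # string type; None is passed through so nulls stay nulls.
    def wrapped(*args):
        result = f(*args)
        return str(result) if result is not None else None
    return wrapped
```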


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20163
  
The problem here seems to be that `returnType` is mismatched with the actual 
value. In the case of `DateType`, it needs an explicit conversion into integers:


https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L170-L171


https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L173-L175

which will be called via in `worker.py`


https://github.com/apache/spark/blob/64817c423c0d82a805abd69a3e166e5bfd79c739/python/pyspark/worker.py#L70-L74

If the `returnType` is `StringType`, then it doesn't need the conversion 
because, as far as I know, Pyrolite and serialization work fine between them:


https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L141-L145


https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L76-L82


So, here:



https://github.com/apache/spark/blob/64817c423c0d82a805abd69a3e166e5bfd79c739/python/pyspark/worker.py#L70-L74


we will send the return values as they are, without conversion, which ends up 
with `datetime.date` -> `java.util.Calendar` as you described in the PR 
description. Therefore, I don't think the current fix in `EvaluatePython.scala` 
is reachable in the reproducer above.
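
In other words, the worker-side logic amounts to something like this 
simplified sketch (not the exact code at `worker.py#L70-L74`):

```python
def maybe_convert(return_type, value):
    # StringType.needConversion() currently returns False, so a datetime.date
    # slips through unconverted here and Pyrolite later unpickles it on the
    # JVM side as a java.util.Calendar.
    if return_type.needConversion():
        return return_type.toInternal(value)
    return value
```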

As for the fix on the Python side in `udf.py`, it is a band-aid fix. To deal 
with this problem correctly, I believe we should do something like:

```diff
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 146e673ae97..37137e02c08 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -144,6 +144,17 @@ class StringType(AtomicType):

 __metaclass__ = DataTypeSingleton

+def needConversion(self):
+return True
+
+def toInternal(self, v):
+if v is not None:
+return str(v)
+
+def fromInternal(self, v):
+if v is not None:
+return str(v)
+
```

but then this will bring a performance regression, because `str` would have to 
be called on every value. For examples of how such an extra function call can 
affect performance, see both https://github.com/apache/spark/pull/19246 
and https://github.com/apache/spark/pull/19249.

I am not so sure this is something we should allow. Can we simply document 
this, saying `returnType` should be compatible with the actual return value?



---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20163
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85709/


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20163
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20163
  
**[Test build #85709 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85709/testReport)** for PR 20163 at commit [`ca026d3`](https://github.com/apache/spark/commit/ca026d31a489f1e0eb451fe85df97083659d0f67).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20163
  
Wait .. isn't this because we failed to call `toInternal` for the return 
type? Please give me a few days .. I will double-check tonight.


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-04 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/20163
  
I think scalar and grouped map UDFs expect a pandas Series of `datetime64[ns]` 
(the native pandas timestamp type) instead of a pandas Series of `datetime.date` 
or `datetime.datetime` objects. I don't think it's necessary for pandas UDFs to 
work with a pandas Series of `datetime.date` or `datetime.datetime` objects, as 
the standard timestamp type in pandas is `datetime64[ns]`.
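
For illustration, the expected pandas-native flow looks something like this 
(a hypothetical example, not from the PR):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import TimestampType

# A scalar pandas_udf receives and returns a pandas Series of datetime64[ns],
# not a Series of datetime.date / datetime.datetime objects.
@pandas_udf(TimestampType())
def add_one_day(ts):
    return ts + pd.Timedelta(days=1)
```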



---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20163
  
LGTM. cc @ueshin @icexelloss: is this behavior consistent with pandas UDF?


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20163
  
**[Test build #85709 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85709/testReport)** for PR 20163 at commit [`ca026d3`](https://github.com/apache/spark/commit/ca026d31a489f1e0eb451fe85df97083659d0f67).


---
