[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20163

+1
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20163

+1
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20163

One more SGTM
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20163

SGTM
Github user rednaxelafx commented on the issue: https://github.com/apache/spark/pull/20163

Given the above discussion, do we have consensus on all of the following?

- Update the documentation for PySpark UDFs to warn about the behavior of a mismatched declared `returnType` vs. the actual runtime return values.
- Make Python UDFs that declared `returnType` as `StringType` recognize `java.util.Calendar` and convert the value into a `null` (as in the example from @HyukjinKwon), essentially marking it as unconvertible.

I believe we all agree on the first point. The second point is in line with @icexelloss's opinion, with which I tend to agree in terms of API semantic consistency. It might not be as user-friendly as option 2 from @HyukjinKwon, but it's less magic and more consistent. I tend to find that more consistency leads to fewer surprises.

If we have consensus, then I'll update the JIRA ticket and this PR to reflect that.
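For readers following along, a minimal reproducer sketch of the mismatch under discussion (assuming a local `SparkSession`; the exact output depends on the Spark version and on which proposal is adopted):

```python
import datetime

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

# Declared returnType is the default StringType, but the function
# actually returns a datetime.date -- the mismatch under discussion.
date_udf = udf(lambda: datetime.date(2017, 10, 30))

# Under the second point above, the unconvertible value would surface
# as null instead of leaking through Pyrolite as a java.util.Calendar.
spark.range(1).select(date_udf()).show()
```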
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20163

Probably we should consider catching the error and setting nulls in `pandas_udf` if possible, to match the behaviour with `udf` ...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20163

I investigated the behavior differences between `udf` and `pandas_udf` for wrong return types and found there are actually many differences. Basically, `udf`s return `null` as @HyukjinKwon mentioned, whereas `pandas_udf`s throw some `ArrowException`. There seem to be some exceptions, though.
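A sketch of the kind of difference being described (illustrative only; the exact exception raised and the set of affected types vary by version):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.range(3)

# Plain Python UDF: a string returned where a double was declared
# is silently replaced with null on the JVM side.
plain = udf(lambda x: "not a double", DoubleType())
df.select(plain("id")).show()  # column of nulls

# Pandas UDF with the same mismatch: Arrow cannot build a double
# column from strings, so an exception surfaces at execution time.
@pandas_udf(DoubleType())
def vectorized(s):
    return s.astype(str)

df.select(vectorized("id")).show()  # raises instead of returning nulls
```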
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20163

The current behavior looks weird; we should either throw an exception and ask users to give a correct return type, or fix it via proposal 2.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20163

@cloud-fan, actually I have a similar question too - https://github.com/apache/spark/pull/20163#discussion_r160017637. I tend to agree with it, and I think we should disallow this and document it. Just want to check if you feel strongly about this. If we need to support this, I believe the ways are 2. or 3. in https://github.com/apache/spark/pull/20163#issuecomment-355717642.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20163

@ueshin @icexelloss @cloud-fan @rednaxelafx, which one would you prefer? To me, I like 1. the most. If the perf diff is trivial, 2. is also fine. If 3. works fine, I am also fine with it.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20163

Hey @rednaxelafx, that's fine. We all make mistakes, and I usually think it's always better than not trying. I also made a mistake the first time. It was easier to debug this with your comments and the details in the PR description. Thank you.

> I'd like to wait for more discussions / suggestions on whether or not we want a behavior change that makes this reproducer work, or a simple document change that'll just say PySpark doesn't support mismatching returnType.

So, a few options might be ...

1. Simply document this.
2. Add `str` logic in `types.StringType` - in this case, I think we should do a small benchmark (a rough sketch follows below). It wouldn't be so hard, and I think you could reuse the commands I used here - https://github.com/apache/spark/pull/19246#discussion_r139874732.
3. Investigate the way to register a custom Pyrolite unpickler that converts `datetime.date*` to `Timestamp` or `Date`. I believe we already have some custom fixes there.
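For option 2, a rough microbenchmark sketch along those lines (a pure-Python approximation of the extra per-value `str` call, not the actual worker path):

```python
import timeit

values = list(range(1000000))

# Baseline: StringType currently does no conversion, so values pass
# through untouched.
baseline = timeit.timeit(lambda: [v for v in values], number=10)

# Option 2 would add a str() call per value, as a toInternal() on
# StringType would do.
coerced = timeit.timeit(lambda: [str(v) for v in values], number=10)

print("baseline: %.2fs, with str(): %.2fs" % (baseline, coerced))
```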
Github user rednaxelafx commented on the issue: https://github.com/apache/spark/pull/20163

Thanks for all of your comments, @HyukjinKwon and @icexelloss! I'd like to wait for more discussions / suggestions on whether or not we want a behavior change that makes this reproducer work, or a simple document change that'll just say PySpark doesn't support a mismatching `returnType`.

All of what you guys mentioned is correct. Sorry for the mess, I actually got myself confused... It's been a while since I first noticed the problem and came up with a patch, and obviously when I picked it back up I'd lost some context as to what exactly failed in the first place...

Both @HyukjinKwon and @icexelloss correctly pointed out that the bug only happens when the `udf()` creation declared a mismatching type versus what it actually returns. In the reproducer, the declared UDF return type is the default `string` type but the actual return types were `datetime.date` / `datetime.datetime`. That followed the path of not going through conversion in `pyspark/sql/types.py`, so the value went through to Pyrolite and got unpickled as a `java.util.Calendar`.

A note on how I got here: the reason why my current PR (incorrectly) contained the cases for `case (Calendar, DateType)` and friends was that, initially, I only had a reproducer for a Python UDF actually returning `datetime.date` while using the default return type (`string`), and I had fixed it by introducing a case for converting from `java.util.Calendar` to the appropriate type in `EvaluatePython.scala`. At that time that case was certainly executed, and it did give me correct results for that single reproducer. I did notice that the cases where the UDF return type was correctly declared were working correctly.

But then I realized I also needed to tell apart `datetime.date` and `datetime.datetime`, and I can't do that in a single `case (Calendar, StringType)` in `EvaluatePython.scala` anymore because I'm lacking the type information from the Python side at that point. So I went back to the Python side and thought I needed to handle `datetime.date` / `datetime.datetime` separately, but eventually they both ended up being handled by just a `str(value)` coercion in a band-aid fix in `udf.py`. The version that @HyukjinKwon suggested above is indeed a proper version of that. At this point the new code in `EvaluatePython.scala` and friends is dead code.

To address a point from @icexelloss:

> The change that the PR proposes seems to be coercing Python `datetime.datetime` and `datetime.date` to the Python datetime string representation rather than the Java one.

The reason why I used `str()` there was that both Python and Spark SQL follow the same default string formats for date (`yyyy-MM-dd`) and datetime (`yyyy-MM-dd HH:mm:ss`), e.g.

```c
static PyObject *
date_isoformat(PyDateTime_Date *self)
{
    return PyUnicode_FromFormat("%04d-%02d-%02d",
                                GET_YEAR(self), GET_MONTH(self), GET_DAY(self));
}
```

and

```scala
// `SimpleDateFormat` is not thread-safe.
private val threadLocalDateFormat = new ThreadLocal[DateFormat] {
  override def initialValue(): SimpleDateFormat = {
    new SimpleDateFormat("yyyy-MM-dd", Locale.US)
  }
}
```
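As a quick check of that claim, `str()` on both types does produce exactly those formats (standard Python behavior):

```python
import datetime

# str() on date/datetime follows ISO 8601 with a space separator,
# matching Spark SQL's default "yyyy-MM-dd" / "yyyy-MM-dd HH:mm:ss".
assert str(datetime.date(2017, 10, 30)) == "2017-10-30"
assert str(datetime.datetime(2017, 10, 30)) == "2017-10-30 00:00:00"
```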
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20163

I ran some experiments:

```python
py_date = udf(datetime.date, DateType())
py_timestamp = udf(datetime.datetime, TimestampType())
```

This works correctly:

```python
spark.range(1).select(py_date(lit(2017), lit(10), lit(30))).show()
spark.range(1).select(py_timestamp(lit(2017), lit(10), lit(30))).show()
```

Result:

```
+------------------+
|date(2017, 10, 30)|
+------------------+
|        2017-10-30|
+------------------+

+----------------------+
|datetime(2017, 10, 30)|
+----------------------+
|   2017-10-30 00:00:00|
+----------------------+
```

The change that the PR proposes seems to be coercing Python `datetime.datetime` and `datetime.date` to the Python datetime string representation rather than the Java one. We could call the function `str` on the return value of the Python UDF if its declared return type is a string type to get the Python string representation, but this probably needs some microbenchmark to see the performance implication.
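A minimal sketch of that `str` coercion (a hypothetical wrapper, not the actual worker code; the microbenchmark concern above still applies):

```python
def wrap_string_udf(f):
    # Hypothetical helper: coerce non-None results to str when the
    # declared returnType is StringType, so datetime.date and
    # datetime.datetime serialize as their ISO-format strings.
    def wrapped(*args):
        result = f(*args)
        return str(result) if result is not None else None
    return wrapped
```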
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20163

The problem here seems to be that `returnType` is mismatched with the value. In the case of `DateType`, it needs an explicit conversion into integers:

https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L170-L171
https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L173-L175

which will be called via `worker.py`:

https://github.com/apache/spark/blob/64817c423c0d82a805abd69a3e166e5bfd79c739/python/pyspark/worker.py#L70-L74

If the `returnType` is `StringType`, then it doesn't need the conversion because Pyrolite and serialization work fine between them, up to my knowledge:

https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L141-L145
https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L76-L82

So, here:

https://github.com/apache/spark/blob/64817c423c0d82a805abd69a3e166e5bfd79c739/python/pyspark/worker.py#L70-L74

we will send the return values as they are, without conversion, which ends up with `datetime.date` -> `java.util.Calendar` as you described in the PR description. Therefore, I don't think the current fix in `EvaluatePython.scala` is reachable in the reproducer above. The fix on the Python side in `udf.py` is a band-aid fix. To deal with this problem correctly, I believe we should do something like:

```diff
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 146e673ae97..37137e02c08 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -144,6 +144,17 @@ class StringType(AtomicType):
 
     __metaclass__ = DataTypeSingleton
 
+    def needConversion(self):
+        return True
+
+    def toInternal(self, v):
+        if v is not None:
+            return str(v)
+
+    def fromInternal(self, v):
+        if v is not None:
+            return str(v)
+
```

but then this will bring a performance regression because `str` would have to be called on every value. This extra function call could cause a performance regression; for example, see both https://github.com/apache/spark/pull/19246 and https://github.com/apache/spark/pull/19249. I am less sure this is something we should allow. Can we simply document this, saying `returnType` should be compatible with the actual return value?
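For context, a condensed sketch of the `worker.py` path linked above (simplified, not the verbatim source):

```python
# Simplified from python/pyspark/worker.py: the UDF result is passed
# through toInternal() only when the return type needs conversion.
def wrap_udf(f, return_type):
    if return_type.needConversion():
        toInternal = return_type.toInternal
        return lambda *a: toInternal(f(*a))
    else:
        # StringType.needConversion() is currently False, so a
        # datetime.date returned here reaches Pyrolite unconverted
        # and gets unpickled as a java.util.Calendar on the JVM side.
        return lambda *a: f(*a)
```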
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20163

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85709/ Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20163

Merged build finished. Test PASSed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20163

**[Test build #85709 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85709/testReport)** for PR 20163 at commit [`ca026d3`](https://github.com/apache/spark/commit/ca026d31a489f1e0eb451fe85df97083659d0f67).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20163

Wait .. isn't this because we failed to call `toInternal` for the return type? Please give me a few days .. will double-check tonight.
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20163

I think scalar and grouped map UDFs expect a pandas Series of datetime64[ns] (the native pandas timestamp type) instead of a pandas Series of datetime.date or datetime.datetime objects. I don't think it's necessary for pandas UDFs to work with a pandas Series of datetime.date or datetime.datetime objects, as the standard timestamp type in pandas is datetime64[ns].
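For illustration, a scalar pandas UDF operating on that native representation (a minimal sketch; the function name is made up):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import TimestampType

# Scalar pandas UDF: receives and returns a pandas Series backed by
# datetime64[ns], the native pandas timestamp type mentioned above,
# rather than a Series of datetime.date/datetime.datetime objects.
@pandas_udf(TimestampType())
def add_one_day(ts):
    return ts + pd.Timedelta(days=1)
```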
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20163

LGTM. cc @ueshin @icexelloss: is this behavior consistent with pandas UDFs?
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20163

**[Test build #85709 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85709/testReport)** for PR 20163 at commit [`ca026d3`](https://github.com/apache/spark/commit/ca026d31a489f1e0eb451fe85df97083659d0f67).