Github user rednaxelafx commented on the issue: https://github.com/apache/spark/pull/20163

Thanks for all of your comments, @HyukjinKwon and @icexelloss! I'd like to wait for more discussion / suggestions on whether we want a behavior change that makes this reproducer work, or a simple documentation change that just says PySpark doesn't support a mismatching `returnType`.

Everything you both mentioned is correct. Sorry for the mess, I actually got myself confused... It's been a while since I first noticed the problem and came up with a patch, and when I picked it back up I'd lost some context about what exactly failed in the first place.

Both @HyukjinKwon and @icexelloss correctly pointed out that the bug only happens when the `udf()` creation declares a type that doesn't match what the UDF actually returns. In the reproducer, the declared UDF return type is the default `string` type, but the actual return values are `datetime.date` / `datetime.datetime`. Because of that mismatch, the value doesn't go through the corresponding conversion in `pyspark/sql/types.py`, so it reaches Pyrolite as a raw `datetime` object and gets unpickled as a `java.util.Calendar` on the JVM side.

A note on how I got here: the reason my current PR (incorrectly) contains the cases for `case (Calendar, DateType)` and friends is that initially I only had a reproducer for a Python UDF that actually returned `datetime.date` while using the default return type (`string`), and I had fixed it by introducing a case in `EvaluatePython.scala` that converts a `java.util.Calendar` to the appropriate type. At that time that case was certainly executed, and it gave me correct results for that single reproducer. I did notice that the cases where the UDF return type was correctly declared were already working.

But then I realized I also needed to tell `datetime.date` and `datetime.datetime` apart, and I can't do that with a single `case (Calendar, StringType)` in `EvaluatePython.scala` because I'm lacking the type information from the Python side at that point. So I went back to the Python side and thought I needed to handle `datetime.date` / `datetime.datetime` separately, but eventually they both ended up being handled by a plain `str(value)` coercion in a band-aid fix in `udf.py`. The version that @HyukjinKwon suggested above is indeed a proper version of that. At this point the new code in `EvaluatePython.scala` and friends is dead code.

To address a point from @icexelloss:

> The change that the PR proposes seem to be coercing python datetime.datetime and datetime.date to the python datetime string representation rather the java one.

The reason I used `str()` there is that Python and Spark SQL follow the same default string formats for dates (`yyyy-MM-dd`) and datetimes (`yyyy-MM-dd HH:mm:ss`), e.g.

```c
static PyObject *
date_isoformat(PyDateTime_Date *self)
{
    return PyUnicode_FromFormat("%04d-%02d-%02d",
                                GET_YEAR(self), GET_MONTH(self), GET_DAY(self));
}
```

and

```scala
// `SimpleDateFormat` is not thread-safe.
private val threadLocalDateFormat = new ThreadLocal[DateFormat] {
  override def initialValue(): SimpleDateFormat = {
    new SimpleDateFormat("yyyy-MM-dd", Locale.US)
  }
}
```
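To illustrate that point, here is a quick sketch (plain Python, no Spark needed) of what `str()` produces for these types; the outputs line up with the `yyyy-MM-dd` / `yyyy-MM-dd HH:mm:ss` patterns shown above. The specific values are illustrative only:

```python
import datetime

# str() on datetime.date uses the ISO format yyyy-MM-dd,
# matching Spark SQL's default date format above.
print(str(datetime.date(2018, 1, 5)))                   # 2018-01-05

# str() on datetime.datetime uses yyyy-MM-dd HH:mm:ss
# (fractional seconds are appended only when microsecond != 0).
print(str(datetime.datetime(2018, 1, 5, 12, 34, 56)))   # 2018-01-05 12:34:56
```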
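For completeness, a minimal, hypothetical sketch of the kind of declared-vs-actual type mismatch discussed in this thread (the `returnType` is left at its default `string` while the UDF actually returns a `datetime.date`); the names and values are illustrative and not taken from the PR:

```python
import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

# returnType is left at its default (StringType), but the function
# actually returns a datetime.date -- the declared/actual mismatch
# discussed above.
fixed_date_udf = udf(lambda: datetime.date(2018, 1, 5))

spark.range(1).select(fixed_date_udf().alias("d")).show()
```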