Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20163

The problem here seems to be that `returnType` is mismatched with the actual return value. In the case of `DateType`, an explicit conversion into integers is needed:

https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L170-L171
https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L173-L175

which will be called via `worker.py`:

https://github.com/apache/spark/blob/64817c423c0d82a805abd69a3e166e5bfd79c739/python/pyspark/worker.py#L70-L74

If the `returnType` is `StringType`, then the conversion is not needed because, to my knowledge, Pyrolite and serialization handle strings fine:

https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L141-L145
https://github.com/apache/spark/blob/1c9f95cb771ac78775a77edd1abfeb2d8ae2a124/python/pyspark/sql/types.py#L76-L82

So, here:

https://github.com/apache/spark/blob/64817c423c0d82a805abd69a3e166e5bfd79c739/python/pyspark/worker.py#L70-L74

we will send the return values as they are, without conversion, which ends up with `datetime.date` -> `java.util.Calendar` as you described in the PR description. Therefore, I don't think the current fix in `EvaluatePython.scala` is reachable in the reproducer above. As for the fix on the Python side in `udf.py`, that is a band-aid fix.
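For illustration, the `DateType` conversion linked above can be sketched in plain Python. The helper names below are mine, not Spark's; the logic mirrors what `DateType.toInternal`/`fromInternal` do (dates are stored internally as integer day counts since the Unix epoch):

```python
import datetime

# Ordinal of the Unix epoch; internal date values are day offsets from it.
EPOCH_ORDINAL = datetime.date(1970, 1, 1).toordinal()

def date_to_internal(d):
    # Sketch of DateType.toInternal: date -> int days since epoch.
    if d is not None:
        return d.toordinal() - EPOCH_ORDINAL

def date_from_internal(v):
    # Sketch of DateType.fromInternal: int days since epoch -> date.
    if v is not None:
        return datetime.date.fromordinal(v + EPOCH_ORDINAL)

d = datetime.date(2018, 1, 6)
print(date_to_internal(d))                      # an integer day count
print(date_from_internal(date_to_internal(d)))  # round-trips to 2018-01-06
```

This is why a UDF whose `returnType` is `DateType` works: `worker.py` applies `toInternal` and ships an integer, which the JVM side understands.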
To deal with this problem correctly, I believe we should do something like:

```diff
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index 146e673ae97..37137e02c08 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -144,6 +144,17 @@ class StringType(AtomicType):

     __metaclass__ = DataTypeSingleton

+    def needConversion(self):
+        return True
+
+    def toInternal(self, v):
+        if v is not None:
+            return str(v)
+
+    def fromInternal(self, v):
+        if v is not None:
+            return str(v)
+
```

but then this will bring a performance regression, because `str` would have to be called on every value. For the cost such extra function calls can incur, see both https://github.com/apache/spark/pull/19246 and https://github.com/apache/spark/pull/19249. I am less sure this is something we should allow. Can we simply document that `returnType` should be compatible with the actual return value?
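The effect of the coercion the diff proposes can be shown in isolation. A minimal sketch (the function name is mine, not Spark's) of what `StringType.toInternal` would do to a `datetime.date` leaking out of a UDF declared with `StringType`:

```python
import datetime

def to_internal_str(v):
    # Sketch of the proposed StringType.toInternal: coerce every
    # non-None value to str before serialization, so a datetime.date
    # becomes its ISO-8601 string instead of going through Pyrolite
    # unconverted and surfacing as java.util.Calendar on the JVM side.
    if v is not None:
        return str(v)

print(to_internal_str(datetime.date(2018, 1, 6)))  # -> 2018-01-06
print(to_internal_str(None))                       # None is passed through
```

Documenting the contract instead, as suggested above, would put this `str(...)` call inside the user's UDF body rather than on Spark's hot serialization path.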