Tian Gao created SPARK-54498:
--------------------------------

             Summary: `toInternal` and `fromInternal` in `types.py` have weird type annotations
                 Key: SPARK-54498
                 URL: https://issues.apache.org/jira/browse/SPARK-54498
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 4.2.0
            Reporter: Tian Gao


All `toInternal` and `fromInternal` implementations in `pyspark/sql/types.py` have type annotations that do not match their run-time behavior.

For example:
{code:python}
    def toInternal(self, dt: datetime.datetime) -> int:
        if dt is not None:
            seconds = calendar.timegm(dt.timetuple())
            return int(seconds) * 1000000 + dt.microsecond
{code}
Do we expect `dt` to be anything other than a `datetime.datetime`? What is the behavior when the user passes something that is not a `datetime.datetime`? This is a public API, so users can pass anything they want to it.
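
For illustration, a quick sketch of the current run-time behavior (the snippet above appears to be `TimestampNTZType.toInternal`; the exact class doesn't change the point):
{code:python}
import datetime

from pyspark.sql.types import TimestampNTZType

t = TimestampNTZType()

# Expected input: behaves as annotated.
t.toInternal(datetime.datetime(2024, 1, 1))  # 1704067200000000

# None: falls through the `if` and implicitly returns None,
# contradicting the `-> int` annotation.
t.toInternal(None)  # None

# Wrong type: fails with an AttributeError from inside the method
# rather than a clear TypeError at the boundary.
t.toInternal("2024-01-01")  # AttributeError: 'str' object has no attribute 'timetuple'
{code}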

As the code stands, it may raise an exception if the user passes something unexpected, or implicitly return `None` if the user passes `None`.

We should have a consistent standard across all data types: do we return `None` whenever we can't convert? Or do we only return `None` when the input is `None` (treating `None` as valid input) and raise an exception for any other type?

My suggestion (see the sketch below):
 * Only handle data that is exactly the expected type
 * Raise an exception if the type is wrong

That is unambiguous and matches the current type hints; otherwise we need to modify the type hints.
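
A minimal sketch of what that could look like for the snippet above (the class name and error message are illustrative, not a final design):
{code:python}
import calendar
import datetime

class TimestampNTZType:  # stand-in for the real class in pyspark/sql/types.py
    def toInternal(self, dt: datetime.datetime) -> int:
        # Accept only the annotated type; reject everything else
        # (including None) with a clear error instead of silently
        # returning None.
        if not isinstance(dt, datetime.datetime):
            raise TypeError(
                f"toInternal expects datetime.datetime, got {type(dt).__name__}"
            )
        seconds = calendar.timegm(dt.timetuple())
        return int(seconds) * 1000000 + dt.microsecond
{code}
One detail to settle is whether subclasses of the annotated type should be accepted (`isinstance`) or only the exact type (`type(dt) is datetime.datetime`).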

I'm not sure whether we have backward compatibility concerns here (e.g. do we really need to keep supporting `None` as an input?).

[~gurwls223] and [~podongfeng]?


