Tian Gao created SPARK-54498:
--------------------------------
Summary: `toInternal` and `fromInternal` in `types.py` have weird
type annotations
Key: SPARK-54498
URL: https://issues.apache.org/jira/browse/SPARK-54498
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 4.2.0
Reporter: Tian Gao
All `toInternal` and `fromInternal` methods in `pyspark/sql/types.py` have really
weird type annotations and run-time behavior.
For example:
{code:python}
def toInternal(self, dt: datetime.datetime) -> int:
    if dt is not None:
        seconds = calendar.timegm(dt.timetuple())
        return int(seconds) * 1000000 + dt.microsecond
{code}
Do we expect `dt` to be anything other than `datetime.datetime`? What is the
behavior when the user passes something that is not a `datetime.datetime`? This is
a public API, so users can pass whatever they want to it.
Right now it can raise an exception if the user passes something weird, or
implicitly return `None` if the user passes `None`, even though the declared
return type is `int`.
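To make the current behavior concrete, here is a minimal, self-contained sketch
built on the simplified method quoted above (a stand-in class, not the real
`pyspark.sql.types.TimestampType`):
{code:python}
import calendar
import datetime


class TimestampTypeSketch:
    """Stand-in for pyspark.sql.types.TimestampType using the quoted method."""

    def toInternal(self, dt: datetime.datetime) -> int:
        if dt is not None:
            seconds = calendar.timegm(dt.timetuple())
            return int(seconds) * 1000000 + dt.microsecond


ts = TimestampTypeSketch()

# The annotated case: an int (microseconds since the epoch).
print(ts.toInternal(datetime.datetime(2024, 1, 1)))  # 1704067200000000

# None falls through the `if` and is implicitly returned,
# contradicting the declared `-> int` return type.
print(ts.toInternal(None))  # None

# Anything else raises, e.g.
# AttributeError: 'str' object has no attribute 'timetuple'
# ts.toInternal("2024-01-01")
{code}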
We should have a consistent standard across all data types: do we return `None`
whenever we can't convert? Or do we return `None` only when the input is `None`
(which would then be a valid input) and raise an exception for any other type?
My suggestion:
* Only handle data when it is exactly the type we expect
* Raise an exception if the type is wrong (see the sketch below)
That policy is super clear and matches the current type hints; anything else
would require modifying the type hints.
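A hypothetical sketch of that strict policy (shown as a free function for
brevity; the real method also takes `self`):
{code:python}
import calendar
import datetime


def toInternal(dt: datetime.datetime) -> int:
    # Strict policy sketch: accept exactly the annotated type and
    # raise for everything else, including None.
    if not isinstance(dt, datetime.datetime):
        raise TypeError(
            f"toInternal expects datetime.datetime, got {type(dt).__name__}"
        )
    seconds = calendar.timegm(dt.timetuple())
    return int(seconds) * 1000000 + dt.microsecond
{code}
With that, every code path either returns an `int` or raises, so the existing
annotations become accurate as written.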
Not sure if we have backward compatibility issues here (e.g. whether we really
need to keep supporting `None` as an input).
[~gurwls223] and [~podongfeng] ?