Hi there,


I believe I have found an error in the type hints for `DataFrame.drop` in
pyspark. The first overload at
https://github.com/apache/spark/blob/0bc38acc615ad411a97779c6a1ff43d4391c0c3d/python/pyspark/sql/dataframe.py#L5559-L5568
takes a single positional parameter rather than `*args`, and therefore
doesn't allow passing multiple `Column`s to `drop`. Additionally, according
to the Python docs
<https://docs.python.org/3/library/typing.html#typing.overload>, the type
hints on the non-`@overload`-decorated implementation should be ignored by a
type checker, so the fact that the final `drop` definition accepts a mix of
`str` field names and `Column` expressions is invisible to a type checker,
and such mixed calls should be rejected.
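
For reference, the overloads in question look roughly like this (a
standalone sketch paraphrasing the shape of the linked lines, not the
verbatim source):

    from typing import Union, overload

    class Column: ...  # stand-in for pyspark.sql.Column

    ColumnOrName = Union[Column, str]

    class DataFrame:
        @overload
        def drop(self, cols: ColumnOrName) -> "DataFrame": ...
        @overload
        def drop(self, *cols: str) -> "DataFrame": ...

        # The implementation's hints allow any mix of str and Column, but
        # per the typing docs a checker only considers the two @overload
        # signatures above.
        def drop(self, *cols: ColumnOrName) -> "DataFrame":
            return self

Against stubs shaped like these, a type checker such as mypy reports no
matching overload for calls like `drop(col_a, col_b)` or
`drop(col_a, "b")`.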
However, the code in both the connect
<https://github.com/apache/spark/blob/0bc38acc615ad411a97779c6a1ff43d4391c0c3d/python/pyspark/sql/connect/dataframe.py#L489-L496>
and classic
<https://github.com/apache/spark/blob/0bc38acc615ad411a97779c6a1ff43d4391c0c3d/python/pyspark/sql/classic/dataframe.py#L1712-L1720>
implementations doesn't appear to have any issue with mixing `Column`s with
`str`s, nor with multiple `Column`s, so I think the overloads here are
unnecessary altogether and the final declaration is sufficient as-is.
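
As a concrete illustration (assuming a local SparkSession; the data and
column names here are just made up for the example), both of the following
run fine at runtime even though the current overloads reject them:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2, 3)], ["a", "b", "c"])

    # Multiple Column arguments: fine at runtime, but neither overload
    # accepts more than one Column.
    df.drop(F.col("a"), F.col("b")).show()

    # Mixing a Column with a str name: also fine at runtime, but only the
    # (ignored) implementation signature permits it.
    df.drop(F.col("a"), "b").show()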


Is my understanding correct? Or is there something more that I'm missing as
to why `drop` is typed this way?


Thanks,

Olly
