Hi there,


I believe I have found an error in the type hints for `DataFrame.drop` in
pyspark. The first overload at
https://github.com/apache/spark/blob/0bc38acc615ad411a97779c6a1ff43d4391c0c3d/python/pyspark/sql/dataframe.py#L5559-L5568
isn't declared as a `*args` parameter, and therefore doesn't allow passing
multiple `Column`s to `drop`. Additionally, according to the Python docs
<https://docs.python.org/3/library/typing.html#typing.overload>, the type
hints on the non-`@overload`-decorated implementation should be ignored by
a type checker, so the fact that the final `drop` definition accepts a mix
of `str` field names and `Column` expressions is invisible to a type
checker.
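
To make this concrete, here is a small sketch of the calls involved (the
session setup is just for illustration; the comments describe what I'd
expect a type checker to report under the current overloads):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ["a", "b", "c"])

# Accepted by the current overloads:
df.drop("a", "b")    # second overload: *cols: str
df.drop(F.col("a"))  # first overload: a single Column

# Flagged by a type checker under the current overloads, even though,
# as far as I can tell, both run fine against the implementations:
df.drop(F.col("a"), F.col("b"))  # multiple Columns
df.drop("a", F.col("b"))         # mixing str and Column
```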
However, neither the connect
<https://github.com/apache/spark/blob/0bc38acc615ad411a97779c6a1ff43d4391c0c3d/python/pyspark/sql/connect/dataframe.py#L489-L496>
nor the classic
<https://github.com/apache/spark/blob/0bc38acc615ad411a97779c6a1ff43d4391c0c3d/python/pyspark/sql/classic/dataframe.py#L1712-L1720>
implementation appears to have any issue with mixing `Column`s and `str`s,
or with multiple `Column`s, so I think the overloads here are unnecessary
altogether and the final declaration is sufficient as-is.
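
In other words, I'd expect dropping the two `@overload` stubs and keeping
only the implementation's signature to be enough; roughly (a sketch
paraphrasing the linked code, using pyspark's existing `ColumnOrName`
alias):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from pyspark.sql._typing import ColumnOrName  # Union[Column, str]


class DataFrame:
    # No @overload stubs: this single hint accepts any number of
    # str / Column arguments, in any mix, which matches the runtime
    # behaviour of both the connect and classic implementations.
    def drop(self, *cols: "ColumnOrName") -> "DataFrame":
        ...
```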


Is my understanding correct? Or is there something I'm missing as to why
`drop` is typed this way?


Thanks,

Olly
