This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new 10d7acc81c5 [SPARK-43889][PYTHON] add check for column name for `__dir__()` to filter out illegal column name

10d7acc81c5 is described below

commit 10d7acc81c5cd1d14abccbe9fe8c600213fb6c30
Author: Beishao Cao <beishao....@databricks.com>
AuthorDate: Wed May 31 15:58:47 2023 +0900

    [SPARK-43889][PYTHON] add check for column name for `__dir__()` to filter out illegal column name

    ### What changes were proposed in this pull request?

    Add a check to `__dir__()` in `pyspark.sql.dataframe.DataFrame` to filter out illegal column names (e.g. `name?1`, `name 1`, `2name`).

    ### Why are the changes needed?

    1. `df.illegal_column_name` is not runnable (e.g. `df.name?1` raises an error).
    2. With this change, auto-completion on `df.` won't suggest those illegal names. This behavior is consistent with pandas.
    3. Supplement for 2: this behavior is not consistent with `getattr`; `getattr(df, 'column with space')` still works even though `df.column with space` does not. `dir()` can only be consistent with one of these. Pandas' behavior is to keep `dir()` consistent with dot notation, so we choose to conform with pandas, even though there is an argument for the other behavior.

    Example with this change: https://github.com/apache/spark/assets/109033553/a3238b5a-53b6-4994-8f11-c804a5aab53b

    ### Does this PR introduce _any_ user-facing change?

    It changes the output of `dir(df)`. Users who call the private method `df.__dir__()` directly will also notice an output and docstring difference there.

    ### How was this patch tested?

    New doctest with three assertions.
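As a standalone illustration of the filtering this change applies, the sketch below mimics the patched `__dir__()` with a plain Python class; `FakeFrame` and its `columns` attribute are stand-ins for illustration, not Spark APIs.

```python
# A minimal sketch of the __dir__() filtering introduced by this change:
# only column names that are valid Python identifiers are exposed to dir().
# FakeFrame is a hypothetical stand-in for pyspark.sql.dataframe.DataFrame.

class FakeFrame:
    def __init__(self, columns):
        self.columns = columns

    def __dir__(self):
        attrs = set(super().__dir__())
        # str.isidentifier() rejects names like "1", "name 1", "name?1"
        attrs.update(filter(lambda s: s.isidentifier(), self.columns))
        return sorted(attrs)

df = FakeFrame(["id", "name 1", "1", "i_like_pancakes"])
print([a for a in dir(df) if not a.startswith("_") and a != "columns"])
# → ['i_like_pancakes', 'id']
```

Note that, as described in point 3 above, `getattr`-style access would still work for the filtered names; only `dir()` (and thus tab completion) hides them.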
Output where I only ran this test:

<img width="1052" alt="Screenshot 2023-05-30 at 11 12 04 AM" src="https://github.com/apache/spark/assets/109033553/c727631a-1028-4a24-a341-680e741cec3f">

Also tested in a Databricks notebook with mock code:
```
class DataFrameWithColAttrs(DataFrame):
    def __init__(self, df):
        super().__init__(df._jdf, df._sql_ctx if df._sql_ctx else df._session)

    def __dir__(self):
        attrs = set(super().__dir__())
        attrs.update(filter(lambda s: s.isidentifier(), self.columns))
        return attrs
```

Closes #41393 from BeishaoCao-db/dir-CheckColumnName.

Authored-by: Beishao Cao <beishao....@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 python/pyspark/sql/dataframe.py | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index d98f025c50c..12c445de21d 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -3062,9 +3062,16 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         >>> df = df.withColumn('id2', lit(3))
         >>> [attr for attr in dir(df) if attr[0] == 'i'][:7] # result includes id2 and sorted
         ['i_like_pancakes', 'id', 'id2', 'inputFiles', 'intersect', 'intersectAll', 'isEmpty']
+
+        Don't include columns that are not valid python identifiers.
+
+        >>> df = df.withColumn('1', lit(4))
+        >>> df = df.withColumn('name 1', lit(5))
+        >>> [attr for attr in dir(df) if attr[0] == 'i'][:7] # Doesn't include 1 or name 1
+        ['i_like_pancakes', 'id', 'id2', 'inputFiles', 'intersect', 'intersectAll', 'isEmpty']
         """
         attrs = set(super().__dir__())
-        attrs.update(self.columns)
+        attrs.update(filter(lambda s: s.isidentifier(), self.columns))
         return sorted(attrs)

     @overload

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org