This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 10d7acc81c5 [SPARK-43889][PYTHON] add check for column name for `__dir__()` to filter out illegal column name
10d7acc81c5 is described below

commit 10d7acc81c5cd1d14abccbe9fe8c600213fb6c30
Author: Beishao Cao <beishao....@databricks.com>
AuthorDate: Wed May 31 15:58:47 2023 +0900

    [SPARK-43889][PYTHON] add check for column name for `__dir__()` to filter out illegal column name
    
    ### What changes were proposed in this pull request?
    Add a check in `__dir__()` of `pyspark.sql.dataframe.DataFrame` to filter out illegal column names (e.g. `name?1`, `name 1`, `2name`).
    
    ### Why are the changes needed?
    1. `df.illegal_column_name` is not runnable (e.g. `df.name?1` raises an error).
    2. This way, `df.|` autocompletion won't suggest illegal names. This behavior is consistent with pandas.
    3. Supplement to 2: this behavior is not consistent with `getattr`; `getattr(df, 'column with space')` still works even though `df.column with space` does not. `dir()` can only be consistent with one of the two. Pandas keeps `dir()` consistent with dot notation, so we are choosing to conform with pandas, even though there is an argument for the other behavior.
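    The identifier rule and the `getattr` inconsistency above can be illustrated in plain Python, without Spark. The sample names below are the ones from this description; `Row` is a hypothetical stand-in class used only for the `getattr` demonstration:

    ```python
    # Plain-Python sketch (no Spark needed) of the identifier check the
    # patch applies to each column name.
    names = ["name?1", "name 1", "2name", "id", "i_like_pancakes"]

    # str.isidentifier() is the predicate used to filter column names.
    valid = [n for n in names if n.isidentifier()]
    print(valid)  # ['id', 'i_like_pancakes']

    # getattr still reaches attributes with illegal names, while dot
    # notation cannot; Row is a hypothetical stand-in class.
    class Row:
        pass

    r = Row()
    setattr(r, "name 1", 5)
    print(getattr(r, "name 1"))  # 5  (but `r.name 1` is a SyntaxError)
    ```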
    
    Example with this change:
    
    
https://github.com/apache/spark/assets/109033553/a3238b5a-53b6-4994-8f11-c804a5aab53b
    
    ### Does this PR introduce _any_ user-facing change?
    Will change the output of `dir(df)`. If the user calls the private method `df.__dir__()` directly, they will also notice a difference in its output and docstring.
    
    ### How was this patch tested?
    New doctest with three assertions. Output from running only this test:
    <img width="1052" alt="Screenshot 2023-05-30 at 11 12 04 AM" src="https://github.com/apache/spark/assets/109033553/c727631a-1028-4a24-a341-680e741cec3f">
    Also test in databricks notebook with mock code:
    
    ```python
    class DataFrameWithColAttrs(DataFrame):
        def __init__(self, df):
            # Rewrap an existing DataFrame so the overridden __dir__ applies.
            super().__init__(df._jdf, df._sql_ctx if df._sql_ctx else df._session)

        def __dir__(self):
            attrs = set(super().__dir__())
            # Only expose columns whose names are valid Python identifiers.
            attrs.update(filter(lambda s: s.isidentifier(), self.columns))
            return attrs
    ```
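    The same filtering logic can also be exercised without a Spark session. This is a minimal sketch; `FakeDataFrame` is a hypothetical stand-in whose `columns` attribute mimics `DataFrame.columns`:

    ```python
    # Spark-free sketch of the __dir__ filtering approach.
    class FakeDataFrame:
        def __init__(self, columns):
            # Mimics DataFrame.columns (list of column name strings).
            self.columns = columns

        def __dir__(self):
            attrs = set(super().__dir__())
            # Keep only column names that are valid Python identifiers.
            attrs.update(filter(lambda s: s.isidentifier(), self.columns))
            return sorted(attrs)

    df = FakeDataFrame(["id", "name 1", "2name"])
    print("id" in dir(df))      # True
    print("name 1" in dir(df))  # False
    ```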
    
    Closes #41393 from BeishaoCao-db/dir-CheckColumnName.
    
    Authored-by: Beishao Cao <beishao....@databricks.com>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 python/pyspark/sql/dataframe.py | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index d98f025c50c..12c445de21d 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -3062,9 +3062,16 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         >>> df = df.withColumn('id2', lit(3))
         >>> [attr for attr in dir(df) if attr[0] == 'i'][:7] # result includes id2 and sorted
         ['i_like_pancakes', 'id', 'id2', 'inputFiles', 'intersect', 'intersectAll', 'isEmpty']
+
+        Don't include columns that are not valid python identifiers.
+
+        >>> df = df.withColumn('1', lit(4))
+        >>> df = df.withColumn('name 1', lit(5))
+        >>> [attr for attr in dir(df) if attr[0] == 'i'][:7] # Doesn't include 1 or name 1
+        ['i_like_pancakes', 'id', 'id2', 'inputFiles', 'intersect', 'intersectAll', 'isEmpty']
         """
         attrs = set(super().__dir__())
-        attrs.update(self.columns)
+        attrs.update(filter(lambda s: s.isidentifier(), self.columns))
         return sorted(attrs)
 
     @overload


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
