This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new dbedb81ed39  [SPARK-42194][PS] Allow `columns` parameter when creating DataFrame with Series
dbedb81ed39 is described below

commit dbedb81ed39ca5561a1907260b84fa8dd96ea825
Author: itholic <haejoon....@databricks.com>
AuthorDate: Sun Jan 29 11:34:19 2023 +0900

[SPARK-42194][PS] Allow `columns` parameter when creating DataFrame with Series

### What changes were proposed in this pull request?

This PR proposes to allow the `columns` parameter when creating a `ps.DataFrame` from a `ps.Series`, under a limited condition.

### Why are the changes needed?

In pandas, when `columns` contains additional names beyond the one existing column, new columns filled with missing values are attached:

```python
>>> pser  # pandas Series
0.427027    1
0.904592    2
0.599768    3
Name: x, dtype: int64

>>> pd.DataFrame(pser, columns=["x", "y", "z"])
          x    y    z
0.427027  1  NaN  NaN
0.904592  2  NaN  NaN
0.599768  3  NaN  NaN
```

But this behavior is potentially quite expensive in pandas API on Spark, which I guess is why we don't currently support it. However, I've seen examples that use the following pattern:

```python
>>> ps.DataFrame(pser, columns=["x"])
          x
0.427027  1
0.904592  2
0.599768  3
```

As shown above, this just works the same as `pd.DataFrame(pser)` (without `columns`). But it fails with a `ps.Series`:

```python
>>> ps.DataFrame(psser, columns=["x"])  # `psser` is a pandas-on-Spark Series
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/pandas/frame.py", line 539, in __init__
    assert columns is None
AssertionError
```

In this case, the user might just want to state the column names explicitly in their code, so I believe we can allow this rather than raising an `AssertionError`.

### Does this PR introduce _any_ user-facing change?

**Before**

```python
>>> ps.DataFrame(psser, columns=["x"])  # `psser` is a pandas-on-Spark Series
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/pandas/frame.py", line 539, in __init__
    assert columns is None
AssertionError
```

**After**

```python
>>> ps.DataFrame(psser, columns=["x"])  # `psser` is a pandas-on-Spark Series
          x
0.427027  1
0.904592  2
0.599768  3
```
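For illustration only (this snippet is not part of the patch), a minimal sketch of how the new check is expected to behave, assuming an active Spark session and a pandas-on-Spark Series named `x`: `columns` must name exactly the one existing column, and anything else still raises `AssertionError`, unlike pandas, which would attach NaN-filled columns.

```python
import pandas as pd
import pyspark.pandas as ps

pser = pd.Series([1, 2, 3], name="x")
psser = ps.from_pandas(pser)

# Accepted after this change: `columns` names exactly the one existing column.
ps.DataFrame(psser, columns=["x"])        # list form, same result as ps.DataFrame(psser)
ps.DataFrame(psser, columns=("x",))       # tuple form
ps.DataFrame(psser, columns={"x": None})  # dict form; the keys are used as column names

# Still rejected: extra or mismatched column names raise AssertionError,
# since attaching NaN-filled columns would be expensive on Spark.
try:
    ps.DataFrame(psser, columns=["x", "y", "z"])
except AssertionError:
    print("multi-column case is still unsupported")
```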
### How was this patch tested?

Added UTs.

Closes #39786 from itholic/SPARK-42194.

Authored-by: itholic <haejoon....@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
(cherry picked from commit 086c8d9d6ce91974e97ab47aab1cf54974e12bbf)
Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 python/pyspark/pandas/frame.py                | 7 ++++++-
 python/pyspark/pandas/tests/test_dataframe.py | 7 +++++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index a217066eff6..4a6c2119104 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -536,9 +536,14 @@ class DataFrame(Frame, Generic[T]):
             if index is None:
                 internal = data._internal
         elif isinstance(data, ps.Series):
-            assert columns is None
             assert dtype is None
             assert not copy
+            # For pandas compatibility when `columns` contains only one valid column.
+            if columns is not None:
+                assert isinstance(columns, (dict, list, tuple))
+                assert len(columns) == 1
+                columns = list(columns.keys()) if isinstance(columns, dict) else columns
+                assert columns[0] == data._internal.data_spark_column_names[0]
             if index is None:
                 internal = data.to_frame()._internal
             else:
diff --git a/python/pyspark/pandas/tests/test_dataframe.py b/python/pyspark/pandas/tests/test_dataframe.py
index 1b06d321e13..d33c6584f7f 100644
--- a/python/pyspark/pandas/tests/test_dataframe.py
+++ b/python/pyspark/pandas/tests/test_dataframe.py
@@ -90,6 +90,13 @@ class DataFrameTest(ComparisonTestBase, SQLTestUtils):
         psser = ps.from_pandas(pser)
         self.assert_eq(pd.DataFrame(pser), ps.DataFrame(psser))
 
+        # check ps.DataFrame(ps.Series) with `columns`
+        self.assert_eq(pd.DataFrame(pser, columns=["x"]), ps.DataFrame(psser, columns=["x"]))
+        self.assert_eq(pd.DataFrame(pser, columns=("x",)), ps.DataFrame(psser, columns=("x",)))
+        self.assert_eq(
+            pd.DataFrame(pser, columns={"x": None}), ps.DataFrame(psser, columns={"x": None})
+        )
+
         # check psdf[pd.Index]
         pdf, psdf = self.df_pair
         column_mask = pdf.columns.isin(["a", "b"])

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org