This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new dbedb81ed39  [SPARK-42194][PS] Allow `columns` parameter when creating DataFrame with Series
dbedb81ed39 is described below

commit dbedb81ed39ca5561a1907260b84fa8dd96ea825
Author: itholic <haejoon....@databricks.com>
AuthorDate: Sun Jan 29 11:34:19 2023 +0900

[SPARK-42194][PS] Allow `columns` parameter when creating DataFrame with Series

### What changes were proposed in this pull request?

This PR proposes to allow the `columns` parameter when creating a `ps.DataFrame` from a `ps.Series`, under a limited condition.

### Why are the changes needed?

In pandas, when `columns` contains additional names beyond the one existing column, new columns filled with missing values are attached:

```python
>>> pser  # pandas Series
0.427027    1
0.904592    2
0.599768    3
Name: x, dtype: int64

>>> pd.DataFrame(pser, columns=["x", "y", "z"])
          x    y    z
0.427027  1  NaN  NaN
0.904592  2  NaN  NaN
0.599768  3  NaN  NaN
```

But this behavior is potentially quite expensive in pandas API on Spark, which I guess is why we don't currently support it. However, I've seen examples that use the following pattern:

```python
>>> ps.DataFrame(pser, columns=["x"])
          x
0.427027  1
0.904592  2
0.599768  3
```

As shown above, this just works the same as `pd.DataFrame(pser)` (without `columns`). But it fails with a `ps.Series`:

```python
>>> ps.DataFrame(psser, columns=["x"])  # `psser` is a pandas-on-Spark Series
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/pandas/frame.py", line 539, in __init__
    assert columns is None
AssertionError
```

In this case, the user might just want to state the column names explicitly in their code, so I believe we can allow this rather than raising an `AssertionError`.

### Does this PR introduce _any_ user-facing change?

**Before**

```python
>>> ps.DataFrame(psser, columns=["x"])  # `psser` is a pandas-on-Spark Series
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/pandas/frame.py", line 539, in __init__
    assert columns is None
AssertionError
```

**After**

```python
>>> ps.DataFrame(psser, columns=["x"])  # `psser` is a pandas-on-Spark Series
          x
0.427027  1
0.904592  2
0.599768  3
```
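For illustration only (this snippet is not part of the patch), a minimal sketch of how the new check is expected to behave, assuming an active Spark session and a pandas-on-Spark Series named `x`: `columns` must name exactly the one existing column, and anything else still raises `AssertionError`, unlike pandas, which would attach NaN-filled columns.

```python
import pandas as pd
import pyspark.pandas as ps

pser = pd.Series([1, 2, 3], name="x")
psser = ps.from_pandas(pser)

# Accepted after this change: `columns` names exactly the one existing column.
ps.DataFrame(psser, columns=["x"])        # list form, same result as ps.DataFrame(psser)
ps.DataFrame(psser, columns=("x",))       # tuple form
ps.DataFrame(psser, columns={"x": None})  # dict form; the keys are used as column names

# Still rejected: extra or mismatched column names raise AssertionError,
# since attaching NaN-filled columns would be expensive on Spark.
try:
    ps.DataFrame(psser, columns=["x", "y", "z"])
except AssertionError:
    print("multi-column case is still unsupported")
```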
### How was this patch tested?

Added UTs.

Closes #39786 from itholic/SPARK-42194.

Authored-by: itholic <haejoon....@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
(cherry picked from commit 086c8d9d6ce91974e97ab47aab1cf54974e12bbf)
Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 python/pyspark/pandas/frame.py                | 7 ++++++-
 python/pyspark/pandas/tests/test_dataframe.py | 7 +++++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index a217066eff6..4a6c2119104 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -536,9 +536,14 @@ class DataFrame(Frame, Generic[T]):
             if index is None:
                 internal = data._internal
         elif isinstance(data, ps.Series):
-            assert columns is None
             assert dtype is None
             assert not copy
+            # For pandas compatibility when `columns` contains only one valid column.
+            if columns is not None:
+                assert isinstance(columns, (dict, list, tuple))
+                assert len(columns) == 1
+                columns = list(columns.keys()) if isinstance(columns, dict) else columns
+                assert columns[0] == data._internal.data_spark_column_names[0]
             if index is None:
                 internal = data.to_frame()._internal
             else:
diff --git a/python/pyspark/pandas/tests/test_dataframe.py b/python/pyspark/pandas/tests/test_dataframe.py
index 1b06d321e13..d33c6584f7f 100644
--- a/python/pyspark/pandas/tests/test_dataframe.py
+++ b/python/pyspark/pandas/tests/test_dataframe.py
@@ -90,6 +90,13 @@ class DataFrameTest(ComparisonTestBase, SQLTestUtils):
         psser = ps.from_pandas(pser)
         self.assert_eq(pd.DataFrame(pser), ps.DataFrame(psser))
 
+        # check ps.DataFrame(ps.Series) with `columns`
+        self.assert_eq(pd.DataFrame(pser, columns=["x"]), ps.DataFrame(psser, columns=["x"]))
+        self.assert_eq(pd.DataFrame(pser, columns=("x",)), ps.DataFrame(psser, columns=("x",)))
+        self.assert_eq(
+            pd.DataFrame(pser, columns={"x": None}), ps.DataFrame(psser, columns={"x": None})
+        )
+
         # check psdf[pd.Index]
         pdf, psdf = self.df_pair
         column_mask = pdf.columns.isin(["a", "b"])

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org