Copilot commented on code in PR #56066:
URL: https://github.com/apache/spark/pull/56066#discussion_r3289174823
##########
python/pyspark/pandas/frame.py:
##########
@@ -13901,7 +13912,7 @@ def _build_groupby(
) -> "DataFrameGroupBy":
from pyspark.pandas.groupby import DataFrameGroupBy
- return DataFrameGroupBy._build(self, by, as_index=as_index,
dropna=dropna)
+ return self._constructorGroupBy._build(self, by, as_index=as_index,
dropna=dropna)
Review Comment:
`DataFrame._build_groupby` and `DataFrame.resample` now depend on
`self._constructorGroupBy` and `self._constructorResampler`, but this diff
doesn’t show those attributes/properties being defined. If they’re not defined
somewhere in the class hierarchy, these calls will raise `AttributeError` at
runtime. Consider (mandatory) either: (1) adding `@property` definitions (e.g.,
returning `DataFrameGroupBy` / `DataFrameResampler`) consistent with the new
constructor pattern, or (2) reverting these two call sites to the previous
concrete classes if constructor customization isn’t intended here.
##########
python/pyspark/pandas/frame.py:
##########
@@ -13965,7 +13976,7 @@ def resample(
if len(agg_columns) == 0:
raise ValueError("No available aggregation columns!")
- return DataFrameResampler(
+ return self._constructorResampler(
Review Comment:
`DataFrame._build_groupby` and `DataFrame.resample` now depend on
`self._constructorGroupBy` and `self._constructorResampler`, but this diff
doesn’t show those attributes/properties being defined. If they’re not defined
somewhere in the class hierarchy, these calls will raise `AttributeError` at
runtime. Consider (mandatory) either: (1) adding `@property` definitions (e.g.,
returning `DataFrameGroupBy` / `DataFrameResampler`) consistent with the new
constructor pattern, or (2) reverting these two call sites to the previous
concrete classes if constructor customization isn’t intended here.
##########
python/pyspark/pandas/tests/test_extension.py:
##########
@@ -138,6 +138,41 @@ def __init__(self, data):
with self.assertRaises(AttributeError):
ps.Series([1, 2], dtype=object).bad
+ def test_extension_properties(self):
Review Comment:
There are blank lines with trailing whitespace in the new test (e.g., around
lines 164 and 172). Please remove trailing whitespace to keep the file clean
and avoid lint noise.
##########
python/pyspark/pandas/tests/test_extension.py:
##########
@@ -138,6 +138,41 @@ def __init__(self, data):
with self.assertRaises(AttributeError):
ps.Series([1, 2], dtype=object).bad
+ def test_extension_properties(self):
+ # Define subclasses mimicking the original pandas subclass
implementation
+ class SubclassedSeries(ps.Series):
+ @property
+ def _constructor(self):
+ return SubclassedSeries
+
+ @property
+ def _constructor_expanddim(self):
+ return SubclassedDataFrame
+
+ class SubclassedDataFrame(ps.DataFrame):
+ @property
+ def _constructor(self):
+ return SubclassedDataFrame
+
+ @property
+ def _constructor_sliced(self):
+ return SubclassedSeries
+
+ # Test DataFrame extension properties
+ sub_psdf = SubclassedDataFrame(self.psdf._internal)
+ result_df = sub_psdf.head(2)
+
Review Comment:
There are blank lines with trailing whitespace in the new test (e.g., around
lines 164 and 172). Please remove trailing whitespace to keep the file clean
and avoid lint noise.
##########
python/pyspark/pandas/series.py:
##########
@@ -2796,7 +2804,8 @@ def head(self, n: int = 5) -> "Series":
1 bee
Name: animal, dtype: object
"""
- return first_series(self.to_frame().head(n)).rename(self.name)
+ res = first_series(self.to_frame().head(n)).rename(self.name)
+ return self._constructor(data=res)
Review Comment:
`Series.head()` now constructs an intermediate `Series` (`res`) and then
re-wraps it by calling the constructor again. This adds extra object
construction (and potentially extra planning/metadata work) on an operation
that is commonly used interactively. Consider (mandatory) restructuring to
construct the subclass result directly from the already-produced internal/frame
output (e.g., by leveraging the frame returned by `to_frame().head(n)` and
converting to the sliced constructor once), avoiding the second `Series(...)`
construction.
##########
python/pyspark/pandas/tests/test_extension.py:
##########
@@ -138,6 +138,41 @@ def __init__(self, data):
with self.assertRaises(AttributeError):
ps.Series([1, 2], dtype=object).bad
+ def test_extension_properties(self):
+ # Define subclasses mimicking the original pandas subclass
implementation
+ class SubclassedSeries(ps.Series):
+ @property
+ def _constructor(self):
+ return SubclassedSeries
+
+ @property
+ def _constructor_expanddim(self):
+ return SubclassedDataFrame
+
+ class SubclassedDataFrame(ps.DataFrame):
+ @property
+ def _constructor(self):
+ return SubclassedDataFrame
+
+ @property
+ def _constructor_sliced(self):
+ return SubclassedSeries
+
+ # Test DataFrame extension properties
+ sub_psdf = SubclassedDataFrame(self.psdf._internal)
+ result_df = sub_psdf.head(2)
+
+ self.assertIsInstance(result_df, SubclassedDataFrame)
+ self.assertEqual(result_df.shape, (2, 2))
+
+ # Test Series extension properties
+ # Pass the PySpark Series directly instead of its _internal frame
+ sub_psser = SubclassedSeries(self.psdf["a"])
+ result_ser = sub_psser.head(2)
+
+ self.assertIsInstance(result_ser, SubclassedSeries)
+ self.assertEqual(len(result_ser), 2)
+
Review Comment:
The new test only validates subclass preservation for `head()` on
`DataFrame` and `Series`, but this PR updates many other return paths to use
`_constructor` / `_constructor_sliced` / `_constructor_expanddim` (e.g.,
`Series.to_frame`, `DataFrame._apply_series_op`, `DataFrame.transform`, etc.).
Consider adding (recommended) at least one assertion that covers `DataFrame`
slicing to `Series` (to exercise `_constructor_sliced`) and `Series.to_frame()`
(to exercise `_constructor_expanddim`), so regressions in those paths are
caught.
##########
python/pyspark/pandas/tests/test_extension.py:
##########
@@ -138,6 +138,41 @@ def __init__(self, data):
with self.assertRaises(AttributeError):
ps.Series([1, 2], dtype=object).bad
+ def test_extension_properties(self):
+ # Define subclasses mimicking the original pandas subclass
implementation
+ class SubclassedSeries(ps.Series):
+ @property
+ def _constructor(self):
+ return SubclassedSeries
+
+ @property
+ def _constructor_expanddim(self):
+ return SubclassedDataFrame
+
+ class SubclassedDataFrame(ps.DataFrame):
+ @property
+ def _constructor(self):
+ return SubclassedDataFrame
+
+ @property
+ def _constructor_sliced(self):
+ return SubclassedSeries
+
+ # Test DataFrame extension properties
+ sub_psdf = SubclassedDataFrame(self.psdf._internal)
+ result_df = sub_psdf.head(2)
+
+ self.assertIsInstance(result_df, SubclassedDataFrame)
+ self.assertEqual(result_df.shape, (2, 2))
+
+ # Test Series extension properties
+ # Pass the PySpark Series directly instead of its _internal frame
+ sub_psser = SubclassedSeries(self.psdf["a"])
+ result_ser = sub_psser.head(2)
+
+ self.assertIsInstance(result_ser, SubclassedSeries)
+ self.assertEqual(len(result_ser), 2)
+
Review Comment:
The new test only validates subclass preservation for `head()` on
`DataFrame` and `Series`, but this PR updates many other return paths to use
`_constructor` / `_constructor_sliced` / `_constructor_expanddim` (e.g.,
`Series.to_frame`, `DataFrame._apply_series_op`, `DataFrame.transform`, etc.).
Consider adding (recommended) at least one assertion that covers `DataFrame`
slicing to `Series` (to exercise `_constructor_sliced`) and `Series.to_frame()`
(to exercise `_constructor_expanddim`), so regressions in those paths are
caught.
##########
python/pyspark/pandas/tests/test_extension.py:
##########
@@ -138,6 +138,41 @@ def __init__(self, data):
with self.assertRaises(AttributeError):
ps.Series([1, 2], dtype=object).bad
+ def test_extension_properties(self):
Review Comment:
The new test only validates subclass preservation for `head()` on
`DataFrame` and `Series`, but this PR updates many other return paths to use
`_constructor` / `_constructor_sliced` / `_constructor_expanddim` (e.g.,
`Series.to_frame`, `DataFrame._apply_series_op`, `DataFrame.transform`, etc.).
Consider adding (recommended) at least one assertion that covers `DataFrame`
slicing to `Series` (to exercise `_constructor_sliced`) and `Series.to_frame()`
(to exercise `_constructor_expanddim`), so regressions in those paths are
caught.
##########
python/pyspark/pandas/tests/test_extension.py:
##########
@@ -138,6 +138,41 @@ def __init__(self, data):
with self.assertRaises(AttributeError):
ps.Series([1, 2], dtype=object).bad
+ def test_extension_properties(self):
+ # Define subclasses mimicking the original pandas subclass
implementation
+ class SubclassedSeries(ps.Series):
+ @property
+ def _constructor(self):
+ return SubclassedSeries
+
+ @property
+ def _constructor_expanddim(self):
+ return SubclassedDataFrame
+
+ class SubclassedDataFrame(ps.DataFrame):
+ @property
+ def _constructor(self):
+ return SubclassedDataFrame
+
+ @property
+ def _constructor_sliced(self):
+ return SubclassedSeries
+
+ # Test DataFrame extension properties
+ sub_psdf = SubclassedDataFrame(self.psdf._internal)
+ result_df = sub_psdf.head(2)
+
+ self.assertIsInstance(result_df, SubclassedDataFrame)
+ self.assertEqual(result_df.shape, (2, 2))
+
+ # Test Series extension properties
+ # Pass the PySpark Series directly instead of its _internal frame
+ sub_psser = SubclassedSeries(self.psdf["a"])
+ result_ser = sub_psser.head(2)
+
Review Comment:
There are blank lines with trailing whitespace in the new test (e.g., around
lines 164 and 172). Please remove trailing whitespace to keep the file clean
and avoid lint noise.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]