Re: [PR] [SPARK-53053][PYTHON][PANDAS] Support Pandas Extension Properties [spark]

via GitHub Fri, 22 May 2026 07:48:51 -0700


Copilot commented on code in PR #56066:
URL: https://github.com/apache/spark/pull/56066#discussion_r3289174823



##########
python/pyspark/pandas/frame.py:
##########
@@ -13901,7 +13912,7 @@ def _build_groupby(
     ) -> "DataFrameGroupBy":
         from pyspark.pandas.groupby import DataFrameGroupBy
 
-        return DataFrameGroupBy._build(self, by, as_index=as_index, 
dropna=dropna)
+        return self._constructorGroupBy._build(self, by, as_index=as_index, 
dropna=dropna)

Review Comment:
   `DataFrame._build_groupby` and `DataFrame.resample` now depend on 
`self._constructorGroupBy` and `self._constructorResampler`, but this diff 
doesn’t show those attributes/properties being defined. If they’re not defined 
somewhere in the class hierarchy, these calls will raise `AttributeError` at 
runtime. Consider (mandatory) either: (1) adding `@property` definitions (e.g., 
returning `DataFrameGroupBy` / `DataFrameResampler`) consistent with the new 
constructor pattern, or (2) reverting these two call sites to the previous 
concrete classes if constructor customization isn’t intended here.



##########
python/pyspark/pandas/frame.py:
##########
@@ -13965,7 +13976,7 @@ def resample(
         if len(agg_columns) == 0:
             raise ValueError("No available aggregation columns!")
 
-        return DataFrameResampler(
+        return self._constructorResampler(

Review Comment:
   `DataFrame._build_groupby` and `DataFrame.resample` now depend on 
`self._constructorGroupBy` and `self._constructorResampler`, but this diff 
doesn’t show those attributes/properties being defined. If they’re not defined 
somewhere in the class hierarchy, these calls will raise `AttributeError` at 
runtime. Consider (mandatory) either: (1) adding `@property` definitions (e.g., 
returning `DataFrameGroupBy` / `DataFrameResampler`) consistent with the new 
constructor pattern, or (2) reverting these two call sites to the previous 
concrete classes if constructor customization isn’t intended here.
   



##########
python/pyspark/pandas/tests/test_extension.py:
##########
@@ -138,6 +138,41 @@ def __init__(self, data):
             with self.assertRaises(AttributeError):
                 ps.Series([1, 2], dtype=object).bad
 
+    def test_extension_properties(self):

Review Comment:
   There are blank lines with trailing whitespace in the new test (e.g., around 
lines 164 and 172). Please remove trailing whitespace to keep the file clean 
and avoid lint noise.



##########
python/pyspark/pandas/tests/test_extension.py:
##########
@@ -138,6 +138,41 @@ def __init__(self, data):
             with self.assertRaises(AttributeError):
                 ps.Series([1, 2], dtype=object).bad
 
+    def test_extension_properties(self):
+        # Define subclasses mimicking the original pandas subclass 
implementation
+        class SubclassedSeries(ps.Series):
+            @property
+            def _constructor(self):
+                return SubclassedSeries
+
+            @property
+            def _constructor_expanddim(self):
+                return SubclassedDataFrame
+
+        class SubclassedDataFrame(ps.DataFrame):
+            @property
+            def _constructor(self):
+                return SubclassedDataFrame
+
+            @property
+            def _constructor_sliced(self):
+                return SubclassedSeries
+
+        # Test DataFrame extension properties
+        sub_psdf = SubclassedDataFrame(self.psdf._internal)
+        result_df = sub_psdf.head(2)
+        

Review Comment:
   There are blank lines with trailing whitespace in the new test (e.g., around 
lines 164 and 172). Please remove trailing whitespace to keep the file clean 
and avoid lint noise.



##########
python/pyspark/pandas/series.py:
##########
@@ -2796,7 +2804,8 @@ def head(self, n: int = 5) -> "Series":
         1     bee
         Name: animal, dtype: object
         """
-        return first_series(self.to_frame().head(n)).rename(self.name)
+        res = first_series(self.to_frame().head(n)).rename(self.name)
+        return self._constructor(data=res)

Review Comment:
   `Series.head()` now constructs an intermediate `Series` (`res`) and then 
re-wraps it by calling the constructor again. This adds extra object 
construction (and potentially extra planning/metadata work) on an operation 
that is commonly used interactively. Consider (mandatory) restructuring to 
construct the subclass result directly from the already-produced internal/frame 
output (e.g., by leveraging the frame returned by `to_frame().head(n)` and 
converting to the sliced constructor once), avoiding the second `Series(...)` 
construction.
   



##########
python/pyspark/pandas/tests/test_extension.py:
##########
@@ -138,6 +138,41 @@ def __init__(self, data):
             with self.assertRaises(AttributeError):
                 ps.Series([1, 2], dtype=object).bad
 
+    def test_extension_properties(self):
+        # Define subclasses mimicking the original pandas subclass 
implementation
+        class SubclassedSeries(ps.Series):
+            @property
+            def _constructor(self):
+                return SubclassedSeries
+
+            @property
+            def _constructor_expanddim(self):
+                return SubclassedDataFrame
+
+        class SubclassedDataFrame(ps.DataFrame):
+            @property
+            def _constructor(self):
+                return SubclassedDataFrame
+
+            @property
+            def _constructor_sliced(self):
+                return SubclassedSeries
+
+        # Test DataFrame extension properties
+        sub_psdf = SubclassedDataFrame(self.psdf._internal)
+        result_df = sub_psdf.head(2)
+        
+        self.assertIsInstance(result_df, SubclassedDataFrame)
+        self.assertEqual(result_df.shape, (2, 2))
+
+        # Test Series extension properties
+        # Pass the PySpark Series directly instead of its _internal frame
+        sub_psser = SubclassedSeries(self.psdf["a"])
+        result_ser = sub_psser.head(2)
+        
+        self.assertIsInstance(result_ser, SubclassedSeries)
+        self.assertEqual(len(result_ser), 2)
+
 

Review Comment:
   The new test only validates subclass preservation for `head()` on 
`DataFrame` and `Series`, but this PR updates many other return paths to use 
`_constructor` / `_constructor_sliced` / `_constructor_expanddim` (e.g., 
`Series.to_frame`, `DataFrame._apply_series_op`, `DataFrame.transform`, etc.). 
Consider adding (recommended) at least one assertion that covers `DataFrame` 
slicing to `Series` (to exercise `_constructor_sliced`) and `Series.to_frame()` 
(to exercise `_constructor_expanddim`), so regressions in those paths are 
caught.
   



##########
python/pyspark/pandas/tests/test_extension.py:
##########
@@ -138,6 +138,41 @@ def __init__(self, data):
             with self.assertRaises(AttributeError):
                 ps.Series([1, 2], dtype=object).bad
 
+    def test_extension_properties(self):
+        # Define subclasses mimicking the original pandas subclass 
implementation
+        class SubclassedSeries(ps.Series):
+            @property
+            def _constructor(self):
+                return SubclassedSeries
+
+            @property
+            def _constructor_expanddim(self):
+                return SubclassedDataFrame
+
+        class SubclassedDataFrame(ps.DataFrame):
+            @property
+            def _constructor(self):
+                return SubclassedDataFrame
+
+            @property
+            def _constructor_sliced(self):
+                return SubclassedSeries
+
+        # Test DataFrame extension properties
+        sub_psdf = SubclassedDataFrame(self.psdf._internal)
+        result_df = sub_psdf.head(2)
+        
+        self.assertIsInstance(result_df, SubclassedDataFrame)
+        self.assertEqual(result_df.shape, (2, 2))
+
+        # Test Series extension properties
+        # Pass the PySpark Series directly instead of its _internal frame
+        sub_psser = SubclassedSeries(self.psdf["a"])
+        result_ser = sub_psser.head(2)
+        
+        self.assertIsInstance(result_ser, SubclassedSeries)
+        self.assertEqual(len(result_ser), 2)
+
 

Review Comment:
   The new test only validates subclass preservation for `head()` on 
`DataFrame` and `Series`, but this PR updates many other return paths to use 
`_constructor` / `_constructor_sliced` / `_constructor_expanddim` (e.g., 
`Series.to_frame`, `DataFrame._apply_series_op`, `DataFrame.transform`, etc.). 
Consider adding (recommended) at least one assertion that covers `DataFrame` 
slicing to `Series` (to exercise `_constructor_sliced`) and `Series.to_frame()` 
(to exercise `_constructor_expanddim`), so regressions in those paths are 
caught.
   



##########
python/pyspark/pandas/tests/test_extension.py:
##########
@@ -138,6 +138,41 @@ def __init__(self, data):
             with self.assertRaises(AttributeError):
                 ps.Series([1, 2], dtype=object).bad
 
+    def test_extension_properties(self):

Review Comment:
   The new test only validates subclass preservation for `head()` on 
`DataFrame` and `Series`, but this PR updates many other return paths to use 
`_constructor` / `_constructor_sliced` / `_constructor_expanddim` (e.g., 
`Series.to_frame`, `DataFrame._apply_series_op`, `DataFrame.transform`, etc.). 
Consider adding (recommended) at least one assertion that covers `DataFrame` 
slicing to `Series` (to exercise `_constructor_sliced`) and `Series.to_frame()` 
(to exercise `_constructor_expanddim`), so regressions in those paths are 
caught.



##########
python/pyspark/pandas/tests/test_extension.py:
##########
@@ -138,6 +138,41 @@ def __init__(self, data):
             with self.assertRaises(AttributeError):
                 ps.Series([1, 2], dtype=object).bad
 
+    def test_extension_properties(self):
+        # Define subclasses mimicking the original pandas subclass 
implementation
+        class SubclassedSeries(ps.Series):
+            @property
+            def _constructor(self):
+                return SubclassedSeries
+
+            @property
+            def _constructor_expanddim(self):
+                return SubclassedDataFrame
+
+        class SubclassedDataFrame(ps.DataFrame):
+            @property
+            def _constructor(self):
+                return SubclassedDataFrame
+
+            @property
+            def _constructor_sliced(self):
+                return SubclassedSeries
+
+        # Test DataFrame extension properties
+        sub_psdf = SubclassedDataFrame(self.psdf._internal)
+        result_df = sub_psdf.head(2)
+        
+        self.assertIsInstance(result_df, SubclassedDataFrame)
+        self.assertEqual(result_df.shape, (2, 2))
+
+        # Test Series extension properties
+        # Pass the PySpark Series directly instead of its _internal frame
+        sub_psser = SubclassedSeries(self.psdf["a"])
+        result_ser = sub_psser.head(2)
+        

Review Comment:
   There are blank lines with trailing whitespace in the new test (e.g., around 
lines 164 and 172). Please remove trailing whitespace to keep the file clean 
and avoid lint noise.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-53053][PYTHON][PANDAS] Support Pandas Extension Properties [spark]

Reply via email to