[GitHub] [spark] zhengruifeng commented on a diff in pull request #37923: [SPARK-40334][PS] Implement `GroupBy.prod`

GitBox Thu, 22 Sep 2022 20:03:52 -0700


zhengruifeng commented on code in PR #37923:
URL: https://github.com/apache/spark/pull/37923#discussion_r978229729



##########
python/pyspark/pandas/groupby.py:
##########
@@ -3237,10 +3337,10 @@ def _validate_agg_columns(self, numeric_only: Optional[bool], function_name: str
             if not numeric_only:
                 if has_non_numeric:
                     warnings.warn(
-                        "Dropping invalid columns in DataFrameGroupBy.mean is deprecated. "

Review Comment:
   nice catch!



##########
python/pyspark/pandas/groupby.py:
##########
@@ -993,6 +993,106 @@ def nth(self, n: int) -> FrameLike:
 
         return self._prepare_return(DataFrame(internal))
 
+    def prod(self, numeric_only: Optional[bool] = True, min_count: int = 0) -> FrameLike:
+        """
+        Compute prod of groups.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        numeric_only : bool, default False
+            Include only float, int, boolean columns. If None, will attempt to use
+            everything, then use only numeric data.
+
+        min_count: int, default 0
+            The required number of valid values to perform the operation.
+            If fewer than min_count non-NA values are present the result will be NA.
+
+        Returns
+        -------
+        Series or DataFrame
+            Computed prod of values within each group.
+
+        See Also
+        --------
+        pyspark.pandas.Series.groupby
+        pyspark.pandas.DataFrame.groupby
+
+        Examples
+        --------
+        >>> import numpy as np
+        >>> df = ps.DataFrame(
+        ...     {
+        ...         "A": [1, 1, 2, 1, 2],
+        ...         "B": [np.nan, 2, 3, 4, 5],
+        ...         "C": [1, 2, 1, 1, 2],
+        ...         "D": [True, False, True, False, True],
+        ...     }
+        ... )
+
+        Groupby one column and return the prod of the remaining columns in
+        each group.
+
+        >>> df.groupby('A').prod().sort_index()
+             B  C  D
+        A
+        1  8.0  2  0
+        2  15.0 2  1
+
+        >>> df.groupby('A').prod(min_count=3).sort_index()
+             B  C   D
+        A
+        1  NaN  2.0  0.0
+        2  NaN NaN  NaN
+        """
+
+        self._validate_agg_columns(numeric_only=numeric_only, function_name="prod")
+
+        groupkey_names = [SPARK_INDEX_NAME_FORMAT(i) for i in range(len(self._groupkeys))]
+        internal, agg_columns, sdf = self._prepare_reduce(
+            groupkey_names=groupkey_names,
+            accepted_spark_types=(NumericType, BooleanType),
+            bool_to_numeric=True,
+        )
+
+        psdf: DataFrame = DataFrame(internal)
+        if len(psdf._internal.column_labels) > 0:
+
+            stat_exprs = []
+            for label in psdf._internal.column_labels:
+                psser = psdf._psser_for(label)
+                column = psser._dtype_op.nan_to_null(psser).spark.column
+                data_type = psser.spark.data_type
+                aggregating = (
+                    F.product(column).cast(data_type)
+                    if isinstance(data_type, IntegralType)
+                    else F.product(column)
+                )
+
+                if min_count > 0:
+                    prod_scol = F.when(F.count(column) < min_count, F.lit(None)).otherwise(

Review Comment:
   ```suggestion
                       prod_scol = F.when(F.count(F.when(~F.isnull(column), F.lit(0))) < min_count, F.lit(None)).otherwise(
   ```
   
   Sorry, I forgot to filter out invalid values in my previous suggestion.
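
The `min_count` rule in the docstring (count only non-NA values, return NA when there are fewer than `min_count`) can be sketched in plain Python. This is only a model of the intended semantics, not the Spark expression from the PR; `prod_with_min_count` is a hypothetical helper name, and `None` stands in for NA.

```python
import math
from typing import Iterable, Optional


def prod_with_min_count(
    values: Iterable[Optional[float]], min_count: int = 0
) -> Optional[float]:
    """Product of the non-NA values, or None if fewer than min_count are valid."""
    # Treat both None and float NaN as missing, mirroring the nan-to-null step.
    valid = [
        v
        for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if len(valid) < min_count:
        return None
    # math.prod of an empty list is 1, the multiplicative identity.
    return math.prod(valid)


# Column B of group A == 1 in the docstring example: [NaN, 2, 4]
print(prod_with_min_count([float("nan"), 2.0, 4.0]))               # 8.0
print(prod_with_min_count([float("nan"), 2.0, 4.0], min_count=3))  # None
```

Only two of the three values are valid, so the second call falls below `min_count=3` and yields NA, which matches the `NaN` shown for column B in the docstring.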



##########
python/pyspark/pandas/groupby.py:
##########
@@ -993,6 +993,106 @@ def nth(self, n: int) -> FrameLike:
 
         return self._prepare_return(DataFrame(internal))
 
+    def prod(self, numeric_only: Optional[bool] = True, min_count: int = 0) -> FrameLike:
+        """
+        Compute prod of groups.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        numeric_only : bool, default False
+            Include only float, int, boolean columns. If None, will attempt to use
+            everything, then use only numeric data.
+
+        min_count: int, default 0
+            The required number of valid values to perform the operation.
+            If fewer than min_count non-NA values are present the result will be NA.
+
+        Returns
+        -------
+        Series or DataFrame
+            Computed prod of values within each group.
+
+        See Also
+        --------
+        pyspark.pandas.Series.groupby
+        pyspark.pandas.DataFrame.groupby
+
+        Examples
+        --------
+        >>> import numpy as np
+        >>> df = ps.DataFrame(
+        ...     {
+        ...         "A": [1, 1, 2, 1, 2],
+        ...         "B": [np.nan, 2, 3, 4, 5],
+        ...         "C": [1, 2, 1, 1, 2],
+        ...         "D": [True, False, True, False, True],
+        ...     }
+        ... )
+
+        Groupby one column and return the prod of the remaining columns in
+        each group.
+
+        >>> df.groupby('A').prod().sort_index()
+             B  C  D
+        A
+        1  8.0  2  0
+        2  15.0 2  1
+
+        >>> df.groupby('A').prod(min_count=3).sort_index()
+             B  C   D
+        A
+        1  NaN  2.0  0.0
+        2  NaN NaN  NaN
+        """
+
+        self._validate_agg_columns(numeric_only=numeric_only, function_name="prod")
+
+        groupkey_names = [SPARK_INDEX_NAME_FORMAT(i) for i in range(len(self._groupkeys))]
+        internal, agg_columns, sdf = self._prepare_reduce(
+            groupkey_names=groupkey_names,
+            accepted_spark_types=(NumericType, BooleanType),
+            bool_to_numeric=True,
+        )
+
+        psdf: DataFrame = DataFrame(internal)
+        if len(psdf._internal.column_labels) > 0:
+
+            stat_exprs = []
+            for label in psdf._internal.column_labels:
+                psser = psdf._psser_for(label)
+                column = psser._dtype_op.nan_to_null(psser).spark.column
+                data_type = psser.spark.data_type
+                aggregating = (
+                    F.product(column).cast(data_type)

Review Comment:
   ```suggestion
                       F.product(column).cast("long")
   ```
   
   What about always casting to `long`, since `prod` can easily produce large numbers?
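
To illustrate the concern, here is a small plain-Python check of value ranges. It only shows that a product of values that each fit in 32 bits can exceed the 32-bit range while still fitting in 64 bits; it does not model what Spark does when a cast overflows. The `fits` helper and the sample values are assumptions made for this sketch.

```python
import math

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def fits(value: int, lo: int, hi: int) -> bool:
    """Return True if value lies within the inclusive range [lo, hi]."""
    return lo <= value <= hi


group = [100_000, 100_000]  # each factor fits comfortably in 32 bits
product = math.prod(group)  # 10_000_000_000

print(fits(product, INT32_MIN, INT32_MAX))  # False
print(fits(product, INT64_MIN, INT64_MAX))  # True
```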



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

