[GitHub] [spark] amaliujia commented on a diff in pull request #38938: [WIP][SPARK-41403][CONNECT][PYTHON] Implement `DataFrame.describe`

2022-12-06 Thread GitBox


amaliujia commented on code in PR #38938:
URL: https://github.com/apache/spark/pull/38938#discussion_r1041708865


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -1239,6 +1239,16 @@ def summary(self, *statistics: str) -> "DataFrame":
 session=self._session,
 )
 
+def describe(self, *cols: str) -> "DataFrame":
+_cols: List[str] = list(cols)

Review Comment:
   Oops, I don't know why they are gone, but we need those blank lines; otherwise
   this PR won't pass the lint check...




[GitHub] [spark] amaliujia commented on a diff in pull request #38938: [WIP][SPARK-41403][CONNECT][PYTHON] Implement `DataFrame.describe`

2022-12-06 Thread GitBox


amaliujia commented on code in PR #38938:
URL: https://github.com/apache/spark/pull/38938#discussion_r1041691534


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -1239,6 +1239,16 @@ def summary(self, *statistics: str) -> "DataFrame":
 session=self._session,
 )
 
+def describe(self, *cols: str) -> "DataFrame":
+_cols: List[str] = list(cols)

Review Comment:
   ```suggestion
        """Computes basic statistics for numeric and string columns.

        .. versionadded:: 3.4.0

        This includes count, mean, stddev, min, and max. If no columns are
        given, this function computes statistics for all numerical or string
        columns.

        Notes
        -----
        This function is meant for exploratory data analysis, as we make no
        guarantee about the backward compatibility of the schema of the
        resulting :class:`DataFrame`.
        Use summary for expanded statistics and control over which statistics
        to compute.

        Parameters
        ----------
        cols : str, list, optional
            Column name or list of column names to describe by (default All
            columns).

        Returns
        -------
        :class:`DataFrame`
            A new DataFrame that describes (provides statistics) given
            DataFrame.
        """
        _cols: List[str] = list(cols)
   ```
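
For readers following along, a minimal usage sketch of the behavior this docstring documents; the session and column names below are illustrative, not taken from the PR:

```python
# Illustrative only: assumes a working SparkSession / Spark Connect session named `spark`.
df = spark.createDataFrame(
    [("Alice", 2, 85.5), ("Bob", 5, 92.0)],
    ["name", "age", "score"],
)

# No arguments: statistics (count, mean, stddev, min, max) over every
# numeric and string column.
df.describe().show()

# Restrict the statistics to specific columns.
df.describe("age", "score").show()
```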



[GitHub] [spark] amaliujia commented on a diff in pull request #38938: [WIP][SPARK-41403][CONNECT][PYTHON] Implement `DataFrame.describe`

2022-12-06 Thread GitBox


amaliujia commented on code in PR #38938:
URL: https://github.com/apache/spark/pull/38938#discussion_r1041314131


##
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##
@@ -404,6 +405,18 @@ message StatSummary {
   repeated string statistics = 2;
 }
 
+// Computes basic statistics for numeric and string columns, including count, mean, stddev, min,
+// and max. If no columns are given, this function computes statistics for all numerical or
+// string columns.
+// It will invoke 'Dataset.describe' to compute the results.
+message StatDescribe {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // Columns to compute statistics on.
+  repeated string cols = 2;

Review Comment:
   Can you follow
   https://github.com/apache/spark/blob/master/connector/connect/docs/adding-proto-messages.md
   to mark if this is required or optional?



##
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##
@@ -404,6 +405,18 @@ message StatSummary {
   repeated string statistics = 2;
 }
 
+// Computes basic statistics for numeric and string columns, including count, mean, stddev, min,
+// and max. If no columns are given, this function computes statistics for all numerical or
+// string columns.
+// It will invoke 'Dataset.describe' to compute the results.

Review Comment:
   `It will invoke 'Dataset.describe' to compute the results.`
   
   I tend to avoid saying this, as it exposes implementation details while the
   comment here actually serves as proto documentation. (E.g., we might switch
   the implementation to something else but forget to update the comment here,
   thus causing an inconsistency.)
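
To make the intent of the message concrete, here is a small, self-contained sketch of how a client could map `describe(*cols)` onto a StatDescribe-style relation; every class and function name below is a hypothetical stand-in, not code from the PR:

```python
# Hypothetical stand-ins only; a real client would build the generated
# protobuf messages rather than these dataclasses.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Relation:
    """Stand-in for the (required) input relation, spark.connect.Relation."""
    name: str


@dataclass
class StatDescribe:
    """Mirrors the proto fields: a required input plus a repeated column list."""
    input: Relation
    # An empty list means "all numerical or string columns", per the proto comment.
    cols: List[str] = field(default_factory=list)


def describe_plan(input_relation: Relation, *cols: str) -> StatDescribe:
    # The Python API's *cols vararg maps directly onto the repeated `cols` field.
    return StatDescribe(input=input_relation, cols=list(cols))


# describe_plan(Relation("employees"))                  -> statistics for all columns
# describe_plan(Relation("employees"), "age", "salary") -> statistics for two columns
```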


