GitHub user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20211#discussion_r160605967
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
    @@ -457,13 +458,26 @@ class RelationalGroupedDataset protected[sql](
     
         val groupingNamedExpressions = groupingExprs.map {
           case ne: NamedExpression => ne
    -      case other => Alias(other, other.toString)()
    +      case other => Alias(other, toPrettySQL(other))()
         }
         val groupingAttributes = groupingNamedExpressions.map(_.toAttribute)
         val child = df.logicalPlan
         val project = Project(groupingNamedExpressions ++ child.output, child)
    -    val output = expr.dataType.asInstanceOf[StructType].toAttributes
    -    val plan = FlatMapGroupsInPandas(groupingAttributes, expr, output, project)
    +    val udfOutput: Seq[Attribute] = expr.dataType.asInstanceOf[StructType].toAttributes
    +    val additionalGroupingAttributes = mutable.ArrayBuffer[Attribute]()
    +
    +    for (attribute <- groupingAttributes) {
    +      if (!udfOutput.map(_.name).contains(attribute.name)) {
    --- End diff --
    
    I'm wondering whether we should determine the additional grouping attributes by their names alone.
    
    For example from tests:
    
    ```python
    result3 = df.groupby('id', 'v').apply(foo).sort('id', 'v').toPandas()
    ```
    
    The column `v` in `result3` is not the actual grouping value: it is overwritten by the value returned from the UDF, because the UDF's output columns include one with the same name. I'm not sure that is the desired behavior.
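
To make the concern concrete, here is a minimal pure-Python sketch of the name-based check. It is not Spark code. The helper names `additional_grouping_names` and `output_columns` are hypothetical, and it assumes that the truncated loop body in the diff collects the attribute when the check passes and that the collected attributes are prepended to the UDF output. The schema returned by `foo` is not shown in the diff, so the column lists below are illustrative only.

```python
# Hypothetical stand-in for the name-based check in the diff; not Spark code.
def additional_grouping_names(grouping_names, udf_output_names):
    """Return the grouping columns whose names are absent from the UDF output."""
    udf_names = set(udf_output_names)
    return [name for name in grouping_names if name not in udf_names]


def output_columns(grouping_names, udf_output_names):
    """Assumption: only the non-colliding grouping columns are prepended."""
    extra = additional_grouping_names(grouping_names, udf_output_names)
    return extra + list(udf_output_names)


# The UDF output already names 'id' and 'v', so no grouping column is added
# and 'v' in the result is whatever the UDF computed.
collision = output_columns(["id", "v"], ["id", "v"])

# The UDF output uses a different (illustrative) name, so both grouping
# columns are kept alongside it.
no_collision = output_columns(["id", "v"], ["v_scaled"])
```

Under these assumptions, a UDF that returns a column named `v` silently replaces the grouping value, and a check based on names alone cannot tell the two apart.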
    
      

