Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20211#discussion_r160605967

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -457,13 +458,26 @@ class RelationalGroupedDataset protected[sql](
     val groupingNamedExpressions = groupingExprs.map {
       case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+      case other => Alias(other, toPrettySQL(other))()
     }
     val groupingAttributes = groupingNamedExpressions.map(_.toAttribute)
     val child = df.logicalPlan
     val project = Project(groupingNamedExpressions ++ child.output, child)
-    val output = expr.dataType.asInstanceOf[StructType].toAttributes
-    val plan = FlatMapGroupsInPandas(groupingAttributes, expr, output, project)
+    val udfOutput: Seq[Attribute] = expr.dataType.asInstanceOf[StructType].toAttributes
+    val additionalGroupingAttributes = mutable.ArrayBuffer[Attribute]()
+
+    for (attribute <- groupingAttributes) {
+      if (!udfOutput.map(_.name).contains(attribute.name)) {
--- End diff --

I'm wondering whether we should decide the additional grouping attributes by their names alone. For example, from the tests:

```python
result3 = df.groupby('id', 'v').apply(foo).sort('id', 'v').toPandas()
```

The column `v` in `result3` is not the actual grouping value. It is overwritten by the value returned from the UDF, because the UDF's output columns include a column with the same name. I'm not sure that is the desired behavior.
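
To make the concern concrete, here is a minimal sketch in plain Python (not Spark code) that only mimics the name-based check `!udfOutput.map(_.name).contains(attribute.name)` from the diff. The function name `additional_grouping_columns`, the UDF output schema `["id", "v"]`, and the column `k` are assumptions chosen for illustration. They are not taken from the PR or its tests.

```python
# Hypothetical model of the name-based filter in the diff; not Spark's implementation.

def additional_grouping_columns(grouping_names, udf_output_names):
    """Return the grouping columns that would be added alongside the UDF output.

    A grouping column is kept only if no UDF output column shares its name.
    """
    udf_names = set(udf_output_names)
    return [name for name in grouping_names if name not in udf_names]


# Assumed UDF output schema for illustration: columns named id and v.
udf_output = ["id", "v"]

# groupby('id', 'v'): both names already appear in the UDF output, so no
# grouping column is added, and `v` in the result is whatever the UDF returned.
print(additional_grouping_columns(["id", "v"], udf_output))  # []

# Grouping by a name the UDF does not return keeps the real grouping value.
print(additional_grouping_columns(["id", "k"], udf_output))  # ['k']
```

Under these assumptions, matching by name alone cannot distinguish "the UDF passed the grouping value through" from "the UDF returned an unrelated column that happens to share the name", which is the ambiguity raised above.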