Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20211#discussion_r160829658 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala --- @@ -457,13 +458,26 @@ class RelationalGroupedDataset protected[sql]( val groupingNamedExpressions = groupingExprs.map { case ne: NamedExpression => ne - case other => Alias(other, other.toString)() + case other => Alias(other, toPrettySQL(other))() } val groupingAttributes = groupingNamedExpressions.map(_.toAttribute) val child = df.logicalPlan val project = Project(groupingNamedExpressions ++ child.output, child) - val output = expr.dataType.asInstanceOf[StructType].toAttributes - val plan = FlatMapGroupsInPandas(groupingAttributes, expr, output, project) + val udfOutput: Seq[Attribute] = expr.dataType.asInstanceOf[StructType].toAttributes + val additionalGroupingAttributes = mutable.ArrayBuffer[Attribute]() + + for (attribute <- groupingAttributes) { + if (!udfOutput.map(_.name).contains(attribute.name)) { --- End diff -- @ueshin You brought up a very good point about an issue I struggle a bit with - conflicting column names in grouping column and UDF output. When this happens, we have a few choices: 1. Keep both columns and rename one of them The benefit of this approach is that it gives the user the most information, but might result in arbitrary column names such like `v_`. Also another downside is if the UDF just adds or replace columns, this will result duplicate columns. 2. Keep both columns and don't rename This is consistent with groupby agg behavior, so probably better than (1), but still, will result in duplicate columns. 3. Drop conflict group columns This is the approach implemented in this PR. The reason I choose this is because I think it's a rare case that the user want to change the grouping column and at the same time, want the original grouping column. Therefore, I think it makes most sense to make the user do a bit extra work - explicitly create a another column rather than overriding the grouping column. 4. Drop conflict UDF columns I don't think drop UDF output is reasonable behavior. @ueshin which one do you prefer?
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org