Github user TRANTANKHOA commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20947#discussion_r178444354
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1593,7 +1596,9 @@ class Dataset[T] private[sql](
       def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
         val colNames: Seq[String] = col1 +: cols
         RelationalGroupedDataset(
    -      toDF(), colNames.map(colName => resolve(colName)), RelationalGroupedDataset.GroupByType)
    +      toDF(),
    +      colNames.distinct.map(colName => resolve(colName)),
    --- End diff --
    
    Yes, this ticket is only about making this behavior change, so the real question is whether the team thinks this is the expected behavior. I personally find that it helps eliminate bugs in our ETL. I don't think anyone needs duplicated columns in their grouped dataset by intention.
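    The effect of the proposed `.distinct` can be sketched without a Spark session. Only the `col1 +: cols` varargs shape and the `.distinct` call come from the diff; the object and method names below are hypothetical, and plain strings stand in for resolved `Column`s:

    ```scala
    // Hypothetical sketch of the dedup step in Dataset.groupBy(col1, cols*).
    // The diff applies .distinct to the column-name list before resolving
    // each name, so a column passed twice only produces one grouping key.
    object GroupByDedup {
      def dedupColNames(col1: String, cols: String*): List[String] =
        (col1 +: cols).distinct.toList

      def main(args: Array[String]): Unit = {
        // "year" passed twice collapses to a single grouping column.
        println(dedupColNames("year", "month", "year"))
      }
    }
    ```

    With the current code, the duplicate name would be resolved twice and appear twice in the grouping expressions; with `.distinct`, `dedupColNames("year", "month", "year")` yields `List("year", "month")`.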


---
