Khoa Tran created SPARK-23705:
---------------------------------

             Summary: dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
                 Key: SPARK-23705
                 URL: https://issues.apache.org/jira/browse/SPARK-23705
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Khoa Tran

{code:java}
package org.apache.spark.sql
. . .
class Dataset[T] private[sql](
. . .
  def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
    val colNames: Seq[String] = col1 +: cols
    RelationalGroupedDataset(
      toDF(), colNames.map(colName => resolve(colName)), RelationalGroupedDataset.GroupByType)
  }
{code}

As written, every name in `colNames` is resolved, so a name passed more than once produces a repeated grouping expression. `groupBy` should append a `.distinct` after `colNames` before the names are resolved.

I am not sure whether the community agrees with this, or whether it should be left to users to perform the distinct operation themselves.
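
To illustrate the suggestion, here is a minimal sketch that uses only plain Scala collections, not Spark itself. `GroupByDistinctSketch` and `groupingColumns` are hypothetical names introduced for this example; they only mirror how `groupBy` builds `colNames` and show what adding `.distinct` would do to that sequence.

```scala
// Hypothetical helper, not part of Spark: it mirrors the way groupBy
// assembles colNames from its arguments, then applies .distinct.
object GroupByDistinctSketch {

  // Seq.distinct drops repeated names and keeps the first occurrence
  // of each one, so the order of the remaining columns is preserved.
  def groupingColumns(col1: String, cols: String*): Seq[String] =
    (col1 +: cols).distinct

  def main(args: Array[String]): Unit = {
    // "dept" is passed twice; only the first occurrence survives.
    println(groupingColumns("dept", "region", "dept").mkString(", "))
  }
}
```

Until something like this is settled, a caller can get the same effect by deduplicating the names before passing them to `groupBy`.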