[ https://issues.apache.org/jira/browse/SPARK-23705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yu Wang updated SPARK-23705:
----------------------------
    Comment: was deleted

(was: [~khoatrantan2000] Could you assign this patch to me?)

> dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-23705
>                 URL: https://issues.apache.org/jira/browse/SPARK-23705
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Khoa Tran
>            Priority: Minor
>              Labels: beginner, easyfix, features, newbie, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> {code:java}
> package org.apache.spark.sql
> .
> .
> .
> class Dataset[T] private[sql](
> .
> .
> .
>   def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
>     val colNames: Seq[String] = col1 +: cols
>     RelationalGroupedDataset(
>       toDF(), colNames.map(colName => resolve(colName)),
>       RelationalGroupedDataset.GroupByType)
>   }
> {code}
> This method should append `.distinct` to `colNames` before the names are
> resolved in `groupBy`, so that a column name passed more than once is
> only grouped on once.
>
> I am not sure whether the community agrees with this, or whether it is up
> to users to perform the distinct operation themselves.
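
For illustration, here is a minimal sketch of the de-duplication step the report proposes, written as plain Scala so it runs without a Spark dependency. The object `GroupByColumns` and the helper `distinctColumnNames` are hypothetical names made up for this example; they are not part of Spark, and this is not the patch attached to the issue. The sketch only relies on the standard `Seq.distinct`, which keeps the first occurrence of each element and preserves order.

```scala
object GroupByColumns {
  // Hypothetical helper (not a Spark API): builds the column-name list the
  // same way the quoted groupBy does, then drops repeated names.
  def distinctColumnNames(col1: String, cols: String*): Seq[String] =
    (col1 +: cols).distinct

  def main(args: Array[String]): Unit = {
    // "dept" is passed twice; only its first occurrence is kept.
    val names = distinctColumnNames("dept", "region", "dept")
    println(names.mkString(", ")) // prints "dept, region"
  }
}
```

Note that this compares names as exact strings, so `"dept"` and `"DEPT"` would both survive even if the analyzer later resolves them to the same column. Whether that case also needs handling is an open question the report does not address.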