Eyal Allweil created DATAFU-177:
-----------------------------------

             Summary: Add dedupAllExceptBy
                 Key: DATAFU-177
                 URL: https://issues.apache.org/jira/browse/DATAFU-177
             Project: DataFu
          Issue Type: Improvement
            Reporter: Eyal Allweil


A new method for de-duplicating records without losing any "real" data.

For example, suppose a server creates events with an autogenerated event id, 
and sometimes events are duplicated. You don't want duplicate rows that differ 
only in their event ids, but if any of the other fields differ you want to keep 
the rows (with their original event ids). If you didn't need the ids, you could 
simply drop the event id column and de-duplicate. To keep at least one event id 
per set of duplicates, however, you currently need to tediously list all the 
other columns.
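
To make the intended behaviour concrete, here is a minimal sketch using plain 
Scala collections rather than Spark. The Event class, its field names and the 
sample values are hypothetical and only serve as an illustration; keeping the 
largest id mirrors the proposed default of max.

```scala
// Hypothetical event record; the field names are illustrative only.
final case class Event(eventId: Long, server: String, payload: String)

object DedupSemantics {
  // Group rows by every field except eventId, then keep the row with the
  // largest eventId in each group.
  def dedupIgnoringEventId(events: Seq[Event]): Seq[Event] =
    events
      .groupBy(e => (e.server, e.payload))
      .values
      .map(group => group.maxBy(_.eventId))
      .toSeq
      .sortBy(_.eventId)

  val events: Seq[Event] = Seq(
    Event(1L, "web-1", "login"),
    Event(2L, "web-1", "login"),  // same data as event 1, different id
    Event(3L, "web-1", "logout")  // differs in payload, so it is kept
  )

  // Events 1 and 2 collapse into one row (id 2); event 3 survives unchanged.
  val deduped: Seq[Event] = dedupIgnoringEventId(events)
}
```

Note that the grouping key has to name every field other than the id, which is 
exactly the tedium described above once a record has many columns.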

The API as I implemented it looks like this:
{code:java}
  /**
   * @param df The DataFrame to de-duplicate
   * @param ignoredColumn The one column for which only a single value is kept
   * @param aggFunction Chooses which value of ignoredColumn to keep; default is max
   * @return DataFrame
   */
  def dedupByAllExcept(df: DataFrame,
                       ignoredColumn: String,
                       aggFunction: String => Column = functions.max): DataFrame
{code}
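
For discussion, one way a method with this signature could be written is 
sketched below. This is not necessarily the implementation attached to the 
issue; the enclosing object name DedupSketch is hypothetical, and the sketch 
assumes column names that need no escaping (no dots or backticks).

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions

// Hypothetical container object, used only for this sketch.
object DedupSketch {
  def dedupByAllExcept(df: DataFrame,
                       ignoredColumn: String,
                       aggFunction: String => Column = functions.max): DataFrame = {
    // Every column except the ignored one becomes a grouping key, so the
    // caller does not have to list them by hand.
    val keyColumns = df.columns.filter(_ != ignoredColumn)

    df.groupBy(keyColumns.map(functions.col): _*)
      // Collapse the ignored column to a single value per group.
      .agg(aggFunction(ignoredColumn).as(ignoredColumn))
      // The aggregated column ends up last; restore the original column order.
      .select(df.columns.map(functions.col): _*)
  }
}
```

Under these assumptions, DedupSketch.dedupByAllExcept(events, "event_id") would 
keep the largest event id in each set of otherwise identical rows, and passing 
functions.min as the third argument would keep the smallest instead.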



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
