Eyal Allweil created DATAFU-177:
-----------------------------------

             Summary: Add dedupAllExceptBy
                 Key: DATAFU-177
                 URL: https://issues.apache.org/jira/browse/DATAFU-177
             Project: DataFu
          Issue Type: Improvement
            Reporter: Eyal Allweil

A new method for when you want to de-duplicate records without losing any "real" data.

For example, suppose a server creates events with an autogenerated event id, and events are sometimes duplicated. You don't want double rows that differ only in their event ids, but if any of the other fields are distinct you want to keep the rows (with their original event ids); otherwise you could just drop the event id column. Without this method, keeping at least one event id per group of duplicates means tediously listing all the other columns.

The API as I implemented it looks like this:

{code:java}
/**
 * @param df
 * @param ignoredColumn The one column whose value you need only one of
 * @param aggFunction Aggregation used to pick the value that is kept; default is max
 * @return DataFrame
 */
def dedupByAllExcept(df: DataFrame, ignoredColumn: String, aggFunction: String => Column = functions.max): DataFrame
{code}
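
To make the intended behavior concrete, here is a minimal sketch of one way a method with this signature could be written. It is not taken from the actual patch. It assumes Spark SQL is on the classpath and that column names contain no dots or other characters that functions.col would interpret specially. The DedupSketch object and the sample event_id, user and ts columns are hypothetical names used only for illustration.

{code:scala}
import org.apache.spark.sql.{Column, DataFrame, SparkSession, functions}

// Hypothetical wrapper object, for illustration only.
object DedupSketch {

  // Sketch: group by every column except ignoredColumn, then aggregate
  // ignoredColumn so that exactly one of its values survives per group.
  def dedupByAllExcept(df: DataFrame,
                       ignoredColumn: String,
                       aggFunction: String => Column = functions.max): DataFrame = {
    // All columns other than the ignored one become the grouping key.
    val groupingColumns = df.columns.filter(_ != ignoredColumn).map(functions.col)

    df.groupBy(groupingColumns: _*)
      .agg(aggFunction(ignoredColumn).as(ignoredColumn))
      // Restore the original column order, which groupBy/agg changes.
      .select(df.columns.map(functions.col): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("dedup-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: e1 and e2 are the same event recorded twice.
    val events = Seq(
      ("e1", "alice", 10L),
      ("e2", "alice", 10L),
      ("e3", "bob", 11L)
    ).toDF("event_id", "user", "ts")

    dedupByAllExcept(events, "event_id").show()

    spark.stop()
  }
}
{code}

With the default max aggregation, the two "alice" rows collapse into a single row that keeps event id e2, while the "bob" row is left as it is. Passing functions.min instead would keep e1.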