[ https://issues.apache.org/jira/browse/DATAFU-177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eyal Allweil updated DATAFU-177:
--------------------------------
    Summary: Add dedupByAllExcept  (was: Add dedupAllExceptBy)

> Add dedupByAllExcept
> --------------------
>
>                 Key: DATAFU-177
>                 URL: https://issues.apache.org/jira/browse/DATAFU-177
>             Project: DataFu
>          Issue Type: Improvement
>    Affects Versions: 2.1.0
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> A new method for when you want to de-duplicate records, but not lose any
> "real" data.
> For example, suppose a server creates events with an autogenerated event id,
> and sometimes events are duplicated. You don't want double rows that differ
> only in their event ids, but if any of the other fields are distinct you want
> to keep the rows (with their original event ids); otherwise you'd just drop
> the event id column. In order to keep at least one event id value you need to
> tediously list all the other columns.
> The API as I implemented it looks like this:
> {code:java}
> /**
>  * @param df            The DataFrame to de-duplicate
>  * @param ignoredColumn The one column whose value you need only one of
>  * @param aggFunction   Aggregation applied to the ignored column; default is max
>  * @return The de-duplicated DataFrame
>  */
> def dedupByAllExcept(df: DataFrame,
>                      ignoredColumn: String,
>                      aggFunction: String => Column = functions.max): DataFrame
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
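
To make the intended semantics of dedupByAllExcept concrete, here is a minimal sketch using plain Scala collections rather than Spark. It is not DataFu's implementation: the Event case class, its field names, and the helper dedupByAllExceptEventId are hypothetical and exist only for illustration. Rows are grouped by every field except the ignored one (eventId here), and one eventId is kept per group using max, the default named in the issue.

```scala
// Hypothetical event record; the field names are illustrative only.
final case class Event(eventId: Long, userId: String, action: String)

object DedupByAllExceptSketch {
  // Group rows by every field except eventId and keep the row with the
  // largest eventId in each group (mirrors the documented default, max).
  def dedupByAllExceptEventId(events: Seq[Event]): Seq[Event] =
    events
      .groupBy(e => (e.userId, e.action)) // all columns except the ignored one
      .values
      .map(_.maxBy(_.eventId))            // one surviving eventId per group
      .toSeq
      .sortBy(_.eventId)                  // stable order for display only

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event(1L, "alice", "login"),
      Event(2L, "alice", "login"),  // same as row 1 apart from eventId
      Event(3L, "alice", "logout"), // differs in a real field, so it is kept
      Event(4L, "bob", "login")
    )
    // Rows 1 and 2 collapse into one row carrying eventId 2;
    // rows 3 and 4 are kept unchanged.
    dedupByAllExceptEventId(events).foreach(println)
  }
}
```

The point of the proposed method is that the caller names only the ignored column; the sketch spells out the grouping key by hand, which is exactly the tedium the issue wants to remove for wide DataFrames.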