Eyal Allweil created DATAFU-177:
-----------------------------------
Summary: Add dedupAllExceptBy
Key: DATAFU-177
URL: https://issues.apache.org/jira/browse/DATAFU-177
Project: DataFu
Issue Type: Improvement
Reporter: Eyal Allweil
A new method for when you want to de-duplicate records without losing any
"real" data.
For example, suppose a server creates events with an autogenerated event id,
and some events are occasionally duplicated. You don't want duplicate rows that
differ only in their event ids, but if any of the other fields differ you want
to keep the rows (with their original event ids); otherwise you could simply
drop the event id column and de-duplicate. Today, to keep at least one event id
per group of otherwise-identical rows, you need to tediously list all the other
columns.
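To make the scenario concrete, here is a minimal sketch that models it with plain Scala collections instead of Spark. The Event class, its fields, and the sample rows are hypothetical and only serve to show why listing every "real" column by hand gets tedious.

```scala
// Hypothetical event shape; only eventId is autogenerated by the server.
final case class Event(eventId: Long, user: String, action: String)

object ManualDedup {
  // Group on every field except eventId and keep the row with the largest id
  // in each group. Every "real" field has to be spelled out in the grouping key.
  def dedup(events: Seq[Event]): Seq[Event] =
    events
      .groupBy(e => (e.user, e.action))
      .values
      .map(_.maxBy(_.eventId))
      .toSeq
      .sortBy(_.eventId)

  val events: Seq[Event] = Seq(
    Event(1L, "alice", "login"),
    Event(2L, "alice", "login"),  // same as event 1 apart from the id
    Event(3L, "alice", "logout")  // differs in a real field, so it is kept
  )
}
```

With these rows, ManualDedup.dedup(ManualDedup.events) keeps events 2 and 3: one of the two "login" rows is collapsed away, while the "logout" row survives with its original id.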
The API as I implemented it looks like this:
{code:java}
/**
 * @param df The DataFrame to de-duplicate
 * @param ignoredColumn The one column whose value you need only one of
 * @param aggFunction Aggregation applied to the ignored column. Default is max
 * @return The de-duplicated DataFrame
 */
def dedupByAllExcept(df: DataFrame,
                     ignoredColumn: String,
                     aggFunction: String => Column = functions.max): DataFrame
{code}
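For illustration, one way a method with this signature could be written on top of the standard Spark DataFrame API is sketched below. This is an assumption about a possible approach, not the actual DataFu implementation; the DedupSketch object name is hypothetical, and the sketch assumes column names without dots or other characters that would need escaping.

```scala
import org.apache.spark.sql.{Column, DataFrame, functions}

// Hypothetical container object; not part of DataFu.
object DedupSketch {
  def dedupByAllExcept(df: DataFrame,
                       ignoredColumn: String,
                       aggFunction: String => Column = functions.max): DataFrame = {
    // Group by every column except the ignored one, so nobody has to list them by hand.
    // Assumes plain column names (no dots that functions.col would parse as nested fields).
    val groupCols: Array[Column] =
      df.columns.filter(_ != ignoredColumn).map(name => functions.col(name))

    df.groupBy(groupCols: _*)
      // Collapse the ignored column to a single value per group, keeping its name.
      .agg(aggFunction(ignoredColumn).as(ignoredColumn))
      // Restore the original column order.
      .select(df.columns.map(name => functions.col(name)): _*)
  }
}
```

Called as DedupSketch.dedupByAllExcept(events, "event_id"), this would keep one row per distinct combination of the remaining columns, with the largest event_id of each group. Passing functions.min instead would keep the smallest.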