If you only care about deduping on a single field, you can tag each frame
with a source index and count like so:

from pyspark.sql.functions import col, collect_set, lit, size

df3 = df1.withColumn('idx', lit(1)) \
    .union(df2.withColumn('idx', lit(2)))

remove_df = df3 \
    .groupBy('id') \
    .agg(collect_set('idx').alias('idx_set')) \
    .filter(size(col('idx_set')) > 1) \
    .select('id', lit(2).alias('idx'))

# The duplicated ids above are now coded for df2, so only those rows
# will be dropped.

df3.join(remove_df, on=['id', 'idx'], how='leftanti')
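If it helps to see the mechanics without a Spark session, here is the same
keep-df1-on-duplicate logic sketched in plain Python (the sample rows and
the 'val' column are made up for illustration, not from the thread):

```python
from collections import defaultdict

# Hypothetical sample data standing in for df1 and df2.
df1 = [{'id': 1, 'val': 'a1'}, {'id': 2, 'val': 'b1'}]
df2 = [{'id': 2, 'val': 'b2'}, {'id': 3, 'val': 'c2'}]

# Tag each row with its source, mirroring the lit(1)/lit(2) index columns.
df3 = [dict(r, idx=1) for r in df1] + [dict(r, idx=2) for r in df2]

# Collect the set of source indexes per id, like collect_set('idx');
# ids seen with more than one distinct idx are the cross-frame duplicates,
# and only their df2 (idx=2) copies should go.
idx_sets = defaultdict(set)
for r in df3:
    idx_sets[r['id']].add(r['idx'])
remove = {(i, 2) for i, s in idx_sets.items() if len(s) > 1}

# "left anti" join: keep rows whose (id, idx) key is not in remove.
result = [r for r in df3 if (r['id'], r['idx']) not in remove]
```

For id 2, which appears in both frames, the df1 row ('b1') survives and the
df2 row ('b2') is dropped; ids unique to either frame pass through untouched.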

On Fri, Sep 13, 2019 at 11:44 AM Abhinesh Hada <abhinesh...@gmail.com>
wrote:

> Hi,
>
> I am trying to take the union of 2 dataframes and then drop duplicates
> based on the value of a specific column. But, I want to make sure that
> while dropping duplicates, the rows from the first data frame are kept.
>
> Example:
> df1 = df1.union(df2).dropDuplicates(['id'])
>
>
>

-- 


Patrick McCarthy

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016
