If you only care about deduplicating on one of the fields, you can add an index column and count, like so:
from pyspark.sql.functions import lit, col, collect_set, size

df3 = df1.withColumn('idx', lit(1)).union(df2.withColumn('idx', lit(2)))

remove_df = (df3
    .groupBy('id')
    .agg(collect_set('idx').alias('set_size'))
    .filter(size(col('set_size')) > 1)
    .select('id', lit(2).alias('idx')))

# The duplicated ids above are now coded for df2, so only those rows will be dropped.
df3.join(remove_df, on=['id', 'idx'], how='leftanti')

On Fri, Sep 13, 2019 at 11:44 AM Abhinesh Hada <abhinesh...@gmail.com> wrote:

> Hi,
>
> I am trying to take the union of 2 dataframes and then drop duplicates
> based on the value of a specific column. But I want to make sure that
> while dropping duplicates, the rows from the first dataframe are kept.
>
> Example:
> df1 = df1.union(df2).dropDuplicates(['id'])

--
*Patrick McCarthy*
Senior Data Scientist, Machine Learning Engineering
Dstillery
470 Park Ave South, 17th Floor, NYC 10016
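For anyone wanting to check the intended semantics without spinning up a Spark session, here is a plain-Python sketch of the same "union, then dedup with df1 winning" behavior. Rows are dicts, the function name and 'src' field are illustrative only, not from the thread:

```python
# Plain-Python sketch: union df1 and df2, drop duplicate ids,
# always keeping the df1 row when an id appears in both.
def union_prefer_first(df1, df2, key='id'):
    seen = {}
    for row in df1 + df2:               # df1 rows are inserted first...
        seen.setdefault(row[key], row)  # ...so later df2 duplicates are ignored
    return list(seen.values())

df1 = [{'id': 1, 'src': 'df1'}, {'id': 2, 'src': 'df1'}]
df2 = [{'id': 2, 'src': 'df2'}, {'id': 3, 'src': 'df2'}]
result = union_prefer_first(df1, df2)
print(result)
```

Note that this mirrors the left-anti join above: id 2 exists in both inputs, and the surviving row is the one from df1.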