As Jay correctly suggested: since you're already joining to filter out the rows that already exist, use overwrite rather than append. The join is what removes the duplicates; a plain append won't.

I think, in this scenario, just changing it to write.mode('overwrite') would do, because you're already reading the old data, and your job would be done.
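Something along these lines (a rough, untested sketch; it assumes the data set consists of just the four key columns in your query, and staging_path is a placeholder you would need to fill in):

# Rough, untested sketch. Assumes the data set is just the four key
# columns from the query below; staging_path is a placeholder.
keys = ['source', 'source_id', 'target', 'target_id']

existing_df = spark.read.parquet(existing_data_path).select(*keys)
new_df = spark.read.parquet(new_data_path).select(*keys)

# Same anti-join as in the SQL, via the DataFrame API: keep only the
# new rows that are not already present in the existing data.
only_new_df = new_df.distinct().join(existing_df, keys, 'left_anti')

combined_df = existing_df.unionByName(only_new_df)  # Spark 2.3+

# Spark can't overwrite a path it is still lazily reading from, so
# materialize to a staging location first, then overwrite the original.
combined_df.write.parquet(staging_path, compression='gzip')
spark.read.parquet(staging_path) \
    .coalesce(1) \
    .write.parquet(existing_data_path, mode='overwrite', compression='gzip')

The staging hop matters: overwrite deletes the target directory before writing, and a lazy read against that same directory would otherwise lose its source files mid-job.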
On Sat 2 Jun, 2018, 10:27 PM Benjamin Kim, <bbuil...@gmail.com> wrote:

> Hi Jay,
>
> Thanks for your response. Are you saying to append the new data and then
> remove the duplicates from the whole data set afterwards, overwriting the
> existing data set with the new data set with the appended values? I will
> give that a try.
>
> Cheers,
> Ben
>
> On Fri, Jun 1, 2018 at 11:49 PM Jay <jayadeep.jayara...@gmail.com> wrote:
>
>> Benjamin,
>>
>> The append will append the "new" data to the existing data without
>> removing the duplicates. You would need to overwrite the file every time
>> if you need unique values.
>>
>> Thanks,
>> Jayadeep
>>
>> On Fri, Jun 1, 2018 at 9:31 PM Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> I have a situation where I am trying to add only new rows to an existing
>>> data set that lives in S3 as gzipped Parquet files, looping and appending
>>> for each hour of the day. First, I create a DF from the existing data;
>>> then I use a query to create another DF with the data that is new. Here
>>> is the code snippet:
>>>
>>> df = spark.read.parquet(existing_data_path)
>>> df.createOrReplaceTempView('existing_data')
>>> new_df = spark.read.parquet(new_data_path)
>>> new_df.createOrReplaceTempView('new_data')
>>> append_df = spark.sql(
>>>     """
>>>     WITH ids AS (
>>>         SELECT DISTINCT
>>>             i.source,
>>>             i.source_id,
>>>             i.target,
>>>             i.target_id
>>>         FROM new_data i
>>>         LEFT ANTI JOIN existing_data im
>>>             ON i.source = im.source
>>>             AND i.source_id = im.source_id
>>>             AND i.target = im.target
>>>             AND i.target_id = im.target_id
>>>     )
>>>     SELECT * FROM ids
>>>     """
>>> )
>>> append_df.coalesce(1).write.parquet(existing_data_path, mode='append',
>>>     compression='gzip')
>>>
>>> I thought this would append new rows and keep the data unique, but I am
>>> seeing many duplicates. Can someone help me with this and tell me what I
>>> am doing wrong?
>>>
>>> Thanks,
>>> Ben
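And for the append-then-dedupe route Ben asks about above, a second rough sketch (again untested; dedup_staging_path is a placeholder, and it assumes the same four key columns):

# Combine existing and new data, drop duplicates on the key columns,
# then overwrite. dedup_staging_path is a placeholder.
all_df = spark.read.parquet(existing_data_path) \
    .unionByName(spark.read.parquet(new_data_path))

deduped_df = all_df.dropDuplicates(['source', 'source_id', 'target', 'target_id'])

# As before, materialize to a staging location before overwriting the
# path being read.
deduped_df.write.parquet(dedup_staging_path, compression='gzip')
spark.read.parquet(dedup_staging_path) \
    .coalesce(1) \
    .write.parquet(existing_data_path, mode='overwrite', compression='gzip')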