As Jay correctly suggested: if you're already joining against the existing
data, then overwrite; append on its own does not remove the duplicates.

I think in this scenario you can just change it to write.mode('overwrite'),
because you're already reading the old data, and your job would be done.
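
Something along these lines, roughly (an untested sketch: it assumes df and
append_df from your snippet below share the same four columns, and
staging_path is just a placeholder, because overwriting the very path you are
still lazily reading from can lose data):

full_df = df.union(append_df) \
    .dropDuplicates(['source', 'source_id', 'target', 'target_id'])
full_df.coalesce(1).write.parquet(staging_path, mode='overwrite',
                                  compression='gzip')

Once that write succeeds you can swap the staging location in for the old one.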


On Sat 2 Jun, 2018, 10:27 PM Benjamin Kim, <bbuil...@gmail.com> wrote:

> Hi Jay,
>
> Thanks for your response. Are you saying to append the new data and then
> remove the duplicates from the whole data set afterwards, overwriting the
> existing data set with the new, appended data set? I will give that a
> try.
>
> Cheers,
> Ben
>
> On Fri, Jun 1, 2018 at 11:49 PM Jay <jayadeep.jayara...@gmail.com> wrote:
>
>> Benjamin,
>>
>> The append will append the "new" data to the existing data without
>> removing the duplicates. You would need to overwrite the file every time
>> if you need unique values.
>>
>> Thanks,
>> Jayadeep
>>
>> On Fri, Jun 1, 2018 at 9:31 PM Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> I have a situation where I am trying to add only new rows to an existing
>>> data set that lives in S3 as gzipped parquet files, looping and appending
>>> for each hour of the day. First, I create a DF from the existing data, then
>>> I use a query to create another DF with the data that is new. Here is the
>>> code snippet.
>>>
>>> df = spark.read.parquet(existing_data_path)
>>> df.createOrReplaceTempView('existing_data')
>>> new_df = spark.read.parquet(new_data_path)
>>> new_df.createOrReplaceTempView('new_data')
>>> append_df = spark.sql(
>>>         """
>>>         WITH ids AS (
>>>             SELECT DISTINCT
>>>                 i.source,
>>>                 i.source_id,
>>>                 i.target,
>>>                 i.target_id
>>>             FROM new_data i
>>>             LEFT ANTI JOIN existing_data im
>>>             ON i.source = im.source
>>>             AND i.source_id = im.source_id
>>>             AND i.target = im.target
>>>             AND i.target_id = im.target_id
>>>         )
>>>         SELECT * FROM ids
>>>         """
>>>     )
>>> append_df.coalesce(1).write.parquet(existing_data_path, mode='append',
>>> compression='gzip')
>>>
>>>
>>> I thought this would append new rows and keep the data unique, but I am
>>> seeing many duplicates. Can someone help me with this and tell me what I am
>>> doing wrong?
>>>
>>> Thanks,
>>> Ben
>>>
>>
