I have a new version
<https://docs.google.com/document/d/1FMKh_SQ6xSUUmoCA8LerTkzIxDUN5JbStQp5Hzot4eo/edit?usp=sharing>
of the proposal, updated to reflect our discussion.

I have misgivings about the elimination of the unique row ID. It makes
reading the dataset potentially much more expensive. We eliminated it in
order to support "global" deletes, but we might want to revisit whether we
need to handle that use case now. If we do, we may want to handle it in a
way that is distinct from the (hopefully more common) case of deleting by
unique row ID.
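
To make the cost difference concrete, here is a minimal sketch of the two
delete shapes (the type and field names are my own, not from the proposal).
A row-ID delete names exactly one row in one file, while a global delete
has to be evaluated against every file it could match:

```java
import java.util.Map;

// Hypothetical types for illustration only; not the proposal's schema.

// With a unique row ID, a delete pinpoints one row in one data file, so a
// reader only needs to consult deletes whose filePath matches the file it
// is scanning:
record RowIdDelete(String filePath, long rowOffset) {}

// A "global" delete matches rows by column values and can apply to any
// data file, so every scan has to evaluate it unless partition metadata
// proves it cannot match:
record EqualityDelete(Map<String, Object> equalityValues) {}
```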

On Wed, Jul 3, 2019 at 2:44 PM Owen O'Malley <owen.omal...@gmail.com> wrote:

> It works for me too.
>
> .. Owen
>
> On Jul 3, 2019, at 11:27, Anton Okolnychyi <aokolnyc...@apple.com.invalid>
> wrote:
>
> Works for me too.
>
> On 3 Jul 2019, at 19:09, Erik Wright <erik.wri...@shopify.com.INVALID>
> wrote:
>
> That works for me.
>
> On Wed, Jul 3, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> How about 9AM PDT on Friday, 5 July then?
>>
>> On Wed, Jul 3, 2019 at 10:55 AM Owen O'Malley <owen.omal...@gmail.com>
>> wrote:
>>
>>> I'd like to call in, but I'm out Thursday. Friday would work except 11am
>>> to 1pm PDT.
>>>
>>> .. Owen
>>>
>>> On Wed, Jul 3, 2019 at 10:42 AM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> I'm available Thursday and Friday this week as well, but it's a holiday
>>>> in the US, so some people may be out. If there are no objections from
>>>> anyone who would like to attend, then I'm up for that.
>>>>
>>>> On Wed, Jul 3, 2019 at 10:40 AM Anton Okolnychyi <aokolnyc...@apple.com>
>>>> wrote:
>>>>
>>>>> I apologize for the delay on my side. I’ll still have to go through
>>>>> the last emails. I am available on Thursday/Friday this week, and it
>>>>> would be great to sync.
>>>>>
>>>>> Thanks,
>>>>> Anton
>>>>>
>>>>> On 3 Jul 2019, at 01:29, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>>>>>
>>>>> Sorry I didn't get back to this thread last week. Let's try to have a
>>>>> video call to sync up on this next week. What days would work for 
>>>>> everyone?
>>>>>
>>>>> rb
>>>>>
>>>>> On Fri, Jun 21, 2019 at 9:06 AM Erik Wright <erik.wri...@shopify.com>
>>>>> wrote:
>>>>>
>>>>>> With regard to the operation values, they are currently:
>>>>>>
>>>>>>    - append: data files were added and no files were removed.
>>>>>>    - replace: data files were rewritten with the same data; i.e.,
>>>>>>    compaction, changing the data file format, or relocating data files.
>>>>>>    - overwrite: data files were deleted and added in a logical
>>>>>>    overwrite operation.
>>>>>>    - delete: data files were removed and their contents logically
>>>>>>    deleted.
>>>>>>
>>>>>> If deletion files (with or without data files) are appended to the
>>>>>> dataset, will we consider that an `append` operation? If so, if deletion
>>>>>> and/or data files are appended, and whole files are also deleted, will we
>>>>>> consider that an `overwrite`?
>>>>>>
>>>>>> Given that the only apparent purpose of the operation field is to
>>>>>> optimize snapshot expiration, the above seems to meet its needs. An
>>>>>> incremental reader can also skip `replace` snapshots, but no others.
>>>>>> Once it decides to read a snapshot, I don't think there's any
>>>>>> difference in how it processes the data for the
>>>>>> append/overwrite/delete cases.
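>>>>>>
>>>>>> As a rough sketch of that reader behaviour (the enum and class names
>>>>>> here are illustrative, not the actual Iceberg API):
>>>>>>
>>>>>> ```java
>>>>>> import java.util.List;
>>>>>> import java.util.stream.Collectors;
>>>>>>
>>>>>> // Illustrative sketch only; names are assumptions.
>>>>>> enum Operation { APPEND, REPLACE, OVERWRITE, DELETE }
>>>>>>
>>>>>> record Snapshot(long snapshotId, Operation operation) {}
>>>>>>
>>>>>> class IncrementalReader {
>>>>>>   // A replace snapshot rewrites files without changing the table's
>>>>>>   // logical contents, so an incremental consumer can skip it; every
>>>>>>   // other operation changes the contents and must be processed.
>>>>>>   static List<Snapshot> toProcess(List<Snapshot> newSnapshots) {
>>>>>>     return newSnapshots.stream()
>>>>>>         .filter(s -> s.operation() != Operation.REPLACE)
>>>>>>         .collect(Collectors.toList());
>>>>>>   }
>>>>>> }
>>>>>> ```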
>>>>>>
>>>>>> On Thu, Jun 20, 2019 at 8:55 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>
>>>>>>> I don’t see that we need [sequence numbers] for file/offset deletes,
>>>>>>> since they apply to a specific file. They’re not harmful, but they
>>>>>>> don’t seem relevant.
>>>>>>>
>>>>>>> These delete files will probably contain a path and an offset and
>>>>>>> could contain deletes for multiple files. In that case, the sequence
>>>>>>> number can be used to eliminate delete files that don’t need to be
>>>>>>> applied to a particular data file, just like the column equality
>>>>>>> deletes. Likewise, it can be used to drop the delete files when there
>>>>>>> are no data files with an older sequence number.
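>>>>>>>
>>>>>>> A minimal sketch of that read-time check (the record and field names
>>>>>>> are assumptions, not the proposal's schema):
>>>>>>>
>>>>>>> ```java
>>>>>>> // A delete file only affects data files written before it, so any
>>>>>>> // delete that is not newer than the data file can be skipped.
>>>>>>> record DataFile(String path, long sequenceNumber) {}
>>>>>>> record DeleteFile(String path, long sequenceNumber) {}
>>>>>>>
>>>>>>> class SequenceNumberPruning {
>>>>>>>   static boolean mayApply(DeleteFile delete, DataFile data) {
>>>>>>>     return delete.sequenceNumber() > data.sequenceNumber();
>>>>>>>   }
>>>>>>> }
>>>>>>> ```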
>>>>>>>
>>>>>>> I don’t understand the purpose of the min sequence number, nor what
>>>>>>> the “min data seq” is.
>>>>>>>
>>>>>>> Min sequence number would be used for pruning delete files without
>>>>>>> reading all the manifests to find out if there are old data files. If
>>>>>>> no manifest with data for a partition contains a file older than some
>>>>>>> sequence number N, then any delete file with a sequence number < N
>>>>>>> can be removed.
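>>>>>>>
>>>>>>> Sketched out (again with illustrative names), assuming each manifest
>>>>>>> records the minimum sequence number of the data files it tracks:
>>>>>>>
>>>>>>> ```java
>>>>>>> import java.util.List;
>>>>>>>
>>>>>>> record ManifestFile(String path, long minDataSequenceNumber) {}
>>>>>>>
>>>>>>> class DeleteFileExpiration {
>>>>>>>   // If no live data file is older than the delete file, the delete
>>>>>>>   // can never match a row again and can be removed. Using the
>>>>>>>   // manifest-level minimums avoids opening the manifests themselves.
>>>>>>>   static boolean canRemove(long deleteSeq, List<ManifestFile> manifests) {
>>>>>>>     long minLiveDataSeq = manifests.stream()
>>>>>>>         .mapToLong(ManifestFile::minDataSequenceNumber)
>>>>>>>         .min()
>>>>>>>         .orElse(Long.MAX_VALUE);
>>>>>>>     return deleteSeq < minLiveDataSeq;
>>>>>>>   }
>>>>>>> }
>>>>>>> ```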
>>>>>>>
>>>>>> OK, so the minimum sequence number is an attribute of manifest files.
>>>>>> Sounds good. It should also let us optimize compaction operations
>>>>>> (i.e., you can easily limit the operation to a subset of manifest
>>>>>> files, as long as they are the oldest ones).
>>>>>>
>>>>>>
>>>>>>> The “min data seq” is the minimum sequence number of a data file.
>>>>>>> That seems like what we actually want for the pruning I described above.
>>>>>>>
>>>>>> I would expect a data file (appended rows or deletions by column
>>>>>> value) to have a single sequence number that applies to the whole file.
>>>>>> Even a delete-by-file-and-offset file can do with only a single sequence
>>>>>> number (which must be larger than the sequence numbers of all deleted
>>>>>> files). Why do we need a "minimum" data sequence per file?
>>>>>>
>>>>>>> Off the top of my head [supporting non-key delete] requires adding
>>>>>>> additional information to the manifest file, indicating the columns that
>>>>>>> are used for the deletion. Only equality would be supported; if multiple
>>>>>>> columns were used, they would be combined with boolean-and. I don’t see
>>>>>>> anything too tricky about it.
>>>>>>>
>>>>>>> Yes, exactly. I actually phrased it wrong initially. I think it
>>>>>>> would be simple to extend the equality deletes to do this. We just
>>>>>>> need a way to have global scope, not just partition scope.
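>>>>>>>
>>>>>>> For instance, matching a multi-column equality delete could look like
>>>>>>> this (a sketch; rows and deletes are modeled as plain maps here):
>>>>>>>
>>>>>>> ```java
>>>>>>> import java.util.Map;
>>>>>>> import java.util.Objects;
>>>>>>>
>>>>>>> class EqualityDeleteMatcher {
>>>>>>>   // A row matches the delete only when it is equal on every listed
>>>>>>>   // column, i.e. the delete columns are combined with boolean AND.
>>>>>>>   static boolean matches(Map<String, Object> row,
>>>>>>>                          Map<String, Object> delete) {
>>>>>>>     return delete.entrySet().stream()
>>>>>>>         .allMatch(e -> Objects.equals(row.get(e.getKey()), e.getValue()));
>>>>>>>   }
>>>>>>> }
>>>>>>> ```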
>>>>>>>
>>>>>> I don't think anything special needs to be done with regard to
>>>>>> scoping/partitioning of delete files. When scanning one or more data
>>>>>> files, one must also consider any and all deletion files that could
>>>>>> apply to them. The only way to prune a deletion file from
>>>>>> consideration is when all of the following hold:
>>>>>>
>>>>>>    1. All of your data files have at least one partition column in
>>>>>>    common.
>>>>>>    2. The deletion file is also partitioned on that column (at
>>>>>>    least).
>>>>>>    3. The value sets of the data files do not overlap the value sets
>>>>>>    of the deletion files in that column.
>>>>>>
>>>>>> So, for example, given a dataset of sessions that is partitioned by
>>>>>> device form factor and date, you could have a delete (user_id=9876) in
>>>>>> an unpartitioned deletion file, and it would be "in scope" for all of
>>>>>> those data files.
>>>>>>
>>>>>> If you had the same dataset partitioned by hash(user_id), and your
>>>>>> deletes were _also_ partitioned by hash(user_id), you would be able to
>>>>>> prune those deletes while scanning the sessions.
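>>>>>>
>>>>>> A sketch of that pruning check (partition metadata is modeled here as
>>>>>> a map from partition column name to the set of values present; the
>>>>>> names are illustrative):
>>>>>>
>>>>>> ```java
>>>>>> import java.util.Map;
>>>>>> import java.util.Set;
>>>>>>
>>>>>> class PartitionPruning {
>>>>>>   // A deletion file can be skipped for a data file only when the two
>>>>>>   // share a partition column and their value sets for that column are
>>>>>>   // disjoint. An unpartitioned deletion file is never skipped.
>>>>>>   static boolean canSkip(Map<String, Set<Object>> dataPartitions,
>>>>>>                          Map<String, Set<Object>> deletePartitions) {
>>>>>>     for (Map.Entry<String, Set<Object>> e : dataPartitions.entrySet()) {
>>>>>>       Set<Object> deleteValues = deletePartitions.get(e.getKey());
>>>>>>       if (deleteValues != null
>>>>>>           && deleteValues.stream().noneMatch(e.getValue()::contains)) {
>>>>>>         return true;  // disjoint on a shared partition column
>>>>>>       }
>>>>>>     }
>>>>>>     return false;
>>>>>>   }
>>>>>> }
>>>>>> ```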
>>>>>>
>>>>>>> If we add this on a per-deletion-file basis, it is not clear whether
>>>>>>> there is any relevance in preserving the concept of a unique row ID.
>>>>>>>
>>>>>>> Agreed. That’s why I’ve been steering us away from the debate about
>>>>>>> whether keys are unique or not. Either way, a natural key delete must
>>>>>>> delete all of the records it matches.
>>>>>>>
>>>>>>> I would assume that the maximum sequence number should appear in the
>>>>>>> table metadata
>>>>>>>
>>>>>>> Agreed.
>>>>>>>
>>>>>>> [W]ould you make it optional to assign a sequence number to a
>>>>>>> snapshot? “Replace” snapshots would not need one.
>>>>>>>
>>>>>>> The only requirement is that it is monotonically increasing. If one
>>>>>>> isn’t used, we don’t have to increment. I’d say it is up to the
>>>>>>> implementation to decide. I would probably increment it every time to
>>>>>>> avoid errors.
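>>>>>>>
>>>>>>> Something like this (a sketch of the increment-every-time policy; the
>>>>>>> class name is illustrative):
>>>>>>>
>>>>>>> ```java
>>>>>>> import java.util.concurrent.atomic.AtomicLong;
>>>>>>>
>>>>>>> class SequenceNumberAssigner {
>>>>>>>   private final AtomicLong lastSequenceNumber = new AtomicLong(0);
>>>>>>>
>>>>>>>   // Incrementing on every commit, even for replace snapshots that do
>>>>>>>   // not strictly need one, keeps the monotonicity invariant trivial.
>>>>>>>   long nextSequenceNumber() {
>>>>>>>     return lastSequenceNumber.incrementAndGet();
>>>>>>>   }
>>>>>>> }
>>>>>>> ```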
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
