> We have two rows with the same natural key, and we use that natural key in
> diff files:
>
>     nk | col1 | col2
>      1 |    1 |    1
>      1 |    2 |    2
>
> Then we have a delete statement:
>
>     DELETE FROM t WHERE col1 = 1

I think this example cuts to the heart of our difference in understanding.
Does Iceberg want to approach the utility of a relational database, against
which I can execute complex update queries? That is not what I would have
imagined. I would instead have imagined that it is up to the client to
identify, through whatever means, that it wants to update or delete a row
with a given ID. If there are multiple (distinct) rows with the same ID,
_too bad_: any user should _expect_ that they could see any one or more of
those rows at read time, and that an upsert/delete would affect any or all
of them (I would argue for all).
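To make that concrete, here is a toy sketch of the two candidate semantics
over an in-memory table (plain Python; nothing here is an Iceberg API):

    # Toy in-memory "table" -- two rows share natural key nk=1.
    rows = [
        {"nk": 1, "col1": 1, "col2": 1},
        {"nk": 1, "col1": 2, "col2": 2},
    ]

    # Interpretation A (relational): DELETE FROM t WHERE col1 = 1 removes
    # exactly the rows matching the predicate -- one row survives.
    predicate_scoped = [r for r in rows if r["col1"] != 1]

    # Interpretation B (key-scoped, what I'd expect a client to do):
    # resolve the predicate to natural keys, then delete every row
    # carrying those keys -- both duplicates go.
    keys = {r["nk"] for r in rows if r["col1"] == 1}
    key_scoped = [r for r in rows if r["nk"] not in keys]

    print(predicate_scoped)  # [{'nk': 1, 'col1': 2, 'col2': 2}]
    print(key_scoped)        # []

Key-scoped deletion is the simpler, more general contract: the engine never
has to decide which of the duplicates "really" matched.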
*In summary:* instead of trying to come up with consistent, logical handling
for complex queries that are best suited to a relational database, leave such
handling up to the client and concentrate on problems that can be solved
simply and more generally.

On Wed, May 22, 2019 at 12:11 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Yes, I think we should. I was going to propose one after catching up on
> the rest of this thread today.
>
> On Wed, May 22, 2019 at 9:08 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>
>> Thanks! Would it make sense to discuss the agenda in advance?
>>
>> On 22 May 2019, at 17:04, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>>
>> I sent out an invite and included everyone on this thread. If anyone else
>> would like to join, please join the Zoom meeting. If you'd like to be
>> added to the calendar invite, just let me know and I'll add you.
>>
>> On Wed, May 22, 2019 at 8:57 AM Jacques Nadeau <jacq...@dremio.com> wrote:
>>
>>> Works for me.
>>>
>>> To make things easier, we can use my Zoom meeting if people like:
>>>
>>> Join Zoom Meeting: https://zoom.us/j/4157302092
>>> Meeting ID: 415 730 2092
>>> Find your local number: https://zoom.us/u/aH9XYBfm
>>>
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>> On Wed, May 22, 2019 at 8:54 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>
>>>> 9 AM on Friday works best for me. How about then?
>>>>
>>>> On Wed, May 22, 2019 at 5:05 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>>>>
>>>>> What about this Friday? A one-hour slot from 9:00 to 10:00 am or
>>>>> 10:00 to 11:00 am PST? Some folks are based in London, so meeting
>>>>> later than this is hard. If Friday doesn't work, we can consider
>>>>> Tuesday or Wednesday next week.
>>>>>
>>>>> On 22 May 2019, at 00:54, Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>
>>>>> I agree with Anton that we should probably spend some time on
>>>>> hangouts further discussing things. There are definitely differing
>>>>> expectations here, and we seem to be talking a bit past each other.
>>>>>
>>>>> --
>>>>> Jacques Nadeau
>>>>> CTO and Co-Founder, Dremio
>>>>>
>>>>> On Tue, May 21, 2019 at 3:44 PM Cristian Opris <cop...@apple.com.invalid> wrote:
>>>>>
>>>>>> I love a good flame war :P
>>>>>>
>>>>>> On 21 May 2019, at 22:57, Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>>
>>>>>>> That's my point: truly independent writers (two Spark jobs, or a
>>>>>>> Spark job and a Dremio job) mean a distributed transaction.
>>>>>>> It would need yet another external transaction coordinator on top
>>>>>>> of both Spark and Dremio; Iceberg by itself cannot solve this.
>>>>>>
>>>>>> I'm not ready to accept this. Iceberg already supports a set of
>>>>>> semantics around multiple writers committing simultaneously and how
>>>>>> conflict resolution is done. The same can be done here.
>>>>>>
>>>>>> MVCC (which is what Iceberg tries to implement) requires a total
>>>>>> ordering of snapshots. The snapshots also need to be
>>>>>> non-conflicting. I really don't see how any metadata data structures
>>>>>> can solve this without an outside coordinator.
>>>>>>
>>>>>> Consider this:
>>>>>>
>>>>>>     Snapshot 0: (K, A) = 1
>>>>>>     Job X: UPDATE K SET A = A + 1
>>>>>>     Job Y: UPDATE K SET A = 10
>>>>>>
>>>>>> What should the final value of A be, and who decides?
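Cristian's race is worth pinning down, because it shows what optimistic
concurrency can and cannot decide. A toy Python sketch (an in-memory
snapshot list and a made-up commit(); this is not Iceberg's actual commit
protocol):

    # Toy sketch of the ordering problem: no coordinator, plain Python.
    # Two "jobs" each read a base snapshot, compute, and try to commit.
    snapshots = [{"A": 1}]                        # Snapshot 0: (K, A) = 1

    def commit(base_version, new_state):
        # Optimistic commit: refuse if someone committed after our base.
        if base_version != len(snapshots) - 1:
            raise RuntimeError("conflict: base snapshot is stale, retry")
        snapshots.append(new_state)

    base_x = len(snapshots) - 1                   # Job X reads snapshot 0
    state_x = {"A": snapshots[base_x]["A"] + 1}   # UPDATE K SET A = A + 1

    base_y = len(snapshots) - 1                   # Job Y also reads snapshot 0
    state_y = {"A": 10}                           # UPDATE K SET A = 10

    commit(base_y, state_y)                       # Y lands first: A = 10
    try:
        commit(base_x, state_x)                   # X's base is now stale
    except RuntimeError as err:
        print(err)
    # A blind retry of X's precomputed result would give A = 2 and lose
    # Y's write; a correct retry must re-read (A = 10) and produce 11.

An optimistic protocol answers the "who decides" question mechanically:
whoever commits second against a stale base must re-read and reapply. What
it cannot do is invent a serializable order for the two writes after the
fact, which I'd argue is exactly the kind of relational complexity to leave
to the client.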
>>>>>>> By single writer, I don't mean a single process; I mean multiple
>>>>>>> coordinated processes, like Spark executors coordinated by a Spark
>>>>>>> driver. The coordinator ensures that the data is pre-partitioned on
>>>>>>> each executor, and the coordinator commits the snapshot.
>>>>>>>
>>>>>>> Note, however, that a single writer job with multiple concurrent
>>>>>>> reader jobs is perfectly feasible, i.e. it shouldn't be a problem
>>>>>>> to write from a Spark job and read from multiple Dremio queries
>>>>>>> concurrently (for example).
>>>>>>
>>>>>> :D This is still "single process" from my perspective. That process
>>>>>> may be coordinating other processes to do distributed work, but
>>>>>> ultimately it is a single process.
>>>>>>
>>>>>> Fair enough.
>>>>>>
>>>>>>> I'm not sure what you mean exactly. If we can't enforce uniqueness,
>>>>>>> we shouldn't assume it.
>>>>>>
>>>>>> I disagree. We can specify it as a requirement and state that you'll
>>>>>> get unintended consequences if you provide your own keys and don't
>>>>>> maintain uniqueness.
>>>>>>
>>>>>> There's no need for unintended consequences; we can specify
>>>>>> consistent behaviour (and I believe the document says what that is).
>>>>>>
>>>>>>> We do expect that most of the time the natural key is unique, but
>>>>>>> the eager and lazy natural-key designs can handle duplicates
>>>>>>> consistently. Basically, it's not a problem to have duplicate
>>>>>>> natural keys; everything works fine.
>>>>>>
>>>>>> That heavily depends on how things are implemented. For example, we
>>>>>> may write a bunch of code that generates internal data structures
>>>>>> based on this expectation. If we have to support duplicate matches,
>>>>>> all of a sudden we can no longer size various data structures to
>>>>>> improve performance, and we may be unable to preallocate memory
>>>>>> associated with a guaranteed completion.
>>>>>>
>>>>>> Again, we need to operate on the assumption that this is a
>>>>>> large-scale distributed compute/remote storage scenario. Key
>>>>>> matching is done with shuffles, with data movement across the
>>>>>> network; such optimizations would have little impact on overall
>>>>>> performance. Not to mention that most query engines already optimize
>>>>>> the shuffle as much as it can be optimized.
>>>>>>
>>>>>> It is true that actual duplicate keys would make the key-matching
>>>>>> join (anti-join) somewhat more expensive; however, it can be done in
>>>>>> such a way that if the keys are in practice unique, the join is as
>>>>>> efficient as it can be.
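On the cost of duplicate keys in the anti-join: with a hash build over the
delete keys, duplicates on the data side cost nothing extra, since each row
is a single probe either way. A toy single-process sketch (plain Python
standing in for the distributed shuffle; names are made up):

    # Lazy delete application as a hash anti-join on the natural key.
    data_file = [
        {"nk": 1, "col1": 1},
        {"nk": 1, "col1": 2},   # duplicate natural key
        {"nk": 2, "col1": 3},
    ]
    delete_file_keys = [1]      # the diff/delete file stores keys only

    # Build side: hash the (typically small) set of deleted keys once.
    deleted = set(delete_file_keys)

    # Probe side: stream the data file; a duplicated key costs nothing
    # extra, since every row is one O(1) probe regardless.
    live_rows = [r for r in data_file if r["nk"] not in deleted]
    print(live_rows)            # [{'nk': 2, 'col1': 3}]; both nk=1 rows drop

The build side only grows with the number of distinct deleted keys, so
keys that are unique in practice hit the fast path automatically.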
>>>>>>> Let me try to clarify each point:
>>>>>>>
>>>>>>> - Lookup for query or update on a non-(partition/bucket/sort) key
>>>>>>>   predicate implies scanning large amounts of data, because those
>>>>>>>   are the only data structures that can narrow down the lookup,
>>>>>>>   right? One could argue that the min/max index (file skipping) can
>>>>>>>   be applied to any column, but in reality, if that column is not
>>>>>>>   sorted, the min/max intervals can have huge overlaps, so it may
>>>>>>>   be next to useless.
>>>>>>> - Remote storage: this is a critical architecture decision.
>>>>>>>   Implementations on local storage imply a vastly different design
>>>>>>>   for the entire system, storage and compute.
>>>>>>> - Deleting single records per snapshot is unfeasible in the eager
>>>>>>>   design, but particularly so in the lazy design: each deletion
>>>>>>>   creates a very small snapshot. Deleting 1 million records one at
>>>>>>>   a time would create 1 million small files and 1 million RPC calls.
>>>>>>
>>>>>> Why is this unfeasible? If I have a dataset of 100mm files including
>>>>>> 1mm small files, is that a major problem? It seems like your use
>>>>>> case isn't one where you want to support single-record deletes, but
>>>>>> it is definitely something important to many people.
>>>>>>
>>>>>> 100mm total files, or 1mm files per dataset, is definitely a problem
>>>>>> on HDFS, and I believe on S3 too. Single-key deletes would work just
>>>>>> fine, but it's simply not optimal to do that on remote storage. This
>>>>>> is a very well-known problem with HDFS, and one of the very reasons
>>>>>> to have something like Iceberg in the first place.
>>>>>>
>>>>>> Basically, users would be able to do single-key mutations, but it's
>>>>>> not the use case we should be optimizing for, and it's really not
>>>>>> advisable.
>>>>>>
>>>>>>> Eager is conceptually just lazy + compaction done, well, eagerly.
>>>>>>> The logic for both is exactly the same; the trade-off is just that
>>>>>>> with eager you implicitly compact every time so that you don't do
>>>>>>> any work on read, while with lazy you want to amortize the cost of
>>>>>>> compaction over multiple snapshots.
>>>>>>>
>>>>>>> Basically, there should be no difference between the two
>>>>>>> conceptually, or with regard to keys, etc. The only difference is
>>>>>>> some mechanics in implementation.
>>>>>>
>>>>>> I think you have deconstructed the problem too much to say these are
>>>>>> the same (or at least that is what I'm starting to think given this
>>>>>> thread). It seems like real-world implementation decisions (per our
>>>>>> discussion here) are in conflict. For example, you just argued
>>>>>> against having 1mm arbitrary mutations, but I think that is because
>>>>>> you aren't thinking about things over time with a delta
>>>>>> implementation. Having 10,000 mutations a day, where we do delta
>>>>>> compaction once a week and keep local file mappings (key-to-offset
>>>>>> sparse bitmaps), seems like it could result in very good performance
>>>>>> in a case where we're mutating small amounts of data. In this
>>>>>> scenario, you may never do major compaction unless a high enough
>>>>>> percentage of records in the original dataset has been deleted. That
>>>>>> drives a very different set of implementation decisions from a
>>>>>> situation where you're trying to restate an entire partition at once.
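Jacques's scheme is easy to make concrete. A toy sketch, with Python sets
standing in for the key-to-offset sparse bitmaps and compact() playing the
weekly delta compaction (all names made up; this is not an existing Iceberg
API):

    # Toy sketch: record deletes as tiny per-commit deltas of row
    # positions; fold them into the base file only at compaction time.
    base_file = [{"nk": k, "v": k * 10} for k in range(8)]   # immutable base

    delete_deltas = []   # one small "delta file" per commit

    def delete_where(pred):
        # Append a tiny delta: positions (offsets) of rows to hide.
        delete_deltas.append({i for i, r in enumerate(base_file) if pred(r)})

    def read():
        # Readers merge base + deltas on the fly (the "lazy" read cost).
        dead = set().union(*delete_deltas) if delete_deltas else set()
        return [r for i, r in enumerate(base_file) if i not in dead]

    def compact():
        # Delta compaction: rewrite the base once, drop all deltas.
        global base_file, delete_deltas
        base_file = read()
        delete_deltas = []

    delete_where(lambda r: r["nk"] == 3)    # day 1: one tiny delta
    delete_where(lambda r: r["nk"] >= 6)    # day 2: another
    print(len(read()))                      # 5 rows visible, base untouched
    compact()                               # one rewrite amortizes both
    print(len(base_file), len(delete_deltas))   # 5 0

Each day's deletes cost one tiny delta file; the expensive rewrite happens
once, on whatever cadence amortizes it.

>>>>>> We operate on at least 1 billion mutations per day. This is the
>>>>>> problem Iceberg wants to solve; I believe it's stated upfront.
>>>>>> 10,000/day is not a big-data problem. It can be done fairly
>>>>>> trivially and it would be supported, but I believe there's not much
>>>>>> point in extra optimization for this use case.
>
> --
> Ryan Blue
> Software Engineer
> Netflix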