I would propose having a series of sessions over Hangouts to clarify all
pending points. We can start this week if there is a time slot that works for
everyone.

Potential topics (feel free to suggest yours):
- Use cases

I believe it is critical that everyone is on the same page when it comes to our 
target use cases. There were some comments in the doc and in this thread 
related to this topic. I think we should start with this.

- Should Iceberg support both eager and lazy approaches?

Iceberg is a table format, and it seems reasonable to support both. One or the
other might be more beneficial depending on the particular use case.
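
To make the distinction concrete, here is a rough sketch (plain Python, purely
illustrative, not Iceberg APIs) of how the two approaches split the work for a
delete: eager pays the rewrite cost at write time, while lazy records a diff
and defers the merge to readers and compaction.

# Rough sketch contrasting eager and lazy deletes; illustrative only, not Iceberg code.
data_file = [
    {"id": 1, "col1": 1},
    {"id": 2, "col1": 5},
]
predicate = lambda row: row["col1"] == 5

# Eager (copy-on-write): rewrite the data file without the matching rows at write time.
rewritten_file = [row for row in data_file if not predicate(row)]

# Lazy (merge-on-read): write time only records which positions were deleted...
delete_diff = [pos for pos, row in enumerate(data_file) if predicate(row)]

# ...and readers merge the diff until a later compaction rewrites the file.
merged = [row for pos, row in enumerate(data_file) if pos not in delete_diff]

assert merged == rewritten_file  # same logical table, the work just happens at different times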

- Synthetic vs natural keys in diff files

This is a fundamental design decision and we need to ensure that it works for 
every target use case.

I would like to emphasize the difference between having a natural key 
constraint and using a natural key in diff files.

It seems we agree that it is not feasible to ensure the uniqueness of natural 
keys in Iceberg itself. As a table format, Iceberg can and in fact should have 
a notion of a natural key, but query engines would need to respect it while 
writing. If Iceberg cannot guarantee the uniqueness of natural keys, this 
raises the question of whether we can rely on that uniqueness in our diff 
files. For example, if a query engine violates the natural key constraint and 
inserts two rows with the same natural key, we might end up in a bad situation 
while doing updates/deletes. Consider the example below.

We have two rows with the same natural key and we use that natural key in diff 
files:
nk | col1 | col2
 1 |    1 |    1
 1 |    2 |    2

Then we have a delete statement:
DELETE FROM t WHERE col1 = 1

If we use the natural key in diff files, then we will delete both records even 
though only one of them matched our delete predicate.
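
To spell out the failure mode, here is a rough sketch (plain Python, not
Iceberg code) of a lazy delete whose diff file stores only the natural key.
Once the key is duplicated, the read-time merge drops both rows even though
the predicate matched just one of them.

# Rough sketch of merge-on-read with natural-key diff files; illustrative only.
data_file = [
    {"nk": 1, "col1": 1, "col2": 1},
    {"nk": 1, "col1": 2, "col2": 2},  # violates the natural key constraint
]

# DELETE FROM t WHERE col1 = 1 matches only the first row...
matched = [row for row in data_file if row["col1"] == 1]

# ...but the diff file records only the natural keys of the matched rows.
delete_diff = {row["nk"] for row in matched}  # {1}

# At read time the diff is applied by natural key, so BOTH rows disappear.
result = [row for row in data_file if row["nk"] not in delete_diff]
assert result == []  # the second row is gone although it never matched the predicate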

So, the only way to make this reliable is either to enforce the uniqueness of 
natural keys or to limit the operations that are supported (i.e., we allow 
predicates on the natural key columns ONLY in our updates/deletes/upserts, as 
sketched below).
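
If we go the second route, every update/delete/upsert would have to be
rejected unless its predicate references natural key columns only. A trivial
sketch of such a check (plain Python, column names made up for illustration):

# Rough sketch of restricting row-level operations to key-only predicates.
NATURAL_KEY_COLUMNS = {"nk"}  # assumed natural key definition for table t

def is_supported(predicate_columns):
    # Allow the operation only if every referenced column is part of the natural key.
    return set(predicate_columns) <= NATURAL_KEY_COLUMNS

assert is_supported(["nk"])        # DELETE FROM t WHERE nk = 1   -> allowed
assert not is_supported(["col1"])  # DELETE FROM t WHERE col1 = 1 -> rejected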

Synthetic keys don't have that problem, as we can ensure their uniqueness in 
Iceberg itself (see the sketch below). So, I would disagree with Cristian that 
synthetic keys offer no benefits. My personal opinion is that we should keep 
this question open and properly discuss it during our session. The same goes for
having multiple writers and conflict resolution. Iceberg already has some 
support for this and we shouldn’t break it by introducing 
updates/deletes/upserts.
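
For contrast with the natural-key example above, here is a rough sketch (again
plain Python, not a proposal for the actual encoding) of a synthetic key
derived from the data file and row position. The writer can assign such keys
uniquely on its own, so the same DELETE removes exactly the row that matched.

# Rough sketch of diff files keyed by a synthetic (file, position) key; illustrative only.
data_file = "part-00000.parquet"  # hypothetical file name
rows = [
    {"nk": 1, "col1": 1, "col2": 1},
    {"nk": 1, "col1": 2, "col2": 2},  # a duplicate natural key is no longer a problem
]

# The writer assigns a unique synthetic key to every row without any constraint checks.
keyed = [((data_file, pos), row) for pos, row in enumerate(rows)]

# DELETE FROM t WHERE col1 = 1 records only the synthetic keys of the matched rows.
delete_diff = {key for key, row in keyed if row["col1"] == 1}

# At read time only the matched row is dropped; the duplicate survives.
result = [row for key, row in keyed if key not in delete_diff]
assert result == [{"nk": 1, "col1": 2, "col2": 2}]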

Thanks,
Anton

> On 21 May 2019, at 22:57, Jacques Nadeau <jacq...@dremio.com> wrote:
> 
> 
> That's my point, truly independent writers (two Spark jobs, or a Spark job 
> and Dremio job) means a distributed transaction. It would need yet another 
> external transaction coordinator on top of both Spark and Dremio, Iceberg by 
> itself 
> cannot solve this.
>  
> I'm not ready to accept this. Iceberg already supports a set of semantics 
> around multiple writers committing simultaneously and how conflict resolution 
> is done. The same can be done here.
>  
> By single writer, I don't mean single process, I mean multiple coordinated 
> processes like Spark executors coordinated by Spark driver. The coordinator 
> ensures that the data is pre-partitioned on 
> each executor, and the coordinator commits the snapshot. 
> 
> Note however that single writer job/multiple concurrent reader jobs is 
> perfectly feasible, i.e. it shouldn't be a problem to write from a Spark job 
> and read from multiple Dremio queries concurrently (for example)
> 
> :D This is still "single process" from my perspective. That process may be 
> coordinating other processes to do distributed work but ultimately it is a 
> single process. 
>  
> I'm not sure what you mean exactly. If we can't enforce uniqueness we 
> shouldn't assume it.
>  
> I disagree. We can specify that as a requirement and state that you'll get 
> unintended consequences if you provide your own keys and don't maintain this.
>  
> We do expect that most of the time the natural key is unique, but the eager 
> and lazy with natural key designs can handle duplicates 
> consistently. Basically it's not a problem to have duplicate natural keys, 
> everything works fine.
> 
> That heavily depends on how things are implemented. For example, we may write 
> a bunch of code that generates internal data structures based on this 
> expectation. If we have to support duplicate matches, all of a sudden we can no 
> longer size various data structures to improve performance and may be unable 
> to preallocate memory associated with a guaranteed completion.
> 
> Let me try and clarify each point:
> 
> - lookup for query or update on a non-(partition/bucket/sort) key predicate 
> implies scanning large amounts of data - because these are the only data 
> structures that can narrow down the lookup, right ? One could argue that the 
> min/max index (file skipping) can be applied to any column, but in reality if 
> that column is not sorted the min/max intervals can have huge overlaps so it 
> may be next to useless.
> - remote storage - this is a critical architecture decision - implementations 
> on local storage imply a vastly different design for the entire system, 
> storage and compute. 
> - deleting single records per snapshot is unfeasible in eager but also 
> particularly in the lazy design: each deletion creates a very small snapshot. 
> Deleting 1 million records one at a time would create 1 million small files, 
> and 1 million RPC calls.
> 
> Why is this unfeasible? If I have a dataset of 100mm files including 1mm 
> small files, is that a major problem? It seems like your usecase isn't one 
> where you want to support single record deletes but it is definitely 
> something important to many people.
>  
> Eager is conceptually just lazy + compaction done, well, eagerly. The logic 
> for both is exactly the same, the trade-off is just that with eager you 
> implicitly compact every time so that you don't do any work on read, while 
> with lazy 
> you want to amortize the cost of compaction over multiple snapshots.
> 
> Basically there should be no difference between the two conceptually, or with 
> regard to keys, etc. The only difference is some mechanics in implementation.
> 
> I think you have deconstructed the problem too much to say these are the same 
> (or at least that is what I'm starting to think given this thread). It seems 
> like real world implementation decisions (per our discussion here) are in 
> conflict. For example, you just argued against having 1mm arbitrary 
> mutations but I think that is because you aren't thinking about things over 
> time with a delta implementation. Having 10,000 mutations a day where we do 
> delta compaction once a week and local file mappings (key to offset sparse 
> bitmaps) seems like it could result in very good performance in a case where 
> we're mutating small amounts of data. In this scenario, you may not do major 
> compaction ever unless you get to a high enough percentage of records that 
> have been deleted in the original dataset. That drives a very different set 
> of implementation decisions from a situation where you're trying to restate 
> an entire partition at once.
