Yes, I agree. I'll talk a little about a couple of the constraints of this as well.
On Fri, May 24, 2019 at 5:52 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote: > The agenda looks good to me. I think it would also make sense to clarify > the responsibilities of query engines and Iceberg. Not only in terms of > uniqueness, but also in terms of applying diffs on read, for example. > > On 23 May 2019, at 01:59, Ryan Blue <rb...@netflix.com.INVALID> wrote: > > Here’s a rough agenda: > > - Use cases: everyone come with a use case that you’d like to have > supported. We’ll go around and introduce ourselves and our use cases. > - Main topic: How should Iceberg identify rows that are deleted? > - Side topics from my initial email, if we have time: should we use > insert diffs, should we support dense and sparse formats, etc. > > The main topic I think we should discuss is: *How should Iceberg identify > rows that are deleted?* > > I’m phrasing it this way to avoid where I think we’re talking past one > another because we are making assumptions. The important thing is that > there are two main options: > > - Filename and position, vs > - Specific values of (few) columns in the data > > This phrasing also avoids discussing uniqueness constraints. Once we get > down to behavior, I think we agree. For example, I think we all agree that > uniqueness cannot be enforced in Iceberg. > > If uniqueness can’t be enforced in Iceberg, the main choice comes down to > how we identify rows that are deleted. If we use (filename, position) then > we know that there is only one row. On the other hand, if we use data > values to identify rows then a delete may identify more than one row > because there are no uniqueness guarantees. I think we also agree that if > there is more than one row identified, all of them should be deleted. > > At that point, there are trade-offs between the approaches: > > - When identifying deleted rows by data values, situations like the > one that Anton pointed out are possible. > - Jacques also had a good point about concurrency. If at all possible, > we want to be able to reconcile changes between concurrent commits without > re-running an operation. > > Sound like a reasonable amount to talk through? > > rb > > On Wed, May 22, 2019 at 1:17 PM Erik Wright <erik.wri...@shopify.com> > wrote: > >> >> >> On Wed, May 22, 2019 at 4:04 PM Cristian Opris <cop...@apple.com.invalid> >> wrote: >> >>> Agreed with Erik here, we're certainly not looking to build the >>> equivalent of a relational database, and for that matter not even that of a >>> local disk storage analytics database (like Vertica). Those are very >>> different designs with very different trade-offs and optimizations. >>> >>> We're looking to automate and optimize specific types of file >>> manipulation for large files on remote storage, while presenting that to >>> the user under the common SQL API for *bulk* data manipulation (MERGE >>> INTO) >>> >> >> What I would encourage is to decouple the storage model from the >> implementation of that API. If Iceberg has support for merge-on-read of >> upserts and deletes, in addition to its powerful support for partitioning, >> it will be easy for a higher-level application to implement those APIs >> given certain other constraints (that might not be appropriate to all >> applications). >> >> Myself and Miguel are out on Friday, but Anton should be able to handle >>> the discussion on our side. >>> >>> >>> Thanks, >>> Cristian >>> >>> >>> On 22 May 2019, at 17:51, Erik Wright <erik.wri...@shopify.com.INVALID> >>> wrote: >>> >>> We have two rows with the same natural key and we use that natural key >>>> in diff files: >>>> nk | col1 | col2 >>>> 1 | 1 | 1 >>>> 1 | 2 | 2 >>>> Then we have a delete statement: >>>> DELETE FROM t WHERE col1 = 1 >>> >>> >>> I think this example cuts to the point of the differences of >>> understanding. Does Iceberg want to be approaching the utility of a >>> relational database, against which I can execute complex update queries? >>> This is not what I would have imagined. >>> >>> I would have, instead, imagined that it was up to the client to >>> identify, through whatever means, that they want to update or delete a row >>> with a given ID. If there are multiple (distinct) rows with the same ID, >>> _too bad_. Any user should _expect_ that they could potentially see any one >>> or more of those rows at read time. And that an upsert/delete would affect >>> any/all of them (I would argue for all). >>> >>> *In summary:* Instead of trying to come up with a consistent, logical >>> handling for complex queries that are best suited for a relational >>> database, leave such handling up to the client and concentrate on problems >>> that can be solved simply and more generally. >>> >>> On Wed, May 22, 2019 at 12:11 PM Ryan Blue <rb...@netflix.com.invalid> >>> wrote: >>> >>>> Yes, I think we should. I was going to propose one after catching up on >>>> the rest of this thread today. >>>> >>>> On Wed, May 22, 2019 at 9:08 AM Anton Okolnychyi <aokolnyc...@apple.com> >>>> wrote: >>>> >>>>> Thanks! Would it make sense to discuss the agenda in advance? >>>>> >>>>> On 22 May 2019, at 17:04, Ryan Blue <rb...@netflix.com.INVALID> wrote: >>>>> >>>>> I sent out an invite and included everyone on this thread. If anyone >>>>> else would like to join, please join the Zoom meeting. If you'd like to be >>>>> added to the calendar invite, just let me know and I'll add you. >>>>> >>>>> On Wed, May 22, 2019 at 8:57 AM Jacques Nadeau <jacq...@dremio.com> >>>>> wrote: >>>>> >>>>>> works for me. >>>>>> >>>>>> To make things easier, we can use my zoom meeting if people like: >>>>>> >>>>>> Join Zoom Meeting >>>>>> https://zoom.us/j/4157302092 >>>>>> >>>>>> One tap mobile >>>>>> +16465588656,,4157302092# US (New York) >>>>>> +16699006833,,4157302092# US (San Jose) >>>>>> >>>>>> Dial by your location >>>>>> +1 646 558 8656 US (New York) >>>>>> +1 669 900 6833 US (San Jose) >>>>>> 877 853 5257 US Toll-free >>>>>> 888 475 4499 US Toll-free >>>>>> Meeting ID: 415 730 2092 >>>>>> Find your local number: https://zoom.us/u/aH9XYBfm >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Jacques Nadeau >>>>>> CTO and Co-Founder, Dremio >>>>>> >>>>>> >>>>>> On Wed, May 22, 2019 at 8:54 AM Ryan Blue <rb...@netflix.com.invalid> >>>>>> wrote: >>>>>> >>>>>>> 9AM on Friday works best for me. How about then? >>>>>>> >>>>>>> On Wed, May 22, 2019 at 5:05 AM Anton Okolnychyi < >>>>>>> aokolnyc...@apple.com> wrote: >>>>>>> >>>>>>>> What about this Friday? One hour slot from 9:00 to 10:00 am or >>>>>>>> 10:00 to 11:00 am PST? Some folks are based in London, so meeting later >>>>>>>> than this is hard. If Friday doesn’t work, we can consider Tuesday or >>>>>>>> Wednesday next week. >>>>>>>> >>>>>>>> On 22 May 2019, at 00:54, Jacques Nadeau <jacq...@dremio.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>> I agree with Anton that we should probably spend some time on >>>>>>>> hangouts further discussing things. Definitely differing expectations >>>>>>>> here >>>>>>>> and we seem to be talking a bit past each other. >>>>>>>> -- >>>>>>>> Jacques Nadeau >>>>>>>> CTO and Co-Founder, Dremio >>>>>>>> >>>>>>>> >>>>>>>> On Tue, May 21, 2019 at 3:44 PM Cristian Opris < >>>>>>>> cop...@apple.com.invalid> wrote: >>>>>>>> >>>>>>>>> I love a good flame war :P >>>>>>>>> >>>>>>>>> On 21 May 2019, at 22:57, Jacques Nadeau <jacq...@dremio.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> That's my point, truly independent writers (two Spark jobs, or a >>>>>>>>>> Spark job and Dremio job) means a distributed transaction. It would >>>>>>>>>> need >>>>>>>>>> yet another external transaction coordinator on top of both Spark and >>>>>>>>>> Dremio, Iceberg by itself >>>>>>>>>> cannot solve this. >>>>>>>>>> >>>>>>>>> >>>>>>>>> I'm not ready to accept this. Iceberg already supports a set of >>>>>>>>> semantics around multiple writers committing simultaneously and how >>>>>>>>> conflict resolution is done. The same can be done here. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> MVCC (which is what Iceberg tries to implement) requires a total >>>>>>>>> ordering of snapshots. Also the snapshots need to be non-conflicting. >>>>>>>>> I >>>>>>>>> really don't see how any metadata data structures can solve this >>>>>>>>> without an >>>>>>>>> outside coordinator. >>>>>>>>> >>>>>>>>> Consider this: >>>>>>>>> >>>>>>>>> Snapshot 0: (K,A) = 1 >>>>>>>>> Job X: UPDATE K SET A=A+1 >>>>>>>>> Job Y: UPDATE K SET A=10 >>>>>>>>> >>>>>>>>> What should the final value of A be and who decides ? >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> By single writer, I don't mean single process, I mean multiple >>>>>>>>>> coordinated processes like Spark executors coordinated by Spark >>>>>>>>>> driver. The >>>>>>>>>> coordinator ensures that the data is pre-partitioned on >>>>>>>>>> each executor, and the coordinator commits the snapshot. >>>>>>>>>> >>>>>>>>>> Note however that single writer job/multiple concurrent reader >>>>>>>>>> jobs is perfectly feasible, i.e. it shouldn't be a problem to write >>>>>>>>>> from a >>>>>>>>>> Spark job and read from multiple Dremio queries concurrently (for >>>>>>>>>> example) >>>>>>>>>> >>>>>>>>> >>>>>>>>> :D This is still "single process" from my perspective. That >>>>>>>>> process may be coordinating other processes to do distributed work but >>>>>>>>> ultimately it is a single process. >>>>>>>>> >>>>>>>>> >>>>>>>>> Fair enough >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> I'm not sure what you mean exactly. If we can't enforce >>>>>>>>>> uniqueness we shouldn't assume it. >>>>>>>>>> >>>>>>>>> >>>>>>>>> I disagree. We can specify that as a requirement and state that >>>>>>>>> you'll get unintended consequences if you provide your own keys and >>>>>>>>> don't >>>>>>>>> maintain this. >>>>>>>>> >>>>>>>>> >>>>>>>>> There's no need for unintended consequences, we can specify >>>>>>>>> consistent behaviour (and I believe the document says what that is) >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> We do expect that most of the time the natural key is unique, but >>>>>>>>>> the eager and lazy with natural key designs can handle duplicates >>>>>>>>>> consistently. Basically it's not a problem to have duplicate >>>>>>>>>> natural keys, everything works fine. >>>>>>>>>> >>>>>>>>> >>>>>>>>> That heavily depends on how things are implemented. For example, >>>>>>>>> we may write a bunch of code that generates internal data structures >>>>>>>>> based >>>>>>>>> on this expectation. If we have to support duplicate matches, all of >>>>>>>>> sudden >>>>>>>>> we can no longer size various data structures to improve performance >>>>>>>>> and >>>>>>>>> may be unable to preallocate memory associated with a guaranteed >>>>>>>>> completion. >>>>>>>>> >>>>>>>>> >>>>>>>>> Again we need to operate on the assumption that this is a large >>>>>>>>> scale distributed compute/remote storage scenario. Key matching is >>>>>>>>> done >>>>>>>>> with shuffles with data movement across the network, such >>>>>>>>> optimizations >>>>>>>>> would really have little impact on overall performance. Not to >>>>>>>>> mention that >>>>>>>>> most query engines would already optimize the shuffle already as much >>>>>>>>> as it >>>>>>>>> can be optimized. >>>>>>>>> >>>>>>>>> It is true that if actual duplicate keys would make the key >>>>>>>>> matching join (anti-join) somewhat more expensive, however it can be >>>>>>>>> done >>>>>>>>> in such a way that if the keys are in practice unique the join is as >>>>>>>>> efficient as it can be. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Let me try and clarify each point: >>>>>>>>>> >>>>>>>>>> - lookup for query or update on a non-(partition/bucket/sort) key >>>>>>>>>> predicate implies scanning large amounts of data - because these are >>>>>>>>>> the >>>>>>>>>> only data structures that can narrow down the lookup, right ? One >>>>>>>>>> could >>>>>>>>>> argue that the min/max index (file skipping) can be applied to any >>>>>>>>>> column, >>>>>>>>>> but in reality if that column is not sorted the min/max intervals >>>>>>>>>> can have >>>>>>>>>> huge overlaps so it may be next to useless. >>>>>>>>>> - remote storage - this is a critical architecture decision - >>>>>>>>>> implementations on local storage imply a vastly different design for >>>>>>>>>> the >>>>>>>>>> entire system, storage and compute. >>>>>>>>>> - deleting single records per snapshot is unfeasible in eager but >>>>>>>>>> also particularly in the lazy design: each deletion creates a very >>>>>>>>>> small >>>>>>>>>> snapshot. Deleting 1 million records one at a time would create 1 >>>>>>>>>> million >>>>>>>>>> small files, and 1 million RPC calls. >>>>>>>>>> >>>>>>>>> >>>>>>>>> Why is this unfeasible? If I have a dataset of 100mm files >>>>>>>>> including 1mm small files, is that a major problem? It seems like your >>>>>>>>> usecase isn't one where you want to support single record deletes but >>>>>>>>> it is >>>>>>>>> definitely something important to many people. >>>>>>>>> >>>>>>>>> >>>>>>>>> 100 mm total files or 1 mm files per dataset is definitely a >>>>>>>>> problem on HDFS, and I believe on S3 too. Single key delete would >>>>>>>>> work just >>>>>>>>> fine, but it's simply not optimal to do that on remote storage. This >>>>>>>>> is a >>>>>>>>> very well known problem with HDFS, and one of the very reasons to have >>>>>>>>> something like Iceberg in the first place. >>>>>>>>> >>>>>>>>> Basically the users would be able to do single key mutation, but >>>>>>>>> it's not the use case we should be optimizing for, but it's really not >>>>>>>>> advisable. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> Eager is conceptually just lazy + compaction done, well, eagerly. >>>>>>>>>> The logic for both is exactly the same, the trade-off is just that >>>>>>>>>> with >>>>>>>>>> eager you implicitly compact every time so that you don't do any >>>>>>>>>> work on >>>>>>>>>> read, while with lazy >>>>>>>>>> you want to amortize the cost of compaction over multiple >>>>>>>>>> snapshots. >>>>>>>>>> >>>>>>>>>> Basically there should be no difference between the two >>>>>>>>>> conceptually, or with regard to keys, etc. The only difference is >>>>>>>>>> some >>>>>>>>>> mechanics in implementation. >>>>>>>>>> >>>>>>>>> >>>>>>>>> I think you have deconstruct the problem too much to say these are >>>>>>>>> the same (or at least that is what I'm starting to think given this >>>>>>>>> thread). It seems like real world implementation decisions (per our >>>>>>>>> discussion here) are in conflict. For example, you just argued against >>>>>>>>> having a 1mm arbitrary mutations but I think that is because you >>>>>>>>> aren't >>>>>>>>> thinking about things over time with a delta implementation. Having >>>>>>>>> 10,000 >>>>>>>>> mutations a day where we do delta compaction once a week >>>>>>>>> >>>>>>>>> and local file mappings (key to offset sparse bitmaps) seems like >>>>>>>>> it could result in very good performance in a case where we're >>>>>>>>> mutating >>>>>>>>> small amounts of data. In this scenario, you may not do major >>>>>>>>> compaction >>>>>>>>> ever unless you get to a high enough percentage of records that have >>>>>>>>> been >>>>>>>>> deleted in the original dataset. That drives a very different set of >>>>>>>>> implementation decisions from a situation where you're trying to >>>>>>>>> restate an >>>>>>>>> entire partition at once. >>>>>>>>> >>>>>>>>> >>>>>>>>> We operate on 1 billion mutations per day at least. This is the >>>>>>>>> problem Iceberg wants to solve, I believe it's stated upfront. >>>>>>>>> 10000/day is >>>>>>>>> not a big data problem. It can be done fairly trivially and it would >>>>>>>>> be >>>>>>>>> supported, but there's not much point in extra optimizing for this >>>>>>>>> use case >>>>>>>>> I believe. >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Ryan Blue >>>>>>> Software Engineer >>>>>>> Netflix >>>>>>> >>>>>> >>>>> >>>>> -- >>>>> Ryan Blue >>>>> Software Engineer >>>>> Netflix >>>>> >>>>> >>>>> >>>> >>>> -- >>>> Ryan Blue >>>> Software Engineer >>>> Netflix >>>> >>> >>> > > -- > Ryan Blue > Software Engineer > Netflix > > > -- Ryan Blue Software Engineer Netflix