works for me.

To make things easier, we can use my zoom meeting if people like:

Join Zoom Meeting
https://zoom.us/j/4157302092

One tap mobile
+16465588656,,4157302092# US (New York)
+16699006833,,4157302092# US (San Jose)

Dial by your location
        +1 646 558 8656 US (New York)
        +1 669 900 6833 US (San Jose)
        877 853 5257 US Toll-free
        888 475 4499 US Toll-free
Meeting ID: 415 730 2092
Find your local number: https://zoom.us/u/aH9XYBfm

--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Wed, May 22, 2019 at 8:54 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> 9AM on Friday works best for me. How about then?
>
> On Wed, May 22, 2019 at 5:05 AM Anton Okolnychyi <aokolnyc...@apple.com>
> wrote:
>
>> What about this Friday? A one-hour slot from 9:00 to 10:00 am or 10:00 to
>> 11:00 am PST? Some folks are based in London, so meeting later than this is
>> hard. If Friday doesn’t work, we can consider Tuesday or Wednesday next
>> week.
>>
>> On 22 May 2019, at 00:54, Jacques Nadeau <jacq...@dremio.com> wrote:
>>
>> I agree with Anton that we should probably spend some time on Hangouts
>> discussing things further. There are definitely differing expectations here,
>> and we seem to be talking a bit past each other.
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>>
>> On Tue, May 21, 2019 at 3:44 PM Cristian Opris <cop...@apple.com.invalid>
>> wrote:
>>
>>> I love a good flame war :P
>>>
>>> On 21 May 2019, at 22:57, Jacques Nadeau <jacq...@dremio.com> wrote:
>>>
>>>
>>>> That's my point: truly independent writers (two Spark jobs, or a Spark
>>>> job and a Dremio job) mean a distributed transaction. It would need yet
>>>> another external transaction coordinator on top of both Spark and Dremio;
>>>> Iceberg by itself cannot solve this.
>>>>
>>>
>>> I'm not ready to accept this. Iceberg already defines semantics for
>>> multiple writers committing simultaneously and for how conflicts are
>>> resolved. The same can be done here.
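>>>
>>> Concretely, the commit path looks roughly like this (a Scala sketch
>>> against the Iceberg Java API; the table location is hypothetical, and a
>>> real writer would attach data files rather than committing an empty
>>> append):
>>>
>>>   import org.apache.hadoop.conf.Configuration
>>>   import org.apache.iceberg.exceptions.CommitFailedException
>>>   import org.apache.iceberg.hadoop.HadoopTables
>>>
>>>   // hypothetical table location
>>>   val table = new HadoopTables(new Configuration()).load("/tmp/events")
>>>   try {
>>>     // a real writer would add its new files via .appendFile(dataFile)
>>>     table.newAppend().commit()
>>>   } catch {
>>>     case _: CommitFailedException =>
>>>       // another writer's snapshot won the race: refresh, re-validate, retry
>>>       table.refresh()
>>>   }
>>>
>>> Each writer prepares its snapshot independently, and the metadata swap is
>>> what serializes them, so "simultaneous" writers don't need an up-front
>>> coordinator.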
>>>
>>>
>>>
>>>
>>> MVCC (which is what Iceberg tries to implement) requires a total ordering
>>> of snapshots, and the snapshots need to be non-conflicting. I really don't
>>> see how any metadata structure can solve this without an outside
>>> coordinator.
>>>
>>> Consider this:
>>>
>>> Snapshot 0: (K,A) = 1
>>> Job X: UPDATE K SET A=A+1
>>> Job Y: UPDATE K SET A=10
>>>
>>> What should the final value of A be, and who decides?
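>>>
>>> To make the fork concrete (plain Scala, toy values):
>>>
>>>   val updateX: Int => Int = a => a + 1  // Job X: UPDATE K SET A = A + 1
>>>   val updateY: Int => Int = _ => 10     // Job Y: UPDATE K SET A = 10
>>>   val a0 = 1                            // Snapshot 0: (K, A) = 1
>>>   val xThenY = updateY(updateX(a0))     // serial order X, Y: A = 10
>>>   val yThenX = updateX(updateY(a0))     // serial order Y, X: A = 11
>>>
>>> Both orders are valid serializations; choosing one is exactly the job of a
>>> coordinator.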
>>>
>>>
>>>
>>>> By single writer, I don't mean a single process; I mean multiple
>>>> coordinated processes, like Spark executors coordinated by the Spark
>>>> driver. The coordinator ensures that the data is pre-partitioned on
>>>> each executor, and the coordinator commits the snapshot.
>>>>
>>>> Note, however, that a single writer job with multiple concurrent reader
>>>> jobs is perfectly feasible; i.e., it shouldn't be a problem to write from
>>>> a Spark job and read from multiple Dremio queries concurrently (for
>>>> example).
>>>>
>>>
>>> :D This is still "single process" from my perspective. That process may
>>> be coordinating other processes to do distributed work, but ultimately it
>>> is a single process.
>>>
>>>
>>> Fair enough
>>>
>>>
>>>
>>>> I'm not sure what you mean exactly. If we can't enforce uniqueness, we
>>>> shouldn't assume it.
>>>>
>>>
>>> I disagree. We can specify that as a requirement and state that you'll
>>> get unintended consequences if you provide your own keys and don't maintain
>>> this.
>>>
>>>
>>> There's no need for unintended consequences; we can specify consistent
>>> behaviour (and I believe the document says what that is).
>>>
>>>
>>>
>>>
>>>> We do expect that most of the time the natural key is unique, but the
>>>> eager and lazy designs with natural keys can handle duplicates
>>>> consistently. Basically, it's not a problem to have duplicate natural
>>>> keys; everything works fine.
>>>>
>>>
>>> That heavily depends on how things are implemented. For example, we may
>>> write a bunch of code that generates internal data structures based on
>>> this expectation. If we have to support duplicate matches, all of a sudden
>>> we can no longer size various data structures to improve performance, and
>>> we may be unable to preallocate memory associated with a guaranteed
>>> completion.
>>>
>>>
>>> Again, we need to operate on the assumption that this is a large-scale
>>> distributed compute/remote storage scenario. Key matching is done with
>>> shuffles that move data across the network, so such optimizations would
>>> have little impact on overall performance. Not to mention that most query
>>> engines already optimize the shuffle as much as it can be optimized.
>>>
>>> It is true that actual duplicate keys would make the key-matching join
>>> (anti-join) somewhat more expensive; however, it can be done in such a way
>>> that if the keys are in practice unique, the join is as efficient as it
>>> can be.
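>>>
>>> For example, the delete set can be applied with an anti-join in Spark (a
>>> rough sketch; the paths and the key column "id" are hypothetical):
>>>
>>>   import org.apache.spark.sql.SparkSession
>>>
>>>   val spark = SparkSession.builder().appName("apply-deletes").getOrCreate()
>>>   val data    = spark.read.parquet("/warehouse/t/data")     // base rows
>>>   val deletes = spark.read.parquet("/warehouse/t/deletes")  // keys to drop
>>>   // drop every row whose natural key appears in the delete set; with
>>>   // unique keys this is one hash probe per row
>>>   val live = data.join(deletes, Seq("id"), "left_anti")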
>>>
>>>
>>>
>>>> Let me try and clarify each point:
>>>>
>>>> - Lookup for query or update on a non-(partition/bucket/sort) key
>>>> predicate implies scanning large amounts of data, because those are the
>>>> only data structures that can narrow down the lookup. One could argue
>>>> that the min/max index (file skipping) can be applied to any column, but
>>>> in reality, if that column is not sorted, the min/max intervals can have
>>>> huge overlaps, so it may be next to useless (see the toy numbers after
>>>> this list).
>>>> - Remote storage is a critical architecture decision; implementations on
>>>> local storage imply a vastly different design for the entire system,
>>>> storage and compute.
>>>> - Deleting single records per snapshot is infeasible in the eager design,
>>>> and particularly in the lazy design: each deletion creates a very small
>>>> snapshot. Deleting 1 million records one at a time would create 1 million
>>>> small files and 1 million RPC calls.
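>>>>
>>>> (Toy numbers for the min/max point: on an unsorted column, file1 might
>>>> span [3, 9001] and file2 [1, 8990]; a predicate like x = 4711 falls
>>>> inside both ranges, so file skipping prunes nothing and both files get
>>>> scanned.)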
>>>>
>>>
>>> Why is this infeasible? If I have a dataset of 100mm files, including 1mm
>>> small files, is that a major problem? It seems like your use case isn't
>>> one where you want to support single-record deletes, but it is definitely
>>> something important to many people.
>>>
>>>
>>> 100mm total files, or 1mm files per dataset, is definitely a problem on
>>> HDFS, and I believe on S3 too. Single-key deletes would work just fine,
>>> but it's simply not optimal to do that on remote storage. This is a very
>>> well-known problem with HDFS, and one of the very reasons to have
>>> something like Iceberg in the first place.
>>>
>>> Basically, users would be able to do single-key mutations, but it's not
>>> the use case we should be optimizing for, and it's really not advisable.
>>>
>>>
>>>
>>>
>>>> Eager is conceptually just lazy + compaction done, well, eagerly. The
>>>> logic for both is exactly the same; the trade-off is just that with eager
>>>> you implicitly compact every time, so that you don't do any work on read,
>>>> while with lazy you want to amortize the cost of compaction over multiple
>>>> snapshots.
>>>>
>>>> Basically there should be no difference between the two conceptually, or
>>>> with regard to keys, etc. The only difference is some mechanics in
>>>> implementation.
>>>>
>>>
>>> I think you have deconstructed the problem too much to say these are the
>>> same (or at least that is what I'm starting to think given this thread).
>>> It seems like real-world implementation decisions (per our discussion
>>> here) are in conflict. For example, you just argued against having 1mm
>>> arbitrary mutations, but I think that is because you aren't thinking about
>>> things over time with a delta implementation. Having 10,000 mutations a
>>> day, where we do delta compaction once a week and local file mappings
>>> (key-to-offset sparse bitmaps), seems like it could result in very good
>>> performance in a case where we're mutating small amounts of data. In this
>>> scenario, you may never do major compaction unless you get to a high
>>> enough percentage of records that have been deleted in the original
>>> dataset. That drives a very different set of implementation decisions from
>>> a situation where you're trying to restate an entire partition at once.
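>>>
>>> To sketch the file mappings I have in mind (Scala, assuming the
>>> org.roaringbitmap library is on the classpath; names and offsets are
>>> hypothetical):
>>>
>>>   import org.roaringbitmap.RoaringBitmap
>>>
>>>   // per-data-file sparse bitmap of row offsets deleted by deltas
>>>   val deletedOffsets = new RoaringBitmap()
>>>   deletedOffsets.add(17)      // one delta deleted row 17 of this file
>>>   deletedOffsets.add(100042)  // a later delta deleted row 100042
>>>   // at read time a row survives if its offset is not in the bitmap
>>>   def isLive(offset: Int): Boolean = !deletedOffsets.contains(offset)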
>>>
>>>
>>> We operate on at least 1 billion mutations per day. This is the problem
>>> Iceberg wants to solve; I believe it's stated upfront. 10,000/day is not a
>>> big data problem. It can be done fairly trivially and it would be
>>> supported, but I believe there's not much point in over-optimizing for
>>> this use case.
>>>
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
