The original reason why I added UUID to the spec was that I thought there
would be opportunities to take advantage of UUIDs as unique values and to
optimize the use of UUIDs. I was thinking about auto-increment ID fields
and how we might do something similar in Iceberg.

The reason we have thought about removing UUID is that there aren't as many
opportunities to take advantage of UUIDs as I thought. My original
assumption was that we could do things like bucket on UUID fields or assume
that a UUID field has a high NDV. But that's not necessarily the case when
a UUID field is a foreign key, only when it is used as an identifier
or primary key. Before Jack added tracking for row identifier fields, we
couldn't know that a UUID was unique in a table. As a result, we didn't
invest in support for UUID.

Quick aside: Now that row identifier fields are tracked, we can do some of
these things with the row identifier fields. Engines can assume that the
tuple of row identifier fields is unique in a table for join estimation.
And engines can use row identifier fields in sort keys to ensure lots of
partition split locations (this is really important for Spark).

Coming back to UUIDs, the second reason to have a UUID type is still valid:
it is better to represent UUIDs as fixed[16] than as 36-byte UTF-8 strings
that are more than twice as large, or even worse, UTF-16 strings that are over 4x
as large. Since UUIDs are likely to be used in joins, this could really
help engines as long as they can keep the values as fixed-width binary.
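
To make the size difference concrete, here is a rough sketch using Python's
standard uuid module. The helper name uuid_encodings is hypothetical, and this
only illustrates the byte counts; it is not how Iceberg or any engine stores
values.

```python
import uuid


def uuid_encodings(value: uuid.UUID) -> dict:
    """Return the same UUID in three encodings (hypothetical helper)."""
    text = str(value)  # canonical 36-character form with hyphens
    return {
        "fixed16": value.bytes,             # 16-byte fixed-width binary
        "utf8": text.encode("utf-8"),       # 1 byte per character
        "utf16": text.encode("utf-16-le"),  # 2 bytes per character, no BOM
    }


value = uuid.uuid4()
encodings = uuid_encodings(value)
sizes = {name: len(data) for name, data in encodings.items()}
print(sizes)

# The binary form is lossless: the original UUID can be rebuilt from it.
restored = uuid.UUID(bytes=encodings["fixed16"])
```

The string forms come out at 36 and 72 bytes against 16 for the binary form,
and the binary form round-trips back to the same UUID.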

I could go either way on this. I think it is valuable to have a compact
representation for UUIDs rather than using the string representation. But
that will require investing in the type and building support in engines
that won't take advantage of it. If Trino can use this, I think it may be
worth keeping and investing in.

Ryan

On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <yezhao...@gmail.com> wrote:

> Yes I agree with Jacques that fixed binary is what it is in the end. I
> think it is more about user experience, whether the conversion is done at
> the user side or Iceberg and engine side. Many people just store UUID as a
> 36 byte string instead of a 16 byte binary, so with an explicit UUID type,
> Iceberg can optimize this common use case internally for users. There might
> be some other benefits I overlooked, but maybe the complication introduced
> by this type does not really justify the slightly better user experience. I
> am also on the fence about it.
>
> -Jack Ye
>
> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau <jacquesnad...@gmail.com>
> wrote:
>
>> What specific arguments are there for it being a first class type besides
>> it is elsewhere? Is there some kind of optimization iceberg or an engine
>> could do if it was typed versus just a bucket of bits? Fixed width binary
>> seems to cover the cases I see in terms of actual functionality in the
>> iceberg libraries or engines…
>>
>>
>>
>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <yyany...@gmail.com> wrote:
>>
>>> One conversation I used to come across regarding UUID deprecation was
>>> from https://github.com/apache/iceberg/pull/1611
>>>
>>> Thanks,
>>> Yan
>>>
>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary <pv...@cloudera.com.invalid>
>>> wrote:
>>>
>>>> Hi Joshua,
>>>>
>>>> I do not have a strong preference about the UUID type, but I would like
>>>> to highlight that the type is handled inconsistently in Iceberg across
>>>> different file formats. (See:
>>>> https://github.com/apache/iceberg/issues/1881)
>>>>
>>>> If we keep the type, it would be good to standardize the handling in
>>>> every file format.
>>>>
>>>> Thanks, Peter
>>>>
>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <joshthow...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi.
>>>>>
>>>>> UUID is a current data type according to the Iceberg spec (
>>>>> https://iceberg.apache.org/spec/#primitive-types), but there seems to
>>>>> have been some discussion about removing it? I could not find the original
>>>>> discussion, but a reference to the discussion can be found here (
>>>>> https://github.com/trinodb/trino/issues/6663).
>>>>>
>>>>> I generally agree with the consensus in the Trino issue to keep UUID
>>>>> in Iceberg. To summarize…
>>>>>
>>>>> - It makes sense to keep the type now that row identifiers are
>>>>> supported
>>>>> - Some engines (Trino) have support for the UUID type
>>>>> - Engines w/o support for UUID type can determine how to map
>>>>>
>>>>> Does anyone want to remove the type? If so, why?
>>>>
>>>>

-- 
Ryan Blue
Tabular
