Re: [DISCUSS] UUID type

Piotr Findeisen Mon, 13 Sep 2021 04:30:26 -0700

Hi,

It seems we converged here that UUID should remain included.
I read this as consensus having been reached, but that may be subjective.
Did we objectively reach consensus on this?


From the Iceberg project's perspective there isn't anything to do, as UUID
already *is* part of the spec (
https://iceberg.apache.org/spec/#schemas-and-data-types).
The Trino Iceberg PR adding support for UUID,
https://github.com/trinodb/trino/pull/8747, was pending merge while this
conversation was ongoing.

Best,
PF



On Mon, Aug 2, 2021 at 6:22 AM Kyle B <[email protected]> wrote:

> Hi Ryan and all,
>
> That sounds like a reasonable reason to leave IP address types out. In my
> experience, dedicated IP address types are mostly found in logging tools
> and other things for sysadmins / DevOps etc.
>
> When querying data with IP addresses, I’ve seen it done quite a lot (eg
> security reasons) but usually stored as string or manipulated in a UDF.
> They’re not commonly supported types.
>
> I would also draw the line at UUID types.
>
> - Kyle Bendickson
>
> On Jul 30, 2021, at 3:15 PM, Ryan Blue <[email protected]> wrote:
>
> 
> Jacques, you make some good points here. I think my argument about
> usability leading to performance issues is a stronger argument for engines
> than for Iceberg. Still, there are inefficiencies in Iceberg if someone
> chooses to use a string in an engine that doesn't have a UUID type.
>
> Another thing to consider is cross-engine support. If Iceberg removes
> UUID, then Trino would probably translate to fixed[16]. That results in a
> table that's difficult to query in other engines, where people would
> probably choose to store the data as a string. On the other hand, if
> Iceberg keeps the UUID type then integrations would simply translate to the
> UUID string representation before passing data to the other engines.
> While the engines would be using 36-byte values in join keys, the user
> experience issue is fixed and the data is more compact on disk and in
> Iceberg's bounds metadata.
>
> While having a UUID type in Iceberg can't really help engines that don't
> support UUID take advantage of the type at runtime, it does seem slightly
> better to have the UUID type in general since at least one engine supports
> it and it provides the expected user experience with a compact
> representation.
>
> IPv4 addresses are a good thing to think about as well, since most of the
> same arguments apply. If we keep the UUID type, should we also add IPv4 or
> IPv6 types? I would probably draw the line at UUID because it helps in
> joins, which are an important operation. IPv4 representations aren't that
> big of an inconvenience unless you need to do IP manipulation, which is
> typically in a UDF and not the query engine. And you can always keep both
> representations in a table fairly inexpensively. Does this sound like a
> valid rationale for having UUID but not IP types?
>
> Ryan
>
> On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau <[email protected]>
> wrote:
>
>> It seems like Spark, Hive, Dremio and Impala all lack UUID as a native
>> type. Which engines are you thinking of that have a native UUID type
>> besides the Presto derivatives and support Iceberg?
>>
>> I agree that Trino should expose a UUID type on top of Iceberg tables.
>> All the user experience things that you are describing as important
>> (compact storage, friendly display, ddl, clean literals) are possible
>> without it being a first class type in Iceberg using a trino specific
>> property.
>>
>> I don't really have a strong opinion about UUID. In general, type bloat
>> is probably just a part of this kind of project. Generally, CHAR(X) and
>> VARCHAR(X) feel like much bigger concerns given that they exist in all of
>> the engines but not Iceberg--especially when we start talking about views.
>>
>> Some of this argues for physical vs logical type abstraction. (Something
>> that was always challenging in Parquet but also helped to resolve how these
>> types are managed in engines that don't support them.)
>>
>> thanks,
>> Jacques
>>
>> PS: Funny aside, the bloat on an ip address is actually worse than a
>> UUID, right? IPv4 = 4 bytes. IPv4 String = 15 bytes.... 15/4 => 275% bloat.
>> UUID 36/16 => 125% bloat.
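
For illustration, the arithmetic in the PS above can be checked with a short Python sketch. It assumes the longest dotted-quad IPv4 string (15 characters) and the canonical hyphenated UUID string (36 characters); the helper name `overhead_percent` and the sample UUID are hypothetical, chosen only for this example.

```python
import ipaddress
import uuid


def overhead_percent(text_len: int, binary_len: int) -> float:
    """Extra bytes of the text form, as a percentage of the binary form."""
    return (text_len - binary_len) / binary_len * 100


# Longest dotted-quad IPv4 string vs. its packed 4-byte form.
ip = ipaddress.IPv4Address("255.255.255.255")
ipv4_overhead = overhead_percent(len(str(ip)), len(ip.packed))

# Canonical hyphenated UUID string vs. its 16-byte form (sample value).
u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")
uuid_overhead = overhead_percent(len(str(u)), len(u.bytes))

print(ipv4_overhead, uuid_overhead)  # 275.0 125.0
```

Shorter IPv4 strings such as "10.0.0.1" have a smaller overhead, so 275% is the worst case rather than the typical one.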
>>
>> On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue <[email protected]> wrote:
>>
>>> I don't think this is just a problem in Trino.
>>>
>>> If there is no UUID type, then a user must choose between a 36-byte
>>> string and a 16-byte binary. That's not a good choice to force people into.
>>> If someone chooses binary, then it's harder to work with rows and construct
>>> queries even though there is a standard representation for UUIDs. To avoid
>>> the user headache, people will probably choose to store values as strings.
>>> Using a string would mean that more than half the value is needlessly
>>> discarded by default in Iceberg lower/upper bounds instead of keeping the
>>> entire value. And since engines don't know what's in the string, the full
>>> value must be used in comparison, which is extra work and extra space.
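
A rough sketch of the bounds point above, in Python. It assumes string bounds are truncated to 16 characters by default (the thread only says "more than half" is discarded, so the exact length is an assumption here) and it models only the lower bound as a plain prefix; `TRUNCATE_LENGTH` and `truncated_lower_bound` are hypothetical names, not Iceberg APIs.

```python
import uuid

# Assumed default truncation length for string bounds (not taken from the thread).
TRUNCATE_LENGTH = 16


def truncated_lower_bound(value: str, length: int = TRUNCATE_LENGTH) -> str:
    """Simplified lower bound: keep only the first `length` characters."""
    return value[:length]


u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")  # sample value
as_string = str(u)

bound = truncated_lower_bound(as_string)
discarded = len(as_string) - len(bound)

print(bound)          # prefix kept in the bound
print(discarded)      # characters dropped from the 36-character string
print(len(u.bytes))   # the binary form is 16 bytes, so it would fit whole
```

Under that assumption, 20 of the 36 characters are dropped from a string-typed bound, while the 16-byte binary value needs no truncation at all.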
>>>
>>> Inflated values may not be a problem in some cases. IPv4 addresses are
>>> one case where you could argue that it doesn't matter very much that they
>>> are typically stored as strings. But I expect the use of UUIDs to be common
>>> for ID columns because you can generate them without coordination (unlike
>>> an incrementing ID) and that's a concern because the use as an ID makes
>>> them likely to be join keys.
>>>
>>> If we want the values to be stored as 16-byte fixed, then we need to
>>> make it easy to get the expected string representation in and out, just
>>> like we do with date/time types. I don't think that's specific to any
>>> engine.
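
As a minimal sketch of what "in and out" could look like, here is the round trip between a 16-byte value and the standard string form using Python's `uuid` module. The function names `string_to_fixed16` and `fixed16_to_string` are hypothetical; an engine integration would use its own conversion layer.

```python
import uuid


def string_to_fixed16(text: str) -> bytes:
    """Parse the standard UUID string form into its 16-byte value."""
    return uuid.UUID(text).bytes


def fixed16_to_string(raw: bytes) -> str:
    """Render a 16-byte value in the standard hyphenated UUID form."""
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")
    return str(uuid.UUID(bytes=raw))


text = "123e4567-e89b-12d3-a456-426614174000"  # sample value
raw = string_to_fixed16(text)
round_tripped = fixed16_to_string(raw)

print(len(raw), round_tripped)
```

The stored value stays at 16 bytes and the user only ever sees the familiar 36-character form, much as a date is stored as a number but shown as a formatted string.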
>>>
>>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau <[email protected]>
>>> wrote:
>>>
>>>> I think points 1&2 don't really apply since a fixed width binary
>>>> already covers those properties.
>>>>
>>>> It seems like this isn't really a concern of iceberg but rather a
>>>> cosmetic layer that exists primarily (only?) in trino. In that case I would
>>>> be inclined to say that trino should just use custom metadata and a fixed
>>>> binary type. That way you still have the desired ux without exposing those
>>>> extra concepts to Iceberg. It actually feels like better encapsulation
>>>> imo.
>>>>
>>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I agree with Ryan that it takes some precautions before one can
>>>>> assume uniqueness of UUID values, and that this shouldn't be specific
>>>>> to UUIDs at all.
>>>>> After all, this is just a primitive type, which is commonly used for
>>>>> certain things, but "commonly" doesn't mean "always".
>>>>>
>>>>> The advantages of having a dedicated type are on 3 layers.
>>>>> The compact representation in the file, and compact representation in
>>>>> memory in the query engine are the ones mentioned above.
>>>>>
>>>>> The third layer is the usability. Seeing a UUID column i know what
>>>>> values i can expect, so it's more descriptive than `id char(36)`.
>>>>> This also means i can CREATE TABLE ... AS SELECT uuid(), .... without
>>>>> need for casting to varchar.
>>>>> It also removes temptation of casting uuid to varbinary to achieve
>>>>> compact representation.
>>>>>
>>>>> Thus i think it would be good to have them.
>>>>>
>>>>> Best
>>>>> PF
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <[email protected]> wrote:
>>>>>
>>>>>> The original reason why I added UUID to the spec was that I thought
>>>>>> there would be opportunities to take advantage of UUIDs as unique values
>>>>>> and to optimize the use of UUIDs. I was thinking about auto-increment ID
>>>>>> fields and how we might do something similar in Iceberg.
>>>>>>
>>>>>> The reason we have thought about removing UUID is that there aren't
>>>>>> as many opportunities to take advantage of UUIDs as I thought. My
>>>>>> original assumption was that we could do things like bucket on UUID
>>>>>> fields or assume that a UUID field has a high NDV. But that's not
>>>>>> necessarily the case when a UUID field is a foreign key, only when it
>>>>>> is used as an identifier or primary key. Before Jack added tracking
>>>>>> for row identifier fields, we couldn't know that a UUID was unique in
>>>>>> a table. As a result, we didn't invest in support for UUID.
>>>>>>
>>>>>> Quick aside: Now that row identifier fields are tracked, we can do
>>>>>> some of these things with the row identifier fields. Engines can assume
>>>>>> that the tuple of row identifier fields is unique in a table for join
>>>>>> estimation. And engines can use row identifier fields in sort keys to
>>>>>> ensure lots of partition split locations (this is really important for
>>>>>> Spark).
>>>>>>
>>>>>> Coming back to UUIDs, the second reason to have a UUID type is still
>>>>>> valid: it is better to represent UUIDs as fixed[16] than as 36-byte UTF-8
>>>>>> strings that are more than twice as large, or even worse UTF-16 strings
>>>>>> that are 4x as large. Since UUIDs are likely to be used in joins, this
>>>>>> could really help engines as long as they can keep the values as
>>>>>> fixed-width binary.
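
To put numbers on the representations compared above, a small Python sketch. It measures one sample UUID (hypothetical value) as a 16-byte value, as UTF-8 text, and as UTF-16 text without a byte-order mark; how a given engine actually lays out strings in memory is not covered here.

```python
import uuid

text = "123e4567-e89b-12d3-a456-426614174000"  # sample value

sizes = {
    "fixed[16]": len(uuid.UUID(text).bytes),
    "utf-8": len(text.encode("utf-8")),
    # "utf-16-le" omits the byte-order mark, giving 2 bytes per character here.
    "utf-16": len(text.encode("utf-16-le")),
}

# Size of each form relative to the 16-byte binary value.
ratios = {name: size / sizes["fixed[16]"] for name, size in sizes.items()}

for name in sizes:
    print(f"{name}: {sizes[name]} bytes, {ratios[name]:.2f}x")
```

For join keys this is the per-value cost before any hashing or comparison work, which is why keeping the value as fixed-width binary matters most in engines that can do so.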
>>>>>>
>>>>>> I could go either way on this. I think it is valuable to have a
>>>>>> compact representation for UUIDs rather than using the string
>>>>>> representation. But that will require investing in the type and building
>>>>>> support in engines that won't take advantage of it. If Trino can use
>>>>>> this, I think it may be worth keeping and investing in.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <[email protected]> wrote:
>>>>>>
>>>>>>> Yes I agree with Jacques that fixed binary is what it is in the end.
>>>>>>> I think it is more about user experience, whether the conversion is
>>>>>>> done at the user side or the Iceberg and engine side. Many people just
>>>>>>> store UUID as a 36-byte string instead of a 16-byte binary, so with an
>>>>>>> explicit UUID type, Iceberg can optimize this common use case
>>>>>>> internally for users. There might be some other benefits I overlooked,
>>>>>>> but maybe the complication introduced by this type does not really
>>>>>>> justify the slightly better user experience. I am also on the fence
>>>>>>> about it.
>>>>>>>
>>>>>>> -Jack Ye
>>>>>>>
>>>>>>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> What specific arguments are there for it being a first class type
>>>>>>>> besides it is elsewhere? Is there some kind of optimization iceberg or
>>>>>>>> an engine could do if it was typed versus just a bucket of bits? Fixed
>>>>>>>> width binary seems to cover the cases I see in terms of actual
>>>>>>>> functionality in the iceberg libraries or engines…
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> One conversation I used to come across regarding UUID deprecation
>>>>>>>>> was from https://github.com/apache/iceberg/pull/1611
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yan
>>>>>>>>>
>>>>>>>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Joshua,
>>>>>>>>>>
>>>>>>>>>> I do not have a strong preference about the UUID type, but I
>>>>>>>>>> would like to highlight that the type is handled inconsistently in
>>>>>>>>>> Iceberg with different file formats. (See:
>>>>>>>>>> https://github.com/apache/iceberg/issues/1881)
>>>>>>>>>>
>>>>>>>>>> If we keep the type, it would be good to standardize the handling
>>>>>>>>>> in every file format.
>>>>>>>>>>
>>>>>>>>>> Thanks, Peter
>>>>>>>>>>
>>>>>>>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi.
>>>>>>>>>>>
>>>>>>>>>>> UUID is a current data type according to the Iceberg spec (
>>>>>>>>>>> https://iceberg.apache.org/spec/#primitive-types), but there
>>>>>>>>>>> seems to have been some discussion about removing it? I could not
>>>>>>>>>>> find the original discussion, but a reference to the discussion can
>>>>>>>>>>> be found here (
>>>>>>>>>>> https://github.com/trinodb/trino/issues/6663).
>>>>>>>>>>>
>>>>>>>>>>> I generally agree with the consensus in the Trino issue to keep
>>>>>>>>>>> UUID in Iceberg. To summarize…
>>>>>>>>>>>
>>>>>>>>>>> - It makes sense to keep the type now that row identifiers are
>>>>>>>>>>> supported
>>>>>>>>>>> - Some engines (Trino) have support for the UUID type
>>>>>>>>>>> - Engines w/o support for UUID type can determine how to map
>>>>>>>>>>>
>>>>>>>>>>> Does anyone want to remove the type? If so, why?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
>
