Re: [DISCUSS] UUID type

parth brahmbhatt Fri, 30 Jul 2021 00:09:44 -0700

I am personally against a UUID type that does not guarantee, at the spec
level, that its values are unique within some scope. Even if the spec could
guarantee that, it feels like we are trying to define a type for what should
be a constraint.
I would rather remove support for UUID and let the engines do coercion when
needed but invest in actually adding a constraint definition framework at
spec level so we can define constraints like "Column x is unique at
partition level".


Thanks
Parth

On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau <[email protected]>
wrote:

> It seems like Spark, Hive, Dremio and Impala all lack UUID as a native
> type. Which engines are you thinking of that have a native UUID type
> besides the Presto derivatives and support Iceberg?
>
> I agree that Trino should expose a UUID type on top of Iceberg tables. All
> the user experience things that you are describing as important (compact
> storage, friendly display, DDL, clean literals) are possible without it
> being a first-class type in Iceberg, by using a Trino-specific property.
>
> I don't really have a strong opinion about UUID. In general, type bloat is
> probably just a part of this kind of project. Generally, CHAR(X) and
> VARCHAR(X) feel like much bigger concerns given that they exist in all of
> the engines but not Iceberg--especially when we start talking about views.
>
> Some of this argues for physical vs logical type abstraction. (Something
> that was always challenging in Parquet but also helped to resolve how these
> types are managed in engines that don't support them.)
>
> thanks,
> Jacques
>
> PS: Funny aside, the bloat on an ip address is actually worse than a UUID,
> right? IPv4 = 4 bytes. IPv4 String = 15 bytes.... 15/4 => 275% bloat. UUID
> 36/16 => 125% bloat.
>
> On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue <[email protected]> wrote:
>
>> I don't think this is just a problem in Trino.
>>
>> If there is no UUID type, then a user must choose between a 36-byte
>> string and a 16-byte binary. That's not a good choice to force people into.
>> If someone chooses binary, then it's harder to work with rows and construct
>> queries even though there is a standard representation for UUIDs. To avoid
>> the user headache, people will probably choose to store values as strings.
>> Using a string would mean that more than half the value is needlessly
>> discarded by default in Iceberg lower/upper bounds instead of keeping the
>> entire value. And since engines don't know what's in the string, the full
>> value must be used in comparison, which is extra work and extra space.
>>
>> Inflated values may not be a problem in some cases. IPv4 addresses are
>> one case where you could argue that it doesn't matter very much that they
>> are typically stored as strings. But I expect the use of UUIDs to be common
>> for ID columns because you can generate them without coordination (unlike
>> an incrementing ID) and that's a concern because the use as an ID makes
>> them likely to be join keys.
>>
>> If we want the values to be stored as 16-byte fixed, then we need to make
>> it easy to get the expected string representation in and out, just like we
>> do with date/time types. I don't think that's specific to any engine.
>>
>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau <[email protected]>
>> wrote:
>>
>>> I think points 1&2 don't really apply since a fixed width binary already
>>> covers those properties.
>>>
>>> It seems like this isn't really a concern of Iceberg but rather a
>>> cosmetic layer that exists primarily (only?) in Trino. In that case I would
>>> be inclined to say that Trino should just use custom metadata and a fixed
>>> binary type. That way you still have the desired UX without exposing those
>>> extra concepts to Iceberg. It actually feels like better encapsulation
>>> imo.
>>>
>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I agree with Ryan that it takes some precautions before one can assume
>>>> uniqueness of UUID values, and that this is not specific to UUIDs
>>>> at all.
>>>> After all, this is just a primitive type, which is commonly used for
>>>> certain things, but "commonly" doesn't mean "always".
>>>>
>>>> The advantages of having a dedicated type are on 3 layers.
>>>> The compact representation in the file, and compact representation in
>>>> memory in the query engine are the ones mentioned above.
>>>>
>>>> The third layer is usability. Seeing a UUID column, I know what
>>>> values I can expect, so it's more descriptive than `id char(36)`.
>>>> This also means I can CREATE TABLE ... AS SELECT uuid(), .... without
>>>> needing to cast to varchar.
>>>> It also removes the temptation of casting uuid to varbinary to achieve
>>>> a compact representation.
>>>>
>>>> Thus i think it would be good to have them.
>>>>
>>>> Best
>>>> PF
>>>>
>>>>
>>>>
>>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <[email protected]> wrote:
>>>>
>>>>> The original reason why I added UUID to the spec was that I thought
>>>>> there would be opportunities to take advantage of UUIDs as unique values
>>>>> and to optimize the use of UUIDs. I was thinking about auto-increment ID
>>>>> fields and how we might do something similar in Iceberg.
>>>>>
>>>>> The reason we have thought about removing UUID is that there aren't as
>>>>> many opportunities to take advantage of UUIDs as I thought. My original
>>>>> assumption was that we could do things like bucket on UUID fields or
>>>>> assume that a UUID field has a high NDV. But that's not necessarily the
>>>>> case when a UUID field is a foreign key, only when it is used as an
>>>>> identifier
>>>>> or primary key. Before Jack added tracking for row identifier fields, we
>>>>> couldn't know that a UUID was unique in a table. As a result, we didn't
>>>>> invest in support for UUID.
>>>>>
>>>>> Quick aside: Now that row identifier fields are tracked, we can do
>>>>> some of these things with the row identifier fields. Engines can assume
>>>>> that the tuple of row identifier fields is unique in a table for join
>>>>> estimation. And engines can use row identifier fields in sort keys to
>>>>> ensure lots of partition split locations (this is really important for
>>>>> Spark).
>>>>>
>>>>> Coming back to UUIDs, the second reason to have a UUID type is still
>>>>> valid: it is better to represent UUIDs as fixed[16] than as 36-byte UTF-8
>>>>> strings that are more than twice as large, or even worse UTF-16 strings
>>>>> that are more than 4x as large. Since UUIDs are likely to be used in joins, this
>>>>> could really help engines as long as they can keep the values as
>>>>> fixed-width binary.
>>>>>
>>>>> I could go either way on this. I think it is valuable to have a
>>>>> compact representation for UUIDs rather than using the string
>>>>> representation. But that will require investing in the type and building
>>>>> support in engines that won't take advantage of it. If Trino can use this,
>>>>> I think it may be worth keeping and investing in.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <[email protected]> wrote:
>>>>>
>>>>>> Yes I agree with Jacques that fixed binary is what it is in the end.
>>>>>> I think it is more about user experience, whether the conversion is done
>>>>>> at the user side or on the Iceberg and engine side. Many people just store
>>>>>> UUIDs as a 36-byte string instead of a 16-byte binary, so with an explicit
>>>>>> UUID type, Iceberg can optimize this common use case internally for users.
>>>>>> There might be some other benefits I overlooked, but maybe the complication
>>>>>> introduced by this type does not really justify the slightly better user
>>>>>> experience. I am also on the fence about it.
>>>>>>
>>>>>> -Jack Ye
>>>>>>
>>>>>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> What specific arguments are there for it being a first class type
>>>>>>> besides it is elsewhere? Is there some kind of optimization iceberg or
>>>>>>> an engine could do if it was typed versus just a bucket of bits? Fixed
>>>>>>> width binary seems to cover the cases I see in terms of actual
>>>>>>> functionality in the iceberg libraries or engines…
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <[email protected]> wrote:
>>>>>>>
>>>>>>>> One conversation I came across regarding UUID deprecation
>>>>>>>> was in https://github.com/apache/iceberg/pull/1611
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Yan
>>>>>>>>
>>>>>>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Joshua,
>>>>>>>>>
>>>>>>>>> I do not have a strong preference about the UUID type, but I would
>>>>>>>>> like to highlight that the type is handled inconsistently in Iceberg
>>>>>>>>> across different file formats. (See:
>>>>>>>>> https://github.com/apache/iceberg/issues/1881)
>>>>>>>>>
>>>>>>>>> If we keep the type, it would be good to standardize the handling
>>>>>>>>> in every file format.
>>>>>>>>>
>>>>>>>>> Thanks, Peter
>>>>>>>>>
>>>>>>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi.
>>>>>>>>>>
>>>>>>>>>> UUID is a current data type according to the Iceberg spec (
>>>>>>>>>> https://iceberg.apache.org/spec/#primitive-types), but there
>>>>>>>>>> seems to have been some discussion about removing it? I could not
>>>>>>>>>> find the original discussion, but a reference to the discussion can
>>>>>>>>>> be found here (https://github.com/trinodb/trino/issues/6663).
>>>>>>>>>>
>>>>>>>>>> I generally agree with the consensus in the Trino issue to keep
>>>>>>>>>> UUID in Iceberg. To summarize…
>>>>>>>>>>
>>>>>>>>>> - It makes sense to keep the type now that row identifiers are
>>>>>>>>>> supported
>>>>>>>>>> - Some engines (Trino) have support for the UUID type
>>>>>>>>>> - Engines w/o support for UUID type can determine how to map
>>>>>>>>>>
>>>>>>>>>> Does anyone want to remove the type? If so, why?
>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
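
For anyone who wants to check the size figures quoted in this thread (16-byte binary vs. 36-byte string for a UUID, 4 bytes vs. up to 15 for an IPv4 address), here is a minimal sketch using only the Python standard library. The helper names (bloat_percent, uuid_sizes, ipv4_sizes) and the sample UUID are made up for illustration and are not part of Iceberg or any engine; "bloat" is taken to mean extra bytes relative to the compact form, which seems to be how it is used above.

```python
import ipaddress
import uuid


def bloat_percent(encoded_size: int, compact_size: int) -> float:
    """Extra space used by an encoding, as a percentage of the compact size."""
    return (encoded_size - compact_size) / compact_size * 100


def uuid_sizes(value: uuid.UUID) -> dict:
    """Byte sizes of one UUID as fixed-width binary and as encoded text."""
    text = str(value)  # canonical 8-4-4-4-12 hex form
    return {
        "binary": len(value.bytes),              # what fixed[16] would store
        "utf8": len(text.encode("utf-8")),
        "utf16": len(text.encode("utf-16-le")),  # "-le" avoids a byte-order mark
    }


def ipv4_sizes(address: str) -> dict:
    """Byte sizes of one IPv4 address as packed binary and as dotted text."""
    packed = ipaddress.IPv4Address(address).packed
    return {"binary": len(packed), "utf8": len(address.encode("utf-8"))}


if __name__ == "__main__":
    # Hypothetical sample value, chosen only for illustration.
    sample = uuid.UUID("12345678-1234-5678-1234-567812345678")
    sizes = uuid_sizes(sample)
    print("UUID sizes:", sizes)
    print("UUID string bloat %:", bloat_percent(sizes["utf8"], sizes["binary"]))

    # Longest possible dotted-quad form of an IPv4 address.
    ip = ipv4_sizes("255.255.255.255")
    print("IPv4 sizes:", ip)
    print("IPv4 string bloat %:", bloat_percent(ip["utf8"], ip["binary"]))

    # The binary form loses nothing: it round-trips to the same UUID.
    assert uuid.UUID(bytes=sample.bytes) == sample
```

The round-trip check at the end is the part relevant to the "string representation in and out" point: the compact form can always be rendered back to the standard text form, so the choice is about storage and comparison cost rather than information.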
