Re: [DISCUSS] UUID type

2021-07-30 Thread parth brahmbhatt
I am personally against a UUID type that does not guarantee, at the spec
level, that values are unique across something. Even if the spec could
guarantee that, it feels like we are trying to define a type for what should
be a constraint. I would rather remove support for UUID and let the engines
do coercion when needed, and instead invest in actually adding a constraint
definition framework at the spec level so we can define constraints like
"Column x is unique at the partition level".

Thanks
Parth

On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau 
wrote:

> It seems like Spark, Hive, Dremio and Impala all lack UUID as a native
> type. Which engines are you thinking of that have a native UUID type
> besides the Presto derivatives and support Iceberg?
>
> I agree that Trino should expose a UUID type on top of Iceberg tables. All
> of the user-experience things that you describe as important (compact
> storage, friendly display, DDL, clean literals) are possible without it
> being a first-class type in Iceberg by using a Trino-specific property.
>
> I don't really have a strong opinion about UUID. In general, type bloat is
> probably just a part of this kind of project. Generally, CHAR(X) and
> VARCHAR(X) feel like much bigger concerns given that they exist in all of
> the engines but not Iceberg--especially when we start talking about views.
>
> Some of this argues for physical vs logical type abstraction. (Something
> that was always challenging in Parquet but also helped to resolve how these
> types are managed in engines that don't support them.)
>
> thanks,
> Jacques
>
> PS: Funny aside, the bloat on an IP address is actually worse than for a
> UUID, right? IPv4 = 4 bytes, IPv4 string = 15 bytes, so (15 - 4) / 4 =>
> 275% bloat. For UUID, (36 - 16) / 16 => 125% bloat.
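As a quick check of that arithmetic, here is an illustrative Java snippet
(the class name BloatCheck and the sample literals are made up for the
example); "bloat" is counted as extra bytes relative to the binary encoding:

    public class BloatCheck {
        public static void main(String[] args) {
            int ipv4Binary = 4;                                                // IPv4 is 4 bytes in binary
            int ipv4String = "255.255.255.255".length();                       // worst-case string form: 15 chars
            int uuidBinary = 16;                                               // UUID is 16 bytes in binary
            int uuidString = "123e4567-e89b-12d3-a456-426614174000".length();  // canonical string form: 36 chars
            System.out.println(100 * (ipv4String - ipv4Binary) / ipv4Binary);  // prints 275
            System.out.println(100 * (uuidString - uuidBinary) / uuidBinary);  // prints 125
        }
    }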
>
> On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue  wrote:
>
>> I don't think this is just a problem in Trino.
>>
>> If there is no UUID type, then a user must choose between a 36-byte
>> string and a 16-byte binary. That's not a good choice to force people into.
>> If someone chooses binary, then it's harder to work with rows and construct
>> queries even though there is a standard representation for UUIDs. To avoid
>> the user headache, people will probably choose to store values as strings.
>> Using a string would also mean that, by default, Iceberg's lower/upper
>> bounds needlessly discard more than half of each value instead of keeping
>> the entire value. And since engines don't know what's in the string, the
>> full value must be used in comparisons, which is extra work and extra space.
>>
>> Inflated values may not be a problem in some cases. IPv4 addresses are
>> one case where you could argue that it doesn't matter very much that they
>> are typically stored as strings. But I expect UUIDs to be common in ID
>> columns because they can be generated without coordination (unlike an
>> incrementing ID), and that's a concern because columns used as IDs are
>> likely to be join keys.
>>
>> If we want the values to be stored as 16-byte fixed, then we need to make
>> it easy to get the expected string representation in and out, just like we
>> do with date/time types. I don't think that's specific to any engine.
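For reference, here is a minimal sketch of moving between the 36-character
canonical string and a 16-byte fixed value using plain java.util.UUID; it is
only an illustration (the class and method names are made up), not an Iceberg
or engine API:

    import java.nio.ByteBuffer;
    import java.util.UUID;

    public class UuidBytes {
        // Pack a UUID into its 16-byte big-endian representation.
        static byte[] toBytes(UUID uuid) {
            ByteBuffer buf = ByteBuffer.allocate(16);
            buf.putLong(uuid.getMostSignificantBits());
            buf.putLong(uuid.getLeastSignificantBits());
            return buf.array();
        }

        // Rebuild the UUID from the 16 bytes.
        static UUID fromBytes(byte[] bytes) {
            ByteBuffer buf = ByteBuffer.wrap(bytes);
            return new UUID(buf.getLong(), buf.getLong());
        }

        public static void main(String[] args) {
            UUID id = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
            byte[] fixed = toBytes(id);               // 16 bytes, suitable for a fixed-width column
            System.out.println(fixed.length);         // 16
            System.out.println(fromBytes(fixed));     // back to the 36-character canonical form
        }
    }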
>>
>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau 
>> wrote:
>>
>>> I think points 1 & 2 don't really apply since a fixed-width binary already
>>> covers those properties.
>>>
>>> It seems like this isn't really a concern of Iceberg but rather a
>>> cosmetic layer that exists primarily (only?) in Trino. In that case I would
>>> be inclined to say that Trino should just use custom metadata and a fixed
>>> binary type. That way you still have the desired UX without exposing those
>>> extra concepts to Iceberg. It actually feels like better encapsulation,
>>> IMO.
>>>
>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen 
>>> wrote:
>>>
 Hi,

 I agree with Ryan that it takes some precautions before one can assume
 uniqueness of UUID values, and that this shouldn't be anything special to
 UUIDs at all.
 After all, this is just a primitive type, one that is commonly used for
 certain things, but "commonly" doesn't mean "always".

 The advantages of having a dedicated type are on three layers.
 The compact representation in the file and the compact representation in
 memory in the query engine are the two mentioned above.

 The third layer is usability. Seeing a UUID column, I know what
 values to expect, so it's more descriptive than `id char(36)`.
 It also means I can CREATE TABLE ... AS SELECT uuid(), without
 needing to cast to varchar.
 It also removes the temptation to cast uuid to varbinary to achieve
 a compact representation.

 Thus I think it would be good to have them.

 Best
 PF



 On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue  wrote:

> The original reason why I added UUID to the spec was that I thought
> there would be opportunities to take advantage of 

Re: [VOTE] Graduate to a top-level project

2020-05-12 Thread parth brahmbhatt
+1

On Tue, May 12, 2020 at 3:31 PM Anton Okolnychyi
 wrote:

> +1 for graduation
>
> On 12 May 2020, at 15:30, Ryan Blue  wrote:
>
> +1
>
> Jacques, I agree. I'll make sure to let you know about the IPMC vote
> because we'd love to have your support there as well.
>
> On Tue, May 12, 2020 at 3:02 PM Jacques Nadeau  wrote:
>
>> I'm +1.
>>
>> (I think that is non-binding here but binding at the incubator level)
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>>
>> On Tue, May 12, 2020 at 2:35 PM Romin Parekh 
>> wrote:
>>
>>> +1
>>>
>>> On Tue, May 12, 2020 at 2:32 PM Owen O'Malley 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On Tue, May 12, 2020 at 2:16 PM Ryan Blue  wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I propose that the Iceberg community should petition to graduate from
>>>>> the Apache Incubator to a top-level project.
>>>>>
>>>>> Here is the draft board resolution:
>>>>>
>>>>> Establish the Apache Iceberg Project
>>>>>
>>>>> WHEREAS, the Board of Directors deems it to be in the best interests of
>>>>> the Foundation and consistent with the Foundation's purpose to establish
>>>>> a Project Management Committee charged with the creation and maintenance
>>>>> of open-source software, for distribution at no charge to the public,
>>>>> related to managing huge analytic datasets using a standard at-rest
>>>>> table format that is designed for high performance and ease of use.
>>>>>
>>>>> NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee
>>>>> (PMC), to be known as the "Apache Iceberg Project", be and hereby is
>>>>> established pursuant to Bylaws of the Foundation; and be it further
>>>>>
>>>>> RESOLVED, that the Apache Iceberg Project be and hereby is responsible
>>>>> for the creation and maintenance of software related to managing huge
>>>>> analytic datasets using a standard at-rest table format that is designed
>>>>> for high performance and ease of use; and be it further
>>>>>
>>>>> RESOLVED, that the office of "Vice President, Apache Iceberg" be and
>>>>> hereby is created, the person holding such office to serve at the
>>>>> direction of the Board of Directors as the chair of the Apache Iceberg
>>>>> Project, and to have primary responsibility for management of the
>>>>> projects within the scope of responsibility of the Apache Iceberg
>>>>> Project; and be it further
>>>>>
>>>>> RESOLVED, that the persons listed immediately below be and hereby are
>>>>> appointed to serve as the initial members of the Apache Iceberg Project:
>>>>>
>>>>>  * Anton Okolnychyi 
>>>>>  * Carl Steinbach   
>>>>>  * Daniel C. Weeks  
>>>>>  * James R. Taylor  
>>>>>  * Julien Le Dem
>>>>>  * Owen O'Malley
>>>>>  * Parth Brahmbhatt 
>>>>>  * Ratandeep Ratti  
>>>>>  * Ryan Blue
>>>>>
>>>>> NOW, THEREFORE, BE IT FURTHER RESOLVED, that Ryan Blue be appointed to
>>>>> the office of Vice President, Apache Iceberg, to serve in accordance
>>>>> with and subject to the direction of the Board of Directors and the
>>>>> Bylaws of the Foundation until death, resignation, retirement, removal
>>>>> or disqualification, or until a successor is appointed; and be it
>>>>> further
>>>>>
>>>>> RESOLVED, that the Apache Iceberg Project be and hereby is tasked with
>>>>> the migration and rationalization of the Apache Incubator Iceberg
>>>>> podling; and be it further
>>>>>
>>>>> RESOLVED, that all responsibilities pertaining to the Apache Incubator
>>>>> Iceberg podling encumbered upon the Apache Incubator PMC are hereafter
>>>>> discharged.
>>>>>
>>>>> Please vote in the next 72 hours.
>>>>>
>>>>> [ ] +1 Petition the IPMC to graduate to top-level project
>>>>> [ ] +0
>>>>> [ ] -1 Wait to graduate because . . .
>>>>> --
>>>>> Ryan Blue
>>>>>
>>>>
>>>
>>> --
>>> Thanks,
>>> Romin
>>>
>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>


Re: [VOTE] Release Apache Iceberg 0.8.0-incubating RC1

2020-04-29 Thread parth brahmbhatt
+1 once we apply the patch.

The patch works with Presto, but it breaks the Iceberg build. We need to add
a cast to FileIO in TableMetadataParserTest here
<https://github.com/Parth-Brahmbhatt/incubator-iceberg/blob/presto-patch/core/src/test/java/org/apache/iceberg/TableMetadataParserTest.java#L78>
.

Thanks
Parth



On Wed, Apr 29, 2020 at 2:19 PM Ryan Blue  wrote:

> Thanks, James. Is it possible for you to test with this patch?
> https://patch-diff.githubusercontent.com/raw/apache/incubator-iceberg/pull/986.patch
>
> Those methods were removed, and it should be easy for Presto to update.
> That said, I don't like breaking downstream projects so it would be nice to
> fix it. I think this patch might.
>
> On Wed, Apr 29, 2020 at 1:12 PM James Taylor 
> wrote:
>
>> Verified signature, checksum, RAT checks, build, and unit tests. All
>> looked good. Tried building the latest prestosql master with the proposed
>> new Iceberg release, but received these compilation errors:
>>
>> [ERROR] Failed to execute goal
>> org.apache.maven.plugins:maven-compiler-plugin:3.8.0:compile
>> (default-compile) on project presto-iceberg: Compilation failure:
>> Compilation failure:
>> [ERROR]
>> /Users/jamestaylor/dev/presto/presto-iceberg/src/main/java/io/prestosql/plugin/iceberg/IcebergMetadata.java:[359,34]
>> method newTableMetadata in class org.apache.iceberg.TableMetadata cannot be
>> applied to given types;
>> [ERROR]   required:
>> org.apache.iceberg.Schema,org.apache.iceberg.PartitionSpec,java.lang.String,java.util.Map
>> [ERROR]   found:
>> org.apache.iceberg.TableOperations,org.apache.iceberg.Schema,org.apache.iceberg.PartitionSpec,java.lang.String,com.google.common.collect.ImmutableMap
>> [ERROR]   reason: actual and formal argument lists differ in length
>> [ERROR]
>> /Users/jamestaylor/dev/presto/presto-iceberg/src/main/java/io/prestosql/plugin/iceberg/HiveTableOperations.java:[301,44]
>> no suitable method found for
>> read(io.prestosql.plugin.iceberg.HiveTableOperations,org.apache.iceberg.io.InputFile)
>> [ERROR] method
>> org.apache.iceberg.TableMetadataParser.read(org.apache.iceberg.io.FileIO,java.lang.String)
>> is not applicable
>> [ERROR]   (argument mismatch;
>> io.prestosql.plugin.iceberg.HiveTableOperations cannot be converted to
>> org.apache.iceberg.io.FileIO)
>> [ERROR] method
>> org.apache.iceberg.TableMetadataParser.read(org.apache.iceberg.io.FileIO,org.apache.iceberg.io.InputFile)
>> is not applicable
>> [ERROR]   (argument mismatch;
>> io.prestosql.plugin.iceberg.HiveTableOperations cannot be converted to
>> org.apache.iceberg.io.FileIO)
>>
>> Is this a showstopper? Probably a simple PR in Presto would fix it, but
>> it's unclear whether backward compatibility is a goal or not.
>>
>> Thanks,
>> James
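For context, the errors above say that TableMetadataParser.read now takes a
FileIO (plus a path or InputFile) instead of a TableOperations. A hedged
sketch of how a caller might adapt, assuming the operations object exposes
its FileIO via io() as org.apache.iceberg.TableOperations does (the actual
Presto change may differ):

    import org.apache.iceberg.TableMetadata;
    import org.apache.iceberg.TableMetadataParser;
    import org.apache.iceberg.TableOperations;
    import org.apache.iceberg.io.FileIO;
    import org.apache.iceberg.io.InputFile;

    public class ReadMetadataExample {
        static TableMetadata readMetadata(TableOperations ops, InputFile metadataFile) {
            FileIO io = ops.io();                               // was: passing ops directly
            return TableMetadataParser.read(io, metadataFile);  // matches read(FileIO, InputFile)
        }
    }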
>>
>> On Tue, Apr 28, 2020 at 1:40 PM Ryan Blue  wrote:
>>
>>> Hi everyone,
>>>
>>> I propose the following RC to be released as official Apache Iceberg
>>> 0.8.0-incubating release.
>>>
>>> The commit id is 4c2dd0ac2c832cc425b33d3b578025fa4e295392
>>> * This corresponds to the tag: apache-iceberg-0.8.0-incubating-rc1
>>> *
>>> https://github.com/apache/incubator-iceberg/commits/apache-iceberg-0.8.0-incubating-rc1
>>> * https://github.com/apache/incubator-iceberg/tree/4c2dd0ac
>>>
>>> The release tarball, signature, and checksums are here:
>>> *
>>> https://dist.apache.org/repos/dist/dev/incubator/iceberg/apache-iceberg-0.8.0-incubating-rc1/
>>>
>>> You can find the KEYS file here:
>>> * https://dist.apache.org/repos/dist/dev/incubator/iceberg/KEYS
>>>
>>> Convenience binary artifacts are staged in Nexus. The Maven repository
>>> URL is:
>>> *
>>> https://repository.apache.org/content/repositories/orgapacheiceberg-1005/
>>>
>>> This release contains many bug fixes and several new features:
>>> * Actions to remove orphaned files and to optimize metadata for query
>>> performance
>>> * Support for ORC data files
>>> * Snapshot cherry-picking
>>> * Incremental scan planning based on table history
>>> * In and notIn expressions
>>> * An InputFormat for writing MR jobs
>>>
>>> Please download, verify, and test.
>>>
>>> Please vote in the next 72 hours.
>>>
>>> [ ] +1 Release this as Apache Iceberg 0.8.0-incubating
>>> [ ] +0
>>> [ ] -1 Do not release this because...
>>>
>>>
>>> --
>>> Ryan Blue
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [VOTE] Release Apache Iceberg 0.7.0-incubating RC4

2019-10-21 Thread parth brahmbhatt
+1 (binding)

All checks passed and the Presto smoke tests passed as well.


On Mon, Oct 21, 2019 at 3:13 PM Daniel Weeks  wrote:

> +1
>
> Verified sig, sum, license, build, and tests
>
> On Mon, Oct 21, 2019 at 2:14 PM Ryan Blue 
> wrote:
>
>> +1 (binding)
>>
>> Ran release checks, validated metadata tables, time-travel queries, and
>> SQL in 2.4.4 spark-shell with iceberg-spark-runtime.
>>
>> On Mon, Oct 21, 2019 at 11:35 AM Ted Gooch 
>> wrote:
>>
>>> Looks like Anton captured it:
>>> https://github.com/apache/incubator-iceberg/issues/568
>>>
>>> On Mon, Oct 21, 2019 at 11:31 AM Bowen Li  wrote:
>>>
 Thanks, Ted. I just subscribed to the dev mailing list and didn't get that.

 If this is the officially desired process to validate an RC, maybe it can
 be part of a new "developer" section on Iceberg's website or wiki?


 On Mon, Oct 21, 2019 at 11:00 AM Ted Gooch 
 wrote:

> Ryan sent out this for RC1:
>
> http://mail-archives.apache.org/mod_mbox/iceberg-dev/201910.mbox/%3ccao4re1kcsrlf1azwq0efshamme+xxdb275_z9wvlv1hdzsy...@mail.gmail.com%3e
>
> On Mon, Oct 21, 2019 at 10:55 AM Bowen Li  wrote:
>
>> Hi everyone,
>>
>> I'm new to the Iceberg community. People are mentioning the "7 steps
>> of release validation" above; I wonder what they mean in the Iceberg
>> community (I've found that each Apache community handles release
>> validation differently). I couldn't find them anywhere in the Iceberg
>> repo or on the website. Can anyone help point them out?
>>
>> Thanks,
>> Bowen
>>
>> On Mon, Oct 21, 2019 at 2:45 AM Thippana Vamsi Kalyan <
>> va...@dremio.com> wrote:
>>
>>> +1
>>> (downloaded, verified license, verified checksum, verified
>>> signature, built it, ran unit tests, manually tested using Spark, and
>>> built a test application successfully using the Maven repository)
>>>
>>> On Mon, Oct 21, 2019 at 9:00 AM David Christle <
>>> dchris...@linkedin.com> wrote:
>>>
 +1

 Successfully ran all seven steps.

 -David

 On 10/20/19, 7:08 PM, "陈俊杰"  wrote:

 +1

 Ran all 7 steps successfully.

 Anton Okolnychyi 
 wrote on Sunday, October 20, 2019 at 11:24 PM:
 >
 > +1
 >
 > All 7 steps were executed successfully.
 >
 > - Anton
 >
 > On 19 Oct 2019, at 20:42, John Zhuge 
 wrote:
 >
 > +1
 >
 > - Passed all 7 steps of release validation
 > - Integrated into downstream Spark 2.3 and 2.1 branches and
 passed integration tests
 >
 > On Fri, Oct 18, 2019 at 5:14 PM Ryan Blue 
 wrote:
 >>
 >> Hi everyone,
 >>
 >> I propose the following RC to be released as official Apache
 Iceberg 0.7.0-incubating release.
 >>
 >> The commit id is 9c81babac65351f7aa21dd878f01c5c81ae304af
 >> * This corresponds to the tag:
 apache-iceberg-0.7.0-incubating-rc4
 >> *
 https://github.com/apache/incubator-iceberg/tree/apache-iceberg-0.7.0-incubating-rc4
 >> *
 https://github.com/apache/incubator-iceberg/tree/9c81babac65351f7aa21dd878f01c5c81ae304af
 >>
 >> The release tarball, signature, and checksums are here:
 >> *
 https://dist.apache.org/repos/dist/dev/incubator/iceberg/apache-iceberg-0.7.0-incubating-rc4/
 >>
 >> You can find the KEYS file here:
 >> *
 https://dist.apache.org/repos/dist/dev/incubator/iceberg/KEYS
 >>
 >> Convenience binary artifacts are staged in Nexus. The Maven
 repository