Re: [DISCUSS][BYLAWS] Moving forward on the bylaws

2024-07-19 Thread Owen O'Malley
I meant specifically the discussion of the standard roles (e.g. users,
committers, PMC, PMC chair) that are well covered in
https://www.apache.org/foundation/how-it-works/#roles

.. Owen

On Fri, Jul 19, 2024 at 10:43 AM Jack Ye  wrote:

> Thank you Owen for moving this forward, we heard you were sick, hope you
> are fully recovered now!
>
> One point regarding "referring to the Apache documentation": I am totally
> for that, but during the initial investigation, I found out that the Apache
> documentation is scattered around and also contains conflicting information.
>
> For example, regarding a "vote for committer or PMC member":
> - this new committer doc [1] writes that "A positive result is achieved by
> Consensus Approval: at least 3 +1 votes and no vetoes."
> - the Apache voting process doc [2] writes that "Votes on procedural
> issues follow the common format of majority rule unless otherwise stated.",
> and when I consulted a few Apache members, most of them considered voting
> for a committer or PMC member a procedural issue.
>
> Similar situations were found for other topics like description of roles
> and responsibilities, code modification, etc.
>
> I think it is a great chance for the ASF in general to consolidate this
> information, especially for matters that have common guidelines in ASF that
> should be adhered to by all projects. With that, we can figure out what to
> put in the Iceberg specific bylaws, either to directly refer to ASF
> official information, or to add additional information and guidelines.
>
> Regarding sub-projects, the main reason I proposed it at the beginning was
> to allow a proper definition of release manager responsibility, since each
> sub-project is released independently. It was not intended to be tied to
> committer responsibilities.
>
> Best,
> Jack Ye
>
> [1] https://community.apache.org/newcommitter.html
> [2] https://www.apache.org/foundation/voting
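
The conflict Jack describes between the two documents can be made concrete with a small sketch. This is illustrative only, not official ASF policy: the exact binding-vote requirements are precisely what the bylaws would need to pin down, and the `>= 3` floor in `majority_rule` is an assumed convention, not something the voting page mandates.

```python
def consensus_approval(votes):
    """New-committer doc [1]: at least 3 binding +1 votes and no vetoes."""
    return votes.count(+1) >= 3 and votes.count(-1) == 0

def majority_rule(votes):
    """Voting doc [2]: procedural issues pass by majority rule; here we
    also assume at least 3 binding +1s, a common bylaws convention."""
    return votes.count(+1) >= 3 and votes.count(+1) > votes.count(-1)

ballots = [+1, +1, +1, -1]           # three +1 votes and one -1
print(consensus_approval(ballots))   # False: the -1 acts as a veto
print(majority_rule(ballots))        # True: +1s outnumber -1s
```

The same ballot passing under one reading and failing under the other is exactly the kind of inconsistency described above.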
>
> On Fri, Jul 19, 2024 at 10:22 AM Owen O'Malley 
> wrote:
>
>> Everyone is welcome to vote. The Iceberg PMC will have the only binding
>> votes.
>>
>> .. Owen
>>
>> On Jul 19, 2024, at 10:19, Wing Yew Poon 
>> wrote:
>>
>> 
>> Hi Owen,
>> Thanks for doing this.
>> Once you have the questions and choices, who gets to vote on them?
>> - Wing Yew
>>
>>
>> On Fri, Jul 19, 2024 at 10:07 AM Owen O'Malley 
>> wrote:
>>
>>> All,
>>>Sorry for the long pause on bylaws discussion. It was a result of
>>> wanting to avoid the long US holiday week (July 4th) and my
>>> procrastination, which was furthered by a side conversation that asked me
>>> to consider how to move forward in an Apache way.
>>>   I'd like to thank Jack for moving this to this point. One concern that
>>> I had was there were lots of discussions and decisions that were being made
>>> off of our email lists, which isn't the way that Apache should work.
>>>   For finishing this off, I'd like to come up with a set of questions,
>>> each answered by multiple-choice options, and then use single
>>> transferable vote (STV) to resolve them. STV just means that each person
>>> lists their choices in ranked order, with a formal way to resolve how
>>> the votes are counted.
>>>   The questions that I have heard so far are:
>>>
>>>1. Should the PMC chair be term-limited and if so, what is the
>>>period? *In my experience, this isn't necessary in most projects and
>>>is often ignored. In Hadoop, Chris Douglas was a great chair and held it
>>>for 5 years in spite of the 1 year limit.*
>>>1. No term limit
>>>   2. 1 year
>>>   3. 2 years
>>>2. What should the minimum voting period be? *I'd suggest 3 days is
>>>far better as long as it isn't abused by holding important votes over
>>>holiday weekends.*
>>>1. 3 days (72 hours)
>>>   2. 7 days
>>>3. Should we keep the section on roles or just reference the Apache
>>>documentation <https://www.apache.org/foundation/how-it-works/#roles>.
>>>*I'd suggest that we reference the Apache documentation.*
>>>4. I'd like to include a couple sentences about the different hats
>>>at Apache and that votes should be for the benefit of the project and not
>>>our employers.
>>>5. I'd like to propose that we include text to formally include
>>>censure and potential removal for disclosing sensitive information from
>>>the private list.
>>>6. I'd like to propose branch committers. It has helped Hadoop a lot
>>>to enable people to work on development branches for large features
>>>before they are given general committership. It is better to have the
>>>branch work done at Apache and be visible than having large branches
>>>come in late in the project.
>>>7. Requirements for each topic (each could be consensus, lazy
>>>consensus, lazy majority, lazy 2/3's)
>>>1. Add committer
>>>   2. Remove committer
>>>   3. Add PMC
>>>   4. Remove PMC
>>>   5. Accept design proposal
>>>   6. Add subproject
>>>   7. Remove subproject
>>>   8. Release (can't be lazy consensus)
>>>   9. Modifying bylaws
>>>
>>> Thoughts? Missing questions?
>>>
>>> .. Owen


Re: [DISCUSS][BYLAWS] Moving forward on the bylaws

2024-07-19 Thread Owen O'Malley
One quick followup. I'd recommend against having sub-project or
specification committers. Part of being a committer is knowing what you do
and don't know. I've never seen a problem at Apache where a committer
committed something in an area where they had no expertise. Trying to
formalize those boundaries adds unnecessary organizational complexity.

.. Owen

On Fri, Jul 19, 2024 at 10:06 AM Owen O'Malley 
wrote:

> All,
>Sorry for the long pause on bylaws discussion. It was a result of
> wanting to avoid the long US holiday week (July 4th) and my
> procrastination, which was furthered by a side conversation that asked me
> to consider how to move forward in an Apache way.
>   I'd like to thank Jack for moving this to this point. One concern that I
> had was there were lots of discussions and decisions that were being made
> off of our email lists, which isn't the way that Apache should work.
>   For finishing this off, I'd like to come up with a set of questions, each
> answered by multiple-choice options, and then use single transferable vote
> (STV) to resolve them. STV just means that each person lists their choices
> in ranked order, with a formal way to resolve how the votes are counted.
>   The questions that I have heard so far are:
>
>1. Should the PMC chair be term-limited and if so, what is the period? *In
>my experience, this isn't necessary in most projects and is often ignored.
>In Hadoop, Chris Douglas was a great chair and held it for 5 years in spite
>of the 1 year limit.*
>1. No term limit
>   2. 1 year
>   3. 2 years
>2. What should the minimum voting period be? *I'd suggest 3 days is
>far better as long as it isn't abused by holding important votes over
>holiday weekends.*
>1. 3 days (72 hours)
>   2. 7 days
>3. Should we keep the section on roles or just reference the Apache
>documentation <https://www.apache.org/foundation/how-it-works/#roles>. *I'd
>suggest that we reference the Apache documentation.*
>4. I'd like to include a couple sentences about the different hats at
>Apache and that votes should be for the benefit of the project and not our
>employers.
>5. I'd like to propose that we include text to formally include censure
>and potential removal for disclosing sensitive information from the private
>list.
>6. I'd like to propose branch committers. It has helped Hadoop a lot
>to enable people to work on development branches for large features before
>they are given general committership. It is better to have the branch work
>done at Apache and be visible than having large branches come in late in
>the project.
>7. Requirements for each topic (each could be consensus, lazy
>consensus, lazy majority, lazy 2/3's)
>1. Add committer
>   2. Remove committer
>   3. Add PMC
>   4. Remove PMC
>   5. Accept design proposal
>   6. Add subproject
>   7. Remove subproject
>   8. Release (can't be lazy consensus)
>   9. Modifying bylaws
>
> Thoughts? Missing questions?
>
> .. Owen
>


[DISCUSS][BYLAWS] Moving forward on the bylaws

2024-07-19 Thread Owen O'Malley
All,
   Sorry for the long pause on bylaws discussion. It was a result of
wanting to avoid the long US holiday week (July 4th) and my
procrastination, which was furthered by a side conversation that asked me
to consider how to move forward in an Apache way.
  I'd like to thank Jack for moving this to this point. One concern that I
had was there were lots of discussions and decisions that were being made
off of our email lists, which isn't the way that Apache should work.
  For finishing this off, I'd like to come up with a set of questions, each
answered by multiple-choice options, and then use single transferable vote
(STV) to resolve them. STV just means that each person lists their choices
in ranked order, with a formal way to resolve how the votes are counted.
  The questions that I have heard so far are:

   1. Should the PMC chair be term-limited and if so, what is the period? *In
   my experience, this isn't necessary in most projects and is often ignored.
   In Hadoop, Chris Douglas was a great chair and held it for 5 years in spite
   of the 1 year limit.*
   1. No term limit
  2. 1 year
  3. 2 years
   2. What should the minimum voting period be? *I'd suggest 3 days is far
   better as long as it isn't abused by holding important votes over holiday
   weekends.*
   1. 3 days (72 hours)
  2. 7 days
   3. Should we keep the section on roles or just reference the Apache
   documentation <https://www.apache.org/foundation/how-it-works/#roles>. *I'd
   suggest that we reference the Apache documentation.*
   4. I'd like to include a couple sentences about the different hats at
   Apache and that votes should be for the benefit of the project and not our
   employers.
   5. I'd like to propose that we include text to formally include censure
   and potential removal for disclosing sensitive information from the private
   list.
   6. I'd like to propose branch committers. It has helped Hadoop a lot to
   enable people to work on development branches for large features before
   they are given general committership. It is better to have the branch work
   done at Apache and be visible than having large branches come in late in
   the project.
   7. Requirements for each topic (each could be consensus, lazy consensus,
   lazy majority, lazy 2/3's)
   1. Add committer
  2. Remove committer
  3. Add PMC
  4. Remove PMC
  5. Accept design proposal
  6. Add subproject
  7. Remove subproject
  8. Release (can't be lazy consensus)
  9. Modifying bylaws

Thoughts? Missing questions?

.. Owen
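
Owen's STV mechanics can be illustrated with a toy tally. This is a sketch of one common single-winner ranked-choice variant (instant-runoff) with naive tie-breaking; the actual counting rules and ballot options would be whatever the vote thread specifies, and the example ballots below are hypothetical preferences over the chair term-limit question.

```python
from collections import Counter

def instant_runoff(ballots):
    """Single-winner ranked-choice tally: repeatedly eliminate the option
    with the fewest first-choice votes until one option holds a majority
    of the remaining ballots. Ties are broken naively by first appearance."""
    ballots = [list(b) for b in ballots]
    while True:
        firsts = Counter(b[0] for b in ballots if b)
        total = sum(firsts.values())
        leader, count = firsts.most_common(1)[0]
        if count * 2 > total:
            return leader
        loser = min(firsts, key=firsts.get)
        # Remove the eliminated option; each ballot's next choice moves up.
        ballots = [[c for c in b if c != loser] for b in ballots]

# Hypothetical ranked ballots over the chair term-limit options:
ballots = [
    ["no limit", "2 years", "1 year"],
    ["1 year", "2 years", "no limit"],
    ["2 years", "1 year", "no limit"],
    ["1 year", "no limit", "2 years"],
    ["2 years", "1 year", "no limit"],
]
print(instant_runoff(ballots))  # "no limit" is eliminated; "2 years" wins
```

Note how the "no limit" voter's second choice decides the outcome, which is the point of ranked ballots: no vote is wasted on an eliminated option.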


Re: [Discussion] Apache Iceberg Community Guideline - Initial Version

2024-07-01 Thread Owen O'Malley
Sorry for coming into this conversation late, but I have a lot of
experience with writing the bylaws for Apache projects (Hadoop & ORC). As a
neutral third party (not working for Databricks or a cloud provider) who
has a lot of Apache experience, I'd like to offer my service as a moderator
for the discussion. I don't think it is appropriate for a small group to
come back with a finished product for a final vote, especially during the
summer when lots of people are travelling; this process should be much more
gradual and inclusive.

.. Owen

On Mon, Jul 1, 2024 at 7:21 AM Jack Ye  wrote:

> Hi everyone,
>
> Thanks for all the comments and feedback on the document, I am working
> with a few commenters on some additional changes and wording, and then will
> carry out the vote.
>
> Best,
> Jack Ye
>
> On Thu, Jun 27, 2024 at 11:02 AM Jack Ye  wrote:
>
>> To provide an update here, I have consolidated most of the comments in
>> the initial version, with the following changes:
>>
>> (1) condensed the section of roles and responsibilities, with pointers to
>> different pages in ASF and existing Iceberg project pages.
>>
>> (2) clarified voting details, regarding things like partial votes,
>> difference of voting on mailing lists vs voting on GitHub PRs
>>
>> (3) clarified the section regarding lazy consensus. There is a definition
>> difference between the ASF definition (no +1 vote needed) vs the ORC
>> definition (1 +1 vote). I renamed the ORC version as "minimum consensus"
>> instead.
>>
>> (4) updated "Modify Code" vote type to minimum consensus. This is a bit
>> different from ASF definition for code modification, but I think we are
>> coming to an agreement that the ASF definition is outdated. Minimum
>> consensus seems to make the most sense given the way we operate Iceberg so
>> far, which is basically at least 1 committer other than the author needs to
>> approve a PR before merging.
>>
>> (5) updated all decisions regarding committers and PMC members and
>> guideline updates to majority approval, following the ASF guideline on
>> voting for procedural issues.
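
The distinction in (3) is small but consequential, since it decides whether a change can pass with no explicit approval at all. A sketch, illustrative only, using the two definitions as Jack describes them:

```python
def asf_lazy_consensus(plus_ones, vetoes):
    """ASF-style lazy consensus: passes after the waiting period as long
    as there are no vetoes; zero +1 votes are required."""
    return vetoes == 0

def minimum_consensus(plus_ones, vetoes):
    """The ORC-style rule renamed "minimum consensus" above: at least one
    +1 (e.g. a reviewer other than the author) and no vetoes."""
    return plus_ones >= 1 and vetoes == 0

# A PR with no explicit approvals would pass ASF lazy consensus
# but not minimum consensus:
print(asf_lazy_consensus(0, 0))   # True
print(minimum_consensus(0, 0))    # False
```

Under change (4) above, "Modify Code" uses the stricter rule, matching the existing practice of requiring one non-author committer approval before merging.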
>>
>> Let me know if there is anything else we see major disagreements with,
>> and I will organize a vote after 24 hours.
>>
>> Best,
>> Jack Ye
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Jun 26, 2024 at 11:04 AM Jack Ye  wrote:
>>
>>> +1 for adding to the site.
>>>
>>> I am putting it as a doc for now since Google doc is easier to comment
>>> (I think?). My plan is to:
>>>
>>> (1) publish it as a PR after a vote has passed. We can do one more
>>> sanity check in the PR, but the information will be exactly as it is
>>> presented in the Google doc, maybe adding some additional links to more
>>> easily jump among the sections or to other pages in the site, and fixing
>>> some grammar issues that were overlooked.
>>>
>>> (2) keep a changelog within the document itself. Because we have moved
>>> the site multiple times in the past, I am not really confident that we
>>> could just track history with Git commit history, especially with such an
>>> important document. I would like to add a changelog section in the end,
>>> documenting what change has been approved when, with links to devlist
>>> discussions and votes.
>>>
>>> For how we tackle the other topics, my plan is to pass the initial
>>> version first, and then we just go through all the identified topics one by
>>> one. I have a list of all topics in the original feedback collection
>>> devlist thread.
>>>
>>> Let me know what you think about these plans!
>>>
>>> Best,
>>> Jack Ye
>>>
>>>
>>>
>>> On Wed, Jun 26, 2024 at 9:04 AM Ryan Blue 
>>> wrote:
>>>
 +1 for adding this to the site once we agree on the changes.

 One thing that has been raised several times but hasn't yet been
 addressed is how we want to tackle this. Many of us have asked to review
 the additional bylaws individually and discuss the purpose and merits of
 each one. It's great to have an overall doc (much like our integrated PRs
 to give context) but I think we should start having separate discussions
 about the rationale for each bylaw to make progress.

 Ryan

 On Wed, Jun 26, 2024 at 8:57 AM Micah Kornfield 
 wrote:

> Hi Jack,
> I think it would make sense to convert this to a PR, so it can be
> version tracked in the future (and that way it avoids another review if
> the intent is to transition to GitHub)?
>
> Thanks,
> Micah
>
> On Tue, Jun 25, 2024 at 9:07 AM Jack Ye  wrote:
>
>> Hi everyone,
>>
>> Thanks for the feedback in the bylaws document discussion thread! As
>> suggested, I have removed all the topics that require further debates, 
>> and
>> created this new doc to serve as the initial version that we can review 
>> and
>> later vote.
>>
>>
>> https://docs.google.com/document/d/1S3igb5NqSlYE3dq_qRsP3X2gwhe54fx-Sxq5hqyOe6I/edit
>>
>> I will organize new devlist threads to 

Re: [DISCUSS] June board report

2024-06-15 Thread Owen O'Malley
Ryan,
   It looks good. Thanks for including the notice about Tabular/Databricks.

.. Owen

On Wed, Jun 12, 2024 at 9:52 PM Ryan Blue  wrote:

> Hi everyone,
>
> Here's my current draft board report for June. If you have anything to add
> or update, please reply and I'll amend the report.
>
> Thanks,
>
> Ryan
>
> ## Description:
> Apache Iceberg is a table format for huge analytic datasets that is
> designed
> for high performance and ease of use.
>
> ## Project Status:
> Current project status: Ongoing
> Issues for the board: None
>
> ## Membership Data:
> Apache Iceberg was founded 2020-05-19 (4 years ago)
> There are currently 27 committers and 16 PMC members in this project.
> The Committer-to-PMC ratio is roughly 7:4.
>
> Community changes, past quarter:
> - No new PMC members. Last addition was Szehon Ho on 2023-04-20.
> - No new committers. Last addition was Renjie Liu on 2024-03-06.
>
> ## Project Activity:
>
> Releases:
> - 1.5.1 was released on 2024-04-25
> - 1.5.2 was released on 2024-05-09
> - PyIceberg 0.6.1 was released on 2024-04-30
>
> PyIceberg:
> - Contributors are working to release more often
> - Improved retries for Hive catalog locking
> - Added register table support for Glue catalogs
> - Adding metadata table support (snapshots, manifests, etc.)
> - Working toward 0.7.0 release with partitioned writes and staged table
> creation
>
> Rust:
> - Implemented projection to support partition-based file pruning
> - Implemented the inclusive metrics evaluator and predicate pushdown to
> Parquet
> - Added Hive catalog support
> - Improved REST catalog with OAuth2 and custom headers
> - Added integration with DataFusion
>
> Go:
> - Working toward full expression support; added literals
>
> Iceberg Java:
> - The next Java release, 1.6.0, is targeted for release in June
>
> Specs:
> - Discussions about standardizing metadata for materialized views have
> made good
>   progress. The community decided to use existing objects rather than
> creating a
>   new combined table/view object and is working on metadata details.
> - An extension to the REST protocol for privilege GRANT and REVOKE
> operations
>   was proposed.
> - Many discussions for extending the REST protocol are ongoing, including
> adding
>   routes to plan scans, adding auth decisions, and appending data files
> - There are also discussions for v3 features, like additional types
> (variant,
>   timestampns, and others)
>
> ## Community Health:
>
> The Iceberg community continues to be healthy, with a large number of
> commits
> and individual contributors over the past quarter. Although overall commits
> decreased, the change corresponds with the number of opened PRs so the
> change is
> not a concern for health; PRs are getting reviewed.
>
> The community is formalizing design discussions and has added github
> labels and
> documented a process for making changes to community specs.
>
> The community also held the first Iceberg Summit this quarter, with 32
> sessions that are now available on YouTube
> (https://tinyurl.com/iceberg-summit).
> Community members also spoke at CoC EU.
>
> A company that employs 3 PMC members and 2 committers was acquired. The PMC
> members (2 of whom are ASF members) have been reminded to act as
> individuals,
> not as representatives of their employer, when interacting in the
> community.
> Concentration of PMC members is a risk that the community is aware of and
> will
> note in future board reports.
>
> Other projects and announcements:
> - Trino added support for Iceberg views
> - Beam has added an Iceberg sink
> - Confluent, Teradata, and Oracle announced Iceberg support
> - Snowflake announced a new open source REST catalog project
> - Databricks released its catalog that implements the REST spec
>
> --
> Ryan Blue
> Tabular
>


Re: Call for Ryan Blue to Step Down as PMC Chair

2024-06-05 Thread Owen O'Malley
I strongly disagree with asking Ryan to step down. For those who don't know
me, I'm an Iceberg PMC member, Apache member, and
was a mentor and champion for Iceberg when it entered the Apache Incubator
.
I've never worked at either Tabular or Databricks.

Over the years, I've had a lot of discussions with Ryan about Apache in
general and Iceberg specifically. Ryan's always impressed me with his
commitment to doing the right thing for the open source communities that he
works in. In particular, I think Ryan's done an amazing job of encouraging
Iceberg's community and technology.

That said, one of the danger signs for open source projects is when a
majority of the PMC members or committers are employed by a single company.
Towards that end, I'd encourage Ryan in his next quarterly report to the
Apache Board to mention the acquisition as a risk factor for Iceberg.

On a side note, discussions about individuals on Apache projects should in
general happen on the project's private list and not in public.

.. Owen

On Wed, Jun 5, 2024 at 4:13 AM Kanou Natsukawa 
wrote:

> Hi community,
>
> I'm calling for Ryan Blue to step down as Iceberg PMC chair. With the
> recent acquisition of Tabular by Databricks [1], I believe there is a
> natural conflict of interest for him to continue to be the chair of the
> Iceberg project.
>
> Tabular's official messages will likely come and say something along the
> lines of remaining neutral, but in fact everyone knows that it is not
> possible when they have signed a contract with the company owning the
> competing project, and the contract has so much money involved.
>
> I have only contributed to Iceberg once, but I still see myself as a part
> of the community. I really like how Iceberg used to be, just a very
> well-designed table format. It started to change when Tabular was formed
> and started to do their REST catalog, but Tabular has been a small player
> in the industry that their control is in general not hurting the project.
> The startup also did many great things like py-iceberg after all, and I
> guess large companies also love the REST idea since they have the resource
> to build one, it's just not every company is Netflix or Apple. With
> Databricks, I am deeply worried about the direction of the project.
>
> I propose having someone from Apple (Russell, Anton, Yufei, Steven,
> Szehon), or Jack Ye from AWS to take the PMC chair position instead, as
> they are very active PMC members in the community, and have a much more
> neutral position to safely lead the project in the right direction.
>
> And also to other Iceberg PMC members and committers from Tabular, you
> have gained a lot of wealth from this, at this moment the best thing I hope
> you can do is please leave this project alone and out of your hands.
>
> [1] https://www.databricks.com/blog/databricks-tabular
>
> Thanks
> Natsukawa
>


Re: [PROPOSAL] Preparing first Apache Iceberg Summit

2023-09-22 Thread Owen O'Malley
It is also important to consider who is on the program committee and their
affiliations. It also helps if the PC discourages sales talks (especially
with proprietary extensions!) They should encourage technical ones about
development and usage of the Apache project.

.. Owen

On Sep 22, 2023, at 11:19, Ryan Blue  wrote:

To me, this proposal is getting a bit ahead of where I'm comfortable. I was
expecting this to address some of the big questions about how to run an
event like this from an open source community, but it seems to be assuming
that the event will happen and addresses logistics.

Here's an example of what I mean: the doc suggests the different levels of
sponsors, slots at those levels, and prices. But I think the main question
the community has to think through before we get there --- and what I'd
expect in a proposal --- is how to ensure that such an event remains
commercially neutral. Because you're asking for this to be an "official"
event from the community and using its trademarks, we need to think through
how we want to strike that balance.

I'd like to see more of a proposal around:
1. Should the Iceberg community put on an event? Clearly, we like the idea
of exchanging ideas, but it goes beyond that.
2. How would we balance the interests of different parts of the community
and why should we take that approach?

We have a lot of different companies contributing and building around
Iceberg. The last thing that we want to do is give the impression that the
community is in any way "pay-to-play" --- that's one of this community's
distinguishing features.

Also, I want to disclose that I'm also associated with a vendor (Tabular).
With that hat on, I think we'd happily sponsor an event like this. But with
my community hat on I want to make sure we plan it out and think carefully.

Ryan

On Fri, Sep 22, 2023 at 2:09 AM Jean-Baptiste Onofré  wrote:

Hi guys,

Finally (sorry for the long wait :)), a first formal Iceberg Summit
proposal doc is ready to be populated/reviewed:

https://docs.google.com/document/d/1Uy9-qRxLtjMWJkRXsjj94Vq3VO1Mc0wGz_bnisevNh8/edit?usp=sharing

Anyone can edit the document, so feel free to complete or ask
questions via comments.

Thanks !
Regards
JB

On Wed, Aug 23, 2023 at 11:25 PM Jean-Baptiste Onofré  wrote:
>
> It sounds great :)
>
> I will include a note in the proposal doc about that.
>
> Regards
> JB
>
> On Wed, Aug 23, 2023 at 2:14 PM, Brian Olsen  wrote:
>>
>> Out of curiosity, is anyone strongly opposed to doing antics like this for summits?
>>
>> https://youtube.com/playlist?list=PLFnr63che7wYFsknFAqisURvfm96rW0Dr
>>
>>
>> On Mon, Aug 21, 2023 at 6:58 PM Matt Topol  wrote:
>>>
>>> I don't think I'll have much time to contribute to help, but I would absolutely help if possible.
>>>
>>> That said, I'll definitely want to give a talk / speak at this summit when it happens :)
>>>
>>> On Mon, Aug 21, 2023 at 1:38 AM Jean-Baptiste Onofré  wrote:

 Hi guys,

 I'm back from vacation and I'm resuming the work on the Iceberg Summit
 proposal doc. I will share the doc asap.

 Regards
 JB

 On Wed, Jul 5, 2023 at 4:37 PM Jean-Baptiste Onofré  wrote:
 >
 > Hi everyone,
 >
 > I started a discussion on the private mailing list, and, as there are
 > no objections from the PMC members, I'm moving the thread to the dev
 > mailing list.
 >
 > I propose to organize the first Apache Iceberg Summit \o/
 >
 > For the format, I think the best option is a virtual event with a mix of:
 > 1. Dev community talks: architecture, roadmap, features, use in "products", ...
 > 2. User community talks: companies could present their use cases, best
 > practices, ...
 >
 > In terms of organization:
 > 1. no objection so far from the PMC members to use Apache Iceberg
 > Summit name. If it works for everyone, I will send a message to the
 > Apache Publicity & Marketing to get their OK for the event.
 >  2. create two committees:
 >   2.1. the Sponsoring Committee gathering companies/organizations
 > wanting to sponsor the event
 >   2.2. the Program Committee gathers folks from the Iceberg community
 > (PMC/committers/contributors) to select talks.
 >
 > My company (Dremio) will “host” the event - i.e., provide funding, a
 > conference platform, sponsor logistics, speaker training, slide
 > design, etc..
 >
 > In terms of dates, as CommunityOverCode Con NA will be in October, I
 > think January 2024 would work: it gives us time to organize smoothly,
 > promote the event, and not in a rush.
 >
 > I propose:
 > 1. to create the #summit channel on Iceberg Slack.
 > 2. I will share a preparation document with a plan proposal.
 >
 > Thoughts ?
 >
 > Regards
 > JB
--
Ryan Blue
Tabular


ApacheCon Iceberg BOF

2022-10-05 Thread Owen O'Malley
All,
   There is an Iceberg Birds of a Feather meet up at ApacheCon in an hour
(5:50pm CDT). Please come by and join us, if you are attending.

Thanks,
   Owen


Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-04 Thread Owen O'Malley
At the stripe boundaries, the bytes-on-disk statistics are accurate. A
stripe that is in flight is going to be an estimate, because the
dictionaries can't be compressed until the stripe is flushed. The memory
usage will be a significant overestimate, because it includes buffers that
are allocated but not used yet.

.. Owen

On Fri, Mar 4, 2022 at 5:23 PM Dongjoon Hyun  wrote:

> The following is merged for Apache ORC 1.7.4.
>
> ORC-1123 Add `estimateMemory` method for writer
>
> According to the Apache ORC milestone, it will be released on May 15th.
>
> https://github.com/apache/orc/milestones
>
> Bests,
> Dongjoon.
>
> On 2022/03/04 13:11:15 Yiqun Zhang wrote:
> > Hi Openinx
> >
> > Thank you for initiating this discussion. I think we can get the
> `TypeDescription` from the writer, and from the `TypeDescription` we know
> which types are used and, more precisely, the maximum length of the
> varchar/char columns. This will help us to estimate the average width.
> >
> > Also, I agree with your suggestion. I will make a PR later to add the
> > `estimateMemory` public method for Writer.
> >
> > On 2022/03/04 04:01:04 OpenInx wrote:
> > > Hi Iceberg dev
> > >
> > > As we all know, in our current apache iceberg write path, the ORC file
> > > writer cannot just roll over to a new file once its byte size reaches
> > > the expected threshold. The core reason that we didn't support this
> > > before is the lack of a correct approach to estimate the byte size from
> > > an unclosed ORC writer.
> > >
> > > In this PR: https://github.com/apache/iceberg/pull/3784, hiliwei is
> > > trying to propose an estimation approach to fix this fundamentally (it
> > > also enables all those ORC writer unit tests that we intentionally
> > > disabled before).
> > >
> > > The approach is: if a file is still unclosed, let's estimate its size
> > > in three steps (PR:
> > > https://github.com/apache/iceberg/pull/3784/files#diff-e7fcc622bb5551f5158e35bd0e929e6eeec73717d1a01465eaa691ed098af3c0R107
> > > )
> > >
> > > 1. Size of data that has been written to the stripes. The value is
> > > obtained by summing the offset and length of the last stripe of the
> > > writer.
> > > 2. Size of data that has been submitted to the writer but has not been
> > > written to a stripe. When creating OrcFileAppender, treeWriter is
> > > obtained through reflection, and its estimateMemory is used to estimate
> > > how much memory is being used.
> > > 3. Data that has not been submitted to the writer, that is, the size of
> > > the buffer. The maximum default value of the buffer is used here.
> > >
> > > My feeling is:
> > >
> > > For the file-persisted bytes , I think using the last strip's offset
> plus
> > > its length should be correct. For the memory encoded batch vector , the
> > > TreeWriter#estimateMemory should be okay.
> > > But for the batch vector whose rows did not flush to encoded memory,
> using
> > > the batch.size shouldn't be correct. Because the rows can be any data
> type,
> > > such as Integer, Long, Timestamp, String etc. As their widths are not
> the
> > > same, I think we may need to use an average width minus the batch.size
> > > (which is row count actually).
> > >
> > > Another thing is about the `TreeWriter#estimateMemory` method,  The
> current
> > > `org.apache.orc.Writer`  don't expose the `TreeWriter` field or
> > > `estimateMemory` method to public,  I will suggest to publish a PR to
> > > apache ORC project to expose those interfaces in
> `org.apache.orc.Writer` (
> > > see: https://github.com/apache/iceberg/pull/3784/files#r819238427 )
> > >
> > > I'd like to invite the iceberg dev to evaluate the current approach.
> Is
> > > there any other concern from the ORC experts' side ?
> > >
> > > Thanks.
> > >
> >
>
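
[Editor's note] The three-part estimate discussed in the thread can be sketched as plain arithmetic. All names below are hypothetical stand-ins, not the actual Iceberg or ORC API; in the real PR, part 2 comes from TreeWriter#estimateMemory (reached via reflection), and part 3 is the point under debate (average row width times row count, rather than a fixed buffer maximum).

```java
// Sketch of the three-part size estimate for an unclosed ORC writer.
public class OrcSizeEstimate {

    static long estimate(long lastStripeOffset, long lastStripeLength,
                         long treeWriterMemory, long batchRowCount, long avgRowWidth) {
        long persisted = lastStripeOffset + lastStripeLength; // 1. bytes flushed to stripes
        long encoded = treeWriterMemory;                      // 2. encoded but unflushed data
        long unflushed = batchRowCount * avgRowWidth;         // 3. rows still in the batch vector
        return persisted + encoded + unflushed;
    }

    public static void main(String[] args) {
        // e.g. last stripe ends at 64 MB and is 4 MB long, 2 MB encoded in
        // memory, and 512 buffered rows of roughly 100 bytes each
        System.out.println(estimate(64L << 20, 4L << 20, 2L << 20, 512, 100)); // prints 73451520
    }
}
```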


Re: Hive table compatibility for Iceberg readers

2022-01-31 Thread Owen O'Malley
On Thu, Jan 27, 2022 at 10:26 PM Walaa Eldin Moustafa 
wrote:

> *2. Iceberg schema lower casing:* Before Iceberg, when users read Hive
> tables from Spark, the returned schema is lowercase since Hive stores all
> metadata in lowercase mode. If users move to Iceberg, such readers could
> break once Iceberg returns proper case schema. This feature is to add
> lowercasing for backward compatibility with existing scripts. This feature
> is added as an option and is not enabled by default.
>

This isn't quite correct. Hive lowercases top-level columns. It does not
lowercase field names inside structs.
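
[Editor's note] A minimal illustration of that distinction, using a toy schema representation (a map of column name to nested field names) rather than Hive's actual metastore code: lowercasing applies only to the outermost column names.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration: lowercase only top-level column names,
// leaving nested struct field names untouched, mirroring Hive's behavior.
public class TopLevelLowercase {

    // schema: column name -> nested field names (empty map for primitives)
    static Map<String, Map<String, String>> lowerTopLevel(Map<String, Map<String, String>> schema) {
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        schema.forEach((col, fields) ->
            result.put(col.toLowerCase(Locale.ROOT), fields)); // nested names kept as-is
        return result;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> schema = new LinkedHashMap<>();
        Map<String, String> structFields = new LinkedHashMap<>();
        structFields.put("InnerField", "string");
        schema.put("OuterCol", structFields);
        System.out.println(lowerTopLevel(schema)); // {outercol={InnerField=string}}
    }
}
```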


> *3. Hive table proper casing:* conversely, we leverage the Avro schema to
> supplement the lower case Hive schema when reading Hive tables. This is
> useful if someone wants to still get proper cased schemas while still in
> the Hive mode (to be forward-compatible with Iceberg). The same flag used
> in (2) is used here.
>

Are there users of Avro schemas in Hive outside of LinkedIn? I've never
seen it used. I don't think you should tie #2 and #3 together.

Supporting default values and union types would be useful extensions.

.. Owen


Re: [CWS] Re: Subject: [VOTE] Release Apache Iceberg 0.12.0 RC3

2021-08-16 Thread Owen O'Malley
OK, this is after the vote closed, but I did:
* verified tag is same as the tar ball
* verified checksums and signatures
* built and ran the tests

My one complaint is that I get test failures that look like they are
timezone-related: ORC and Parquet tests fail with timestamps 7 or 8 hours
off.

.. Owen

On Sun, Aug 15, 2021 at 1:10 AM Carl Steinbach  wrote:

> The 0.12.0 release notes are ready for review here:
> https://github.com/apache/iceberg/pull/2973
>
> - Carl
>
> On Sat, Aug 14, 2021 at 6:06 PM Carl Steinbach  wrote:
>
>> Voting is now over. The motion to release RC3 as the Apache Iceberg
>> 0.12.0 release passes with the following results:
>>
>> 3 binding +1s
>> 3 non-binding +1s
>>
>> Thanks.
>>
>> - Carl
>>
>> On Sat, Aug 14, 2021 at 4:42 PM Ryan Blue  wrote:
>>
>>> Everything is still looking good to me. I also tested Spark 3.1 using
>>> the following configuration:
>>>
>>> /home/blue/Apps/spark-3.1.1-bin-hadoop3.2/bin/spark-shell \
>>> --conf 
>>> spark.jars.repositories=https://repository.apache.org/content/repositories/orgapacheiceberg-1018/
>>>  \
>>> --packages org.apache.iceberg:iceberg-spark3-runtime:0.12.0 \
>>> --conf 
>>> spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
>>>  \
>>> --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
>>> --conf spark.sql.catalog.local.type=hadoop \
>>> --conf 
>>> spark.sql.catalog.local.warehouse=/home/blue/tmp/hadoop-warehouse \
>>> --conf spark.sql.catalog.local.default-namespace=default \
>>> --conf spark.sql.catalog.prodhive=org.apache.iceberg.spark.SparkCatalog 
>>> \
>>> --conf spark.sql.catalog.prodhive.type=hive \
>>> --conf 
>>> spark.sql.catalog.prodhive.warehouse=/home/blue/tmp/prod-warehouse \
>>> --conf spark.sql.catalog.prodhive.default-namespace=default \
>>> --conf spark.sql.defaultCatalog=local
>>>
>>>
>>>- Tested metadata tables (files, manifests, history)
>>>- Tested ALTER TABLE ADD PARTITION
>>>- Tested MERGE INTO
>>>- Tested updating a table to v2 via SET TBLPROPERTIES
>>>- Tested ALTER TABLE DROP PARTITION with v2 behavior (remove field)
>>>- Tested reading data in old partition spec
>>>- Tested DELETE FROM
>>>
>>> I also built local projects using 0.12.0 plus a couple of internal
>>> patches and tests are passing.
>>>
>>> Ryan
>>>
>>> On Sat, Aug 14, 2021 at 2:41 PM Daniel Weeks 
>>> wrote:
>>>
 +1 (binding)

 Verified sigs, sums, license, build, and tests.

 -Dan

 On Fri, Aug 13, 2021 at 5:05 PM Ryan Blue  wrote:

> +1 (binding)
>
>- Checked signatures, checksums, and RAT
>- Ran build and test. There were only failures in
>org.apache.iceberg.mr.hive.TestHiveIcebergStorageHandlerWithEngine
>that I think I hit last time
>
> I’ll do more checking over the weekend, but right now it looks good!
>
> On Fri, Aug 13, 2021 at 3:52 PM Carl Steinbach  wrote:
>
>> +1 (binding)
>>
>> * Checked signatures of all artifacts.
>> * Ran build and test to completion without failures.
>> * Verified that RAT checks pass and that dates have the correct year.
>>
>> - Carl
>>
>> On Wed, Aug 11, 2021 at 12:59 AM John Zhuge 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> - Checked signature, checksum, and license.
>>> - Ran build and test (failures in iceberg-mr and iceberg-hive3)
>>>
>>> On Tue, Aug 10, 2021 at 10:12 PM Szehon Ho 
>>> wrote:
>>>
 +1 (non binding)

 * Checked Signature Keys
 * Verified Checksum
 * Rat checks
 * Build and run tests, most functionality pass (also timeout errors
 on Hive-MR)

 Thanks
 Szehon

 On Tue, Aug 10, 2021 at 1:40 AM Ryan Murray 
 wrote:

> +1 (non-binding)
>
> * Verify Signature Keys
> * Verify Checksum
> * dev/check-license
> * Build
> * Run tests (though some timeout failures, on Hive MR test..)
> * ran with Nessie in spark 3.1 and 3.0
>
> On Tue, Aug 10, 2021 at 4:21 AM Carl Steinbach 
> wrote:
>
>> Hi Everyone,
>>
>> I propose the following RC to be released as the official Apache
>> Iceberg 0.12.0 release.
>>
>> The commit ID is 7ca1044655694dbbab660d02cef360ac1925f1c2
>> * This corresponds to the tag: apache-iceberg-0.12.0-rc3
>> *
>> https://github.com/apache/iceberg/commits/apache-iceberg-0.12.0-rc3
>> *
>> https://github.com/apache/iceberg/tree/7ca1044655694dbbab660d02cef360ac1925f1c2
>>
>> The release tarball, signature, and checksums are here:
>> *
>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.12.0-rc3/
>>
>> You can find the KEYS file here:
>> * 

Re: Default TimeZone for unit tests

2021-03-01 Thread Owen O'Malley
In ORC, the timezone tests vary the default timezone through multiple
values using the Java APIs. (They do restore the initial value when the
test exits.) :)

.. Owen
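
[Editor's note] That pattern (set the default, run the check, restore in a finally block) can be sketched with stdlib Java. The zone IDs and the check body below are illustrative, not the actual ORC test code.

```java
import java.util.TimeZone;

// Sketch of varying the JVM default time zone for a test and restoring
// it afterwards, as the ORC test suite does.
public class TimeZoneTestSketch {

    static void runInZones(Runnable check, String... zoneIds) {
        TimeZone initial = TimeZone.getDefault();
        try {
            for (String id : zoneIds) {
                TimeZone.setDefault(TimeZone.getTimeZone(id));
                check.run(); // would be the timestamp-sensitive test body
            }
        } finally {
            TimeZone.setDefault(initial); // restore no matter what failed
        }
    }

    public static void main(String[] args) {
        runInZones(
            () -> System.out.println(TimeZone.getDefault().getID()),
            "UTC", "America/Los_Angeles", "Europe/Budapest");
    }
}
```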

On Mon, Mar 1, 2021 at 9:25 PM Edgar Rodriguez
 wrote:

> Hi folks,
>
> Thanks Peter for the quick fix!
>
> I do think it'd be a good idea to have this kind of coverage to some
> extent. A workflow some users follow is to run locally only the modules
> they modify and rely on the CI to run the full check, which takes longer.
> That leaves room for these issues to land in master until someone
> eventually finds the broken test. However, I do agree that we probably
> should not spend a large amount of time on this - ideally it'd be great
> to do this in CI, e.g. by having two CI jobs, one for UTC and another for
> a different TZ.
>
> Cheers,
>
> On Mon, Mar 1, 2021 at 2:52 PM Ryan Blue 
> wrote:
>
>> I'm not sure it would be worth separating out the timezone tests to do
>> this. I think we catch these problems pretty quickly with the number of
>> users building in different zones. Is this something we want to spend time
>> on?
>>
>> On Mon, Mar 1, 2021 at 10:29 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> In the Spark Cassandra Connector we had a similar issue; we would
>>> specifically spawn test JVMs with different default local time zones to
>>> make sure we handled these use cases. I also made our test dates fall on
>>> Gregorian calendar boundaries, so that being an hour off would produce a
>>> timestamp that was actually several days off, making the failure clear.
>>>
>>> So maybe it makes sense to break out some timestamp-specific tests and
>>> have them run with different local timezones? Then you have UTC, PST,
>>> CET, or whatever test suites to run. If we scope this to just
>>> timestamp-specific tests it shouldn't be that much more expensive, and I
>>> do think the coverage is important.
>>>
>>> On Mon, Mar 1, 2021 at 12:25 PM Peter Vary 
>>> wrote:
>>>
 Hi Team,

 Last weekend I caused a little bit of a stir by pushing changes which had
 a green run on CI but were failing locally if the default TZ was
 different from UTC.

 Do we want to set the TZ of the CI tests to some random non-UTC TZ to
 catch these errors?

 Pros:

- We can catch tests which are only working in UTC


 Cons:

- I think the typical TZ is UTC in our target environments, so
catching UTC problems might be more important


 I am interested in your thoughts about this.

 Thanks,
 Peter

>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Edgar R
>


Type attributes

2021-01-04 Thread Owen O'Malley
One of the challenges that we have at LinkedIn is that we have a *lot* of
Avro schemas. I'd like to be able to represent those Avro schemas using
Iceberg's types and there are a few challenges:

   - unions
   - enums
   - default values

One way out of those problems without extending the Iceberg type model is
to add type attributes where each sub-type has a logical string to string
map that can hold user-defined attributes.

Another use for those kind of attributes are to mark columns with
classification tags (eg. pii, etc.).

Thoughts,
Owen
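
[Editor's note] A rough sketch of what such attributed types might look like. This is entirely hypothetical and is not a proposed Iceberg API; it just shows a sub-type carrying a string-to-string attribute map, as described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a type wrapper carrying user-defined string
// attributes, e.g. to mark a struct as representing an Avro union or to
// tag a column with a classification like PII.
public class AttributedType {
    final String typeName;
    final Map<String, String> attributes = new LinkedHashMap<>();

    AttributedType(String typeName) {
        this.typeName = typeName;
    }

    AttributedType with(String key, String value) {
        attributes.put(key, value);
        return this;
    }

    public static void main(String[] args) {
        AttributedType t = new AttributedType("struct")
            .with("logical-type", "avro-union")  // mark struct as a union
            .with("classification", "pii");      // classification tag
        System.out.println(t.typeName + " " + t.attributes);
    }
}
```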


Re: Iceberg - Hive schema synchronization

2020-11-24 Thread Owen O'Malley
You left the complex types off of your list (struct, map, array,
uniontype). All of them have natural mappings in Iceberg, except for
uniontype. Interval is supported on output, but not as a column type.
Unfortunately, we have some tables with uniontype, so we'll need a solution
for how to deal with it.

I'm generally in favor of a strict mapping in both type and column
mappings. One piece that I think will help a lot is if we add type
annotations to Iceberg so that for example we could mark a struct as
actually being a uniontype. If someone has the use case where they need to
support Hive's char or varchar types it would make sense to define an
attribute for the max length.

Vivekanand, what kind of conversions do you need? Hive has a *lot* of
conversions, and many of them are more error-prone than useful. (For
example, I seriously doubt anyone found Hive's conversion of timestamps to
booleans useful...)

.. Owen

On Tue, Nov 24, 2020 at 3:46 PM Vivekanand Vellanki 
wrote:

> One of the challenges we've had is that Hive is more flexible with schema
> evolution compared to Iceberg. Are you guys also looking at this aspect?
>
> On Tue, Nov 24, 2020 at 8:21 PM Peter Vary 
> wrote:
>
>> Hi Team,
>>
>> With Shardul we had a longer discussion yesterday about the schema
>> synchronization between Iceberg and Hive, and we thought that it would be
>> good to ask the opinion of the greater community too.
>>
>> We can have 2 sources for the schemas.
>>
>>1. Hive table definition / schema
>>2. Iceberg schema.
>>
>>
>> If we want Iceberg and Hive to work together we have to find a way to
>> synchronize them. Either by defining a master schema, or by defining a
>> compatibility matrix and conversion for them.
>> In previous Hive integrations we can see examples for both:
>>
>>- With Avro there is a possibility to read the schema from the data
>>file directly, and the master schema is the one which is in Avro.
>>- With HBase you can provide a mapping between HBase columns by
>>providing the *hbase.columns.mapping* table property
>>
>>
>> Maybe the differences are caused by how the storage format is perceived:
>> Avro being a simple storage format, HBase being an independent query
>> engine - but this is just a questionable opinion :)
>>
>> I would like us to decide how Iceberg - Hive integration should be
>> handled.
>>
>> There are at least 2 questions:
>>
>>    1. How flexible should we be with the type mapping between Hive and
>>    Iceberg types?
>>       1. Shall we have a strict mapping? This way, if we have an Iceberg
>>       schema we can immediately derive the Hive schema from it.
>>       2. Shall we be more relaxed on this? Automatic casting/conversions
>>       can be built into the integration, allowing users to skip view
>>       and/or UDF creation for typical conversions.
>>    2. How flexible should we be with column mappings?
>>       1. Shall we have a strict 1-to-1 mapping? This way, if we have an
>>       Iceberg schema we can immediately derive the Hive schema from it.
>>       We still have to omit Iceberg columns which do not have a
>>       representation available in Hive.
>>       2. Shall we allow flexibility at Hive table creation to choose
>>       specific Iceberg columns, instead of immediately creating a Hive
>>       table with all of the columns from the Iceberg table?
>>
>>
>> Currently I would choose:
>>
>>    - Strict type mapping, because of the following reasons:
>>       - Faster execution (we want as few checks and conversions as
>>       possible, since they will be executed for every record)
>>       - Complexity increases exponentially with every conversion
>>    - Flexible column mapping:
>>       - I think it will be a typical situation that we have a huge
>>       Iceberg table storing the facts with a big number of columns, and
>>       we would like to create multiple Hive tables above it. The problem
>>       could be solved by creating the table and adding a view above it,
>>       but I think it would be more user-friendly if we could avoid this
>>       extra step.
>>       - The added complexity is at table creation / query planning,
>>       which has a far smaller impact on overall performance
>>
>>
>> I would love to hear your thoughts as well since the choice should really
>> depend on the user base, and what are the expected use-cases.
>>
>> Thanks,
>> Peter
>>
>>
>> Appendix 1 - Type mapping proposal:
>> Iceberg type | Hive2 type       | Hive3 type                    | Status
>> boolean      | BOOLEAN          | BOOLEAN                       | OK
>> int          | INTEGER          | INTEGER                       | OK
>> long         | BIGINT           | BIGINT                        | OK
>> float        | FLOAT            | FLOAT                         | OK
>> double       | DOUBLE           | DOUBLE                        | OK
>> decimal(P,S) | DECIMAL(P,S)     | DECIMAL(P,S)                  | OK
>> binary       | BINARY           | BINARY                        | OK
>> date         | DATE             | DATE                          | OK
>> timestamp    | TIMESTAMP        | TIMESTAMP                     | OK
>> timestamptz  | TIMESTAMP        | TIMESTAMP WITH LOCAL TIMEZONE | TODO
>> string       | STRING           | STRING                        | OK
>> uuid         | STRING or BINARY | STRING or BINARY              | TODO
>> time         | -                | -                             | -
>> fixed(L)     | -                | -                             | -
>> -            | TINYINT          | TINYINT                       | -
>> -            | SMALLINT
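
[Editor's note] The strict type mapping proposed in the appendix can be expressed as a simple lookup. The sketch below covers only the Hive2 column and is illustrative, not the actual integration code.

```java
// Sketch of a strict Iceberg -> Hive2 type mapping following the
// appendix table above; unmapped types return null.
public class StrictTypeMapping {

    static String toHive2(String icebergType) {
        if (icebergType.startsWith("decimal")) {
            return icebergType.toUpperCase(java.util.Locale.ROOT); // DECIMAL(P,S)
        }
        switch (icebergType) {
            case "boolean":     return "BOOLEAN";
            case "int":         return "INTEGER";
            case "long":        return "BIGINT";
            case "float":       return "FLOAT";
            case "double":      return "DOUBLE";
            case "binary":      return "BINARY";
            case "date":        return "DATE";
            case "timestamp":
            case "timestamptz": return "TIMESTAMP";
            case "string":      return "STRING";
            default:            return null; // e.g. time, fixed(L): no Hive mapping
        }
    }

    public static void main(String[] args) {
        System.out.println(toHive2("long"));          // BIGINT
        System.out.println(toHive2("decimal(10,2)")); // DECIMAL(10,2)
    }
}
```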

Re: SQL compatibility of Iceberg Expressions

2020-09-18 Thread Owen O'Malley
No, you can translate these expressions, but you have to evaluate the
entire expression. For example:

"col1 = 'x' and col2 in (1,2)" becomes col1 = 'x' and col2 in (1,2)
"not(col1 = 'x' and col2 in (1,2))" becomes (col1 != 'x' or col2 not in
(1,2)) and col1 is not null and col2 is not null

Furthermore, ORC does (and Parquet should) already use these semantics.
Therefore, you'll end up translating on both sides:

hive      --\
presto    ---+--> Iceberg --+--> ORC
spark sql --/                \--> Parquet

Since the non-sql use cases have fewer pushdown predicates, having a
translation on that side seems less error-prone.

.. Owen
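
[Editor's note] The translation Owen describes can be sketched as a string-level rewrite. The helper names are hypothetical, not the Iceberg or ORC API: an Iceberg notEqual keeps null-matching rows, so its SQL form needs an explicit null branch, and the reverse direction needs an explicit notNull guard.

```java
// Hypothetical sketch of translating between Iceberg-semantics and
// SQL-semantics predicates for a pushdown layer. Only one operator shown.
public class PredicateTranslation {

    // Iceberg notEqual(col, lit) matches nulls; SQL's != does not,
    // so add the "is null" branch when emitting SQL.
    static String notEqualToSql(String col, String literal) {
        return "(" + col + " != " + literal + " or " + col + " is null)";
    }

    // Going the other way, a SQL != must drop nulls when expressed with
    // Iceberg-style predicates.
    static String sqlNotEqualToIceberg(String col, String literal) {
        return "and(notEqual(\"" + col + "\", " + literal + "), notNull(\"" + col + "\"))";
    }

    public static void main(String[] args) {
        System.out.println(notEqualToSql("col", "'x'"));          // (col != 'x' or col is null)
        System.out.println(sqlNotEqualToIceberg("col", "\"x\"")); // and(notEqual("col", "x"), notNull("col"))
    }
}
```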


On Fri, Sep 18, 2020 at 10:54 PM Ryan Blue 
wrote:

> Are you saying that we can't fix this by rewriting expressions to
> translate from SQL to more natural semantics?
>
> On Fri, Sep 18, 2020 at 3:28 PM Owen O'Malley 
> wrote:
>
>> In the SQL world, the second point isn't right. It is still the case that
>> not(equal("col", "x")) is notEqual("col", "x"). Boolean logic (well, three
>> valued logic) in SQL is just strange relative to programming languages:
>>
>>- null *=* "x" -> null
>>- null *is distinct from* "x" -> true
>>- *not*(null) -> null
>>- null *and* true -> null
>>- null *or* false -> null
>>
>> We absolutely need a null safe equals function (<=> or "is distinct
>> from") also, which is what we currently have as equals. So we really need
>> to have the logical operators also treat null as a special value.
>>
>> .. Owen
>>
>>
>>
>> On Fri, Sep 18, 2020 at 5:54 PM Ryan Blue 
>> wrote:
>>
>>> It would be nice to avoid the problem by changing the semantics of
>>> Iceberg’s notNull, but I don’t think that’s a good idea for 2 main
>>> reasons.
>>>
>>> First, I think that API users creating expressions directly expect the
>>> current behavior. It would be surprising to a user if a notEqual
>>> expression didn’t return nulls. People integrating Iceberg in SQL engines
>>> will be more aware of SQL semantics, especially if the behavior we choose
>>> is documented. I think for API uses, the current behavior is better.
>>>
>>> Second, some evaluations require expressions to be rewritten without
>>> not, so we have a utility that pushes not down an expression tree to
>>> the leaf predicates. Rewriting not(equal("col", "x")) will result in
>>> notEqual("col", "x"). If we were to change the semantics of notEqual,
>>> then this rewrite would no longer be valid. If col is null then it is
>>> not equal to x, and negating that result is true. But notEqual would
>>> give a different answer, so we can’t rewrite it.
>>>
>>> We can work around the rewrite problem by adding an
>>> Expressions.sqlNotEqual method for engines to call that has the SQL
>>> semantics, returning and(notEqual("col", "x"), notNull("col")).
>>>
>>> For pushdown, we should add tests for these cases and rewrite
>>> expressions to account for the difference. Iceberg should push
>>> notEqual("col", "x") to ORC as SQL (col != 'x' or col is null). Presto
>>> can similarly translate col != 'x' to and(notEqual("col", "x"),
>>> notNull("col")).
>>>
>>> rb
>>>
>>> On Fri, Sep 18, 2020 at 9:29 AM Owen O'Malley 
>>> wrote:
>>>
>>>> I think that we should follow the SQL semantics to prevent surprises
>>>> when SQL engines integrate with Iceberg.
>>>>
>>>> .. Owen
>>>>
>>>> On Thu, Sep 17, 2020 at 9:08 PM Shardul Mahadik <
>>>> shardulsmaha...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I noticed that Iceberg's predicates are not compatible with SQL
>>>>> predicates when it comes to handling NULL values. In SQL, if any of the
>>>>> operands of a scalar comparison predicate is NULL, then the resultant 
>>>>> truth
>>>>> value of the predicate is UNKNOWN. e.g. `SELECT NULL != 1` will return a
>>>>> NULL in SQL and not FALSE. If such predicates are used as filters, the
>>>>> resultant output will be different for Iceberg v/s SQL. e.g.
>>>>> `.filter(notEqual(column, 'x'))` in Iceberg will return rows excluding ‘x’
>>>>> but including NU

Re: SQL compatibility of Iceberg Expressions

2020-09-18 Thread Owen O'Malley
In the SQL world, the second point isn't right. It is still the case that
not(equal("col", "x")) is notEqual("col", "x"). Boolean logic (well, three
valued logic) in SQL is just strange relative to programming languages:

   - null *=* "x" -> null
   - null *is distinct from* "x" -> true
   - *not*(null) -> null
   - null *and* true -> null
   - null *or* false -> null

We absolutely need a null safe equals function (<=> or "is distinct from")
also, which is what we currently have as equals. So we really need to have
the logical operators also treat null as a special value.

.. Owen
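
[Editor's note] The truth tables above can be modeled with Java's boxed Boolean, using null for SQL's UNKNOWN. This is a sketch of the semantics, not library code.

```java
// Sketch of SQL three-valued logic using Boolean, with null as UNKNOWN.
public class ThreeValuedLogic {

    static Boolean not(Boolean a) {
        return a == null ? null : !a; // not(null) -> null
    }

    static Boolean and(Boolean a, Boolean b) {
        if (Boolean.FALSE.equals(a) || Boolean.FALSE.equals(b)) return false;
        if (a == null || b == null) return null; // null and true -> null
        return true;
    }

    static Boolean or(Boolean a, Boolean b) {
        if (Boolean.TRUE.equals(a) || Boolean.TRUE.equals(b)) return true;
        if (a == null || b == null) return null; // null or false -> null
        return false;
    }

    static Boolean eq(String a, String b) {
        return (a == null || b == null) ? null : a.equals(b); // null = 'x' -> null
    }

    static boolean isDistinctFrom(String a, String b) {
        // null-safe comparison: never returns UNKNOWN
        return a == null ? b != null : !a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(eq(null, "x"));             // null
        System.out.println(isDistinctFrom(null, "x")); // true
        System.out.println(and(null, true));           // null
        System.out.println(or(null, false));           // null
    }
}
```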



On Fri, Sep 18, 2020 at 5:54 PM Ryan Blue  wrote:

> It would be nice to avoid the problem by changing the semantics of
> Iceberg’s notNull, but I don’t think that’s a good idea for 2 main
> reasons.
>
> First, I think that API users creating expressions directly expect the
> current behavior. It would be surprising to a user if a notEqual
> expression didn’t return nulls. People integrating Iceberg in SQL engines
> will be more aware of SQL semantics, especially if the behavior we choose
> is documented. I think for API uses, the current behavior is better.
>
> Second, some evaluations require expressions to be rewritten without
> not, so we have a utility that pushes not down an expression tree to the
> leaf predicates. Rewriting not(equal("col", "x")) will result in
> notEqual("col", "x"). If we were to change the semantics of notEqual,
> then this rewrite would no longer be valid. If col is null then it is
> not equal to x, and negating that result is true. But notEqual would
> give a different answer, so we can’t rewrite it.
>
> We can work around the rewrite problem by adding an
> Expressions.sqlNotEqual method for engines to call that has the SQL
> semantics, returning and(notEqual("col", "x"), notNull("col")).
>
> For pushdown, we should add tests for these cases and rewrite
> expressions to account for the difference. Iceberg should push
> notEqual("col", "x") to ORC as SQL (col != 'x' or col is null). Presto
> can similarly translate col != 'x' to and(notEqual("col", "x"),
> notNull("col")).
>
> rb
>
> On Fri, Sep 18, 2020 at 9:29 AM Owen O'Malley 
> wrote:
>
>> I think that we should follow the SQL semantics to prevent surprises when
>> SQL engines integrate with Iceberg.
>>
>> .. Owen
>>
>> On Thu, Sep 17, 2020 at 9:08 PM Shardul Mahadik <
>> shardulsmaha...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I noticed that Iceberg's predicates are not compatible with SQL
>>> predicates when it comes to handling NULL values. In SQL, if any of the
>>> operands of a scalar comparison predicate is NULL, then the resultant truth
>>> value of the predicate is UNKNOWN. e.g. `SELECT NULL != 1` will return a
>>> NULL in SQL and not FALSE. If such predicates are used as filters, the
>>> resultant output will be different for Iceberg v/s SQL. e.g.
>>> `.filter(notEqual(column, 'x'))` in Iceberg will return rows excluding ‘x’
>>> but including NULL. The same thing in Presto SQL `WHERE column != 'x'` will
>>> return rows excluding both ‘x’ and NULL. So essentially, Iceberg can return
>>> more rows than required when an engine pushes down these predicates,
>>> however the engines will filter out these additional rows, so everything
>>> seems good. But modules like iceberg-data and iceberg-mr which rely solely
>>> on Iceberg's expression evaluators for filtering will return the additional
>>> rows. Should we change the behavior of Iceberg expressions to be more
>>> SQL-like or should we keep this behavior and document the differences when
>>> compared with SQL?
>>>
>>> This also has some implications on predicate pushdown e.g. ORC follows
>>> SQL semantics and if we try to push down Iceberg predicates, simply
>>> converting Iceberg's 'NOT EQUAL' to ORC's 'NOT EQUAL' will be insufficient
>>> as it does not return NULLs contrary to what Iceberg expects.
>>>
>>> Thanks,
>>> Shardul
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: SQL compatibility of Iceberg Expressions

2020-09-18 Thread Owen O'Malley
I think that we should follow the SQL semantics to prevent surprises when
SQL engines integrate with Iceberg.

.. Owen

On Thu, Sep 17, 2020 at 9:08 PM Shardul Mahadik 
wrote:

> Hi all,
>
> I noticed that Iceberg's predicates are not compatible with SQL predicates
> when it comes to handling NULL values. In SQL, if any of the operands of a
> scalar comparison predicate is NULL, then the resultant truth value of the
> predicate is UNKNOWN. e.g. `SELECT NULL != 1` will return a NULL in SQL and
> not FALSE. If such predicates are used as filters, the resultant output
> will be different for Iceberg v/s SQL. e.g. `.filter(notEqual(column,
> 'x'))` in Iceberg will return rows excluding ‘x’ but including NULL. The
> same thing in Presto SQL `WHERE column != 'x'` will return rows excluding
> both ‘x’ and NULL. So essentially, Iceberg can return more rows than
> required when an engine pushes down these predicates, however the engines
> will filter out these additional rows, so everything seems good. But
> modules like iceberg-data and iceberg-mr which rely solely on Iceberg's
> expression evaluators for filtering will return the additional rows. Should
> we change the behavior of Iceberg expressions to be more SQL-like or should
> we keep this behavior and document the differences when compared with SQL?
>
> This also has some implications on predicate pushdown e.g. ORC follows SQL
> semantics and if we try to push down Iceberg predicates, simply converting
> Iceberg's 'NOT EQUAL' to ORC's 'NOT EQUAL' will be insufficient as it does
> not return NULLs contrary to what Iceberg expects.
>
> Thanks,
> Shardul
>


Re: Iceberg sync notes - 9 September 2020

2020-09-14 Thread Owen O'Malley
As I mentioned in the meetup, ORC 1.6.4 was pending and has now been
released. It should be available on Maven Central tomorrow.

.. Owen

On Mon, Sep 14, 2020 at 10:38 PM Ryan Blue 
wrote:

> Hi everyone,
>
> I just updated the Iceberg sync doc with my notes. Feel free to add
> corrections or additional context!
>
> There was quite a bit of discussion, so I want to highlight a few things
> that we talked about for more discussion on the dev list:
>
> 1. 0.10.0 blocker issues
> - Java 11 flaky tests (fixed in PR #1446)
> - Flink checkpoint Java serialization errors (PR #1438)
> - Probably will *not* wait for Hive projection
> - Please bring up any other blockers!
> 2. The general consensus was that adding a time offset parameter (PR
> #1368) is not a good solution. Instead we should consider using hourly
> partitioning or adding custom partition functions.
> 3. We discussed trying to make snapshot timestamps monotonically
> increasing, but thought that it was probably not worth pursuing (already
> mentioned on the dev list thread).
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [DISCUSS] August board report

2020-08-12 Thread Owen O'Malley
+1 looks good.

On Wed, Aug 12, 2020 at 4:41 PM Ryan Blue  wrote:

> Hi everyone,
>
> Here's a draft of the board report for this month. Please reply with
> anything that you'd like to see added or that I've missed. Thanks!
>
> rb
>
> ## Description:
> Apache Iceberg is a table format for huge analytic datasets that is
> designed
> for high performance and ease of use.
>
> ## Issues:
> There are no issues requiring board attention.
>
> ## Membership Data:
> Apache Iceberg was founded 2020-05-19 (2 months ago)
> There are currently 10 committers and 9 PMC members in this project.
> The Committer-to-PMC ratio is roughly 1:1.
>
> Community changes, past quarter:
> - No new PMC members (project graduated recently).
> - Shardul Mahadik was added as committer on 2020-07-25
>
> ## Project Activity:
> 0.9.0 was released, including support for Spark 3 and SQL DDL commands,
> support
> for JDK 11, vectorized Parquet reads, and an action to compact data files.
>
> Since the 0.9.0 release, the community has made progress in several areas:
> - The Hive StorageHandler now provides access to query Iceberg tables
>   (work is ongoing to implement projection and predicate pushdown).
> - Flink integration has made substantial progress toward using native
> RowData,
>   and the first stage of the Flink sink (data file writers) has been
> committed.
> - An action to expire snapshots using Spark was added and is an
> improvement on
>   the incremental approach because it compares the reachable file sets.
> - The implementation of row-level deletes is nearing completion. Scan
> planning
>   now supports delete files, merge-based and set-based row filters have
> been
>   committed, and delete file writers are under review. The delete file
> writers
>   allow storing deleted row data in support of Flink CDC use cases.
>
> Releases:
> - 0.9.0 was released on 2020-07-13
> - 0.9.1 has an ongoing vote
>
> ## Community Health:
> The month since the last report has been one of the busiest since the
> project
> started. 80 pull requests were merged in the last 4 weeks, and more
> importantly,
> came from 21 different contributors. Both of these are new high watermarks.
>
> Community members gave 2 Iceberg talks at Subsurface Conf, on enabling Hive
> queries against Iceberg tables and working with petabyte-scale Iceberg
> tables.
> Iceberg was also mentioned in the keynotes.
>
> --
> Ryan Blue
>


Re: [VOTE] Release Apache Iceberg 0.9.0 RC5

2020-07-13 Thread Owen O'Malley
On Mon, Jul 13, 2020 at 4:28 PM Anton Okolnychyi
 wrote:

> I think the issue that was brought up by Dongjoon is valid and we should
> document the current caching behavior.
> The problem is also more generic and does not apply only to views, as
> operations that happen through the source directly may not be propagated
> to the catalog.
>

I agree.

I think that by default the table state should be rechecked at the start of
each statement/query.

.. Owen


Re: [VOTE] Release Apache Iceberg 0.9.0 RC5

2020-07-13 Thread Owen O'Malley
+1 (binding)

   - Verified signatures
   - Verified checksum
   - Built src from tarball and ran tests.
   - Looked at JMH dependency to make sure it wasn't leaking into the
   published artifacts.

.. Owen

On Mon, Jul 13, 2020 at 11:00 AM RD  wrote:

> +1
> - verified signatures and checksum
> - Ran RAT checks
> - Build src and ran all tests
> - Ran a simple spark job.
>
> -Best,
> R.
>
>
>
>
> On Mon, Jul 13, 2020 at 8:36 AM Junjie Chen 
> wrote:
>
>> I ran the following steps:
>>- downloaded and verified signature and checksum.
>>- ran ./gradlew build, it took 8m23s on an 8core16g cloud virtual
>> machine.
>>- rebuilt our app with iceberg-spark-runtime-0.9.0.jar and verified on
>> a spark cluster. It works well.
>>
>> +1 (non-binding)
>>
>> On Mon, Jul 13, 2020 at 4:49 PM Jingsong Li 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> - verified signature and checksum
>>> - built from source and run tests
>>> - Validated Spark3: Used Ryan's example command, played with Spark3,
>>> looks very good.
>>> - Validated vectorized reads:  open vectorization-enabled, works well.
>>>
>>> Best,
>>> Jingsong
>>>
>>> On Mon, Jul 13, 2020 at 2:37 PM Gautam  wrote:
>>>

 *Followed the steps:*
 1. Downloaded the source tarball, signature (.asc), and checksum
 (.sha512) from
 https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.9.0-rc5/
 2. Downloaded
 https://dist.apache.org/repos/dist/dev/incubator/iceberg/KEYS
  Import gpg keys: download KEYS and run gpg --import
 /path/to/downloaded/KEYS
 3. Verified the signature by running: gpg --verify
 apache-iceberg-0.9.0.tar.gz.asc
 4. Verified the checksum by running: sha512sum -c
 apache-iceberg-0.9.0.tar.gz.sha512
 5. Untared the archive and go into the source directory: tar xzf
 apache-iceberg-0.9.0.tar.gz && cd apache-iceberg-0.9.0
 6. Ran RAT checks to validate license headers:
   dev/check-license
 7. Build and test the project: ./gradlew build (using Java 8)
  > Build Took ~10mins

 *+1 (non-binding)*


 On Fri, Jul 10, 2020 at 9:20 AM Ryan Murray  wrote:

> 1. Verify the signature: OK
> 2. Verify the checksum: OK
> 3. Untar the archive tarball: OK
> 4. Run RAT checks to validate license headers: RAT checks passed
> 5. Build and test the project: all unit tests passed.
>
> +1 (non-binding)
>
> I did see that my build took >12 minutes and used 100% of all 8 cores
> and 32GB of memory (openjdk-8, Ubuntu 18.04), which I hadn't noticed
> before.
> On Fri, Jul 10, 2020 at 4:37 AM OpenInx  wrote:
>
>> I followed the verify guide here (
>> https://lists.apache.org/thread.html/rd5e6b1656ac80252a9a7d473b36b6227da91d07d86d4ba4bee10df66%40%3Cdev.iceberg.apache.org%3E)
>> :
>>
>> 1. Verify the signature: OK
>> 2. Verify the checksum: OK
>> 3. Untar the archive tarball: OK
>> 4. Run RAT checks to validate license headers: RAT checks passed
>> 5. Build and test the project: all unit tests passed.
>>
>> +1 (non-binding).
>>
>> On Fri, Jul 10, 2020 at 9:46 AM Ryan Blue 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> I propose the following RC to be released as the official Apache
>>> Iceberg 0.9.0 release.
>>>
>>> The commit id is 4e66b4c10603e762129bc398146e02d21689e6dd
>>> * This corresponds to the tag: apache-iceberg-0.9.0-rc5
>>> * https://github.com/apache/iceberg/commits/apache-iceberg-0.9.0-rc5
>>> * https://github.com/apache/iceberg/tree/4e66b4c1
>>>
>>> The release tarball, signature, and checksums are here:
>>> *
>>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.9.0-rc5/
>>>
>>> You can find the KEYS file here:
>>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>>
>>> Convenience binary artifacts are staged in Nexus. The Maven
>>> repository URL is:
>>> *
>>> https://repository.apache.org/content/repositories/orgapacheiceberg-1008/
>>>
>>> This release includes support for Spark 3 and vectorized reads for
>>> flat schemas in Spark.
>>>
>>> Please download, verify, and test.
>>>
>>> Please vote in the next 72 hours.
>>>
>>> [ ] +1 Release this as Apache Iceberg 0.9.0
>>> [ ] +0
>>> [ ] -1 Do not release this because...
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>>
>>> --
>>> Best, Jingsong Lee
>>>
>>
>>
>> --
>> Best Regards
>>
>


Re: [VOTE] Graduate to a top-level project

2020-05-12 Thread Owen O'Malley
+1

On Tue, May 12, 2020 at 2:16 PM Ryan Blue  wrote:

> Hi everyone,
>
> I propose that the Iceberg community should petition to graduate from the
> Apache Incubator to a top-level project.
>
> Here is the draft board resolution:
>
> Establish the Apache Iceberg Project
>
> WHEREAS, the Board of Directors deems it to be in the best interests of
> the Foundation and consistent with the Foundation's purpose to establish
> a Project Management Committee charged with the creation and maintenance
> of open-source software, for distribution at no charge to the public,
> related to managing huge analytic datasets using a standard at-rest
> table format that is designed for high performance and ease of use.
>
> NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee
> (PMC), to be known as the "Apache Iceberg Project", be and hereby is
> established pursuant to Bylaws of the Foundation; and be it further
>
> RESOLVED, that the Apache Iceberg Project be and hereby is responsible
> for the creation and maintenance of software related to managing huge
> analytic datasets using a standard at-rest table format that is designed
> for high performance and ease of use; and be it further
>
> RESOLVED, that the office of "Vice President, Apache Iceberg" be and
> hereby is created, the person holding such office to serve at the
> direction of the Board of Directors as the chair of the Apache Iceberg
> Project, and to have primary responsibility for management of the
> projects within the scope of responsibility of the Apache Iceberg
> Project; and be it further
>
> RESOLVED, that the persons listed immediately below be and hereby are
> appointed to serve as the initial members of the Apache Iceberg Project:
>
>  * Anton Okolnychyi 
>  * Carl Steinbach   
>  * Daniel C. Weeks  
>  * James R. Taylor  
>  * Julien Le Dem
>  * Owen O'Malley
>  * Parth Brahmbhatt 
>  * Ratandeep Ratti  
>  * Ryan Blue
>
> NOW, THEREFORE, BE IT FURTHER RESOLVED, that Ryan Blue be appointed to
> the office of Vice President, Apache Iceberg, to serve in accordance
> with and subject to the direction of the Board of Directors and the
> Bylaws of the Foundation until death, resignation, retirement, removal
> or disqualification, or until a successor is appointed; and be it
> further
>
> RESOLVED, that the Apache Iceberg Project be and hereby is tasked with
> the migration and rationalization of the Apache Incubator Iceberg
> podling; and be it further
>
> RESOLVED, that all responsibilities pertaining to the Apache Incubator
> Iceberg podling encumbered upon the Apache Incubator PMC are hereafter
> discharged.
>
> Please vote in the next 72 hours.
>
> [ ] +1 Petition the IPMC to graduate to top-level project
> [ ] +0
> [ ] -1 Wait to graduate because . . .
> --
> Ryan Blue
>


Re: [DISCUSS] Graduating from the Apache Incubator

2020-05-11 Thread Owen O'Malley
+1 to graduation. It is exciting watching the project and its community
grow.

.. Owen

On Mon, May 11, 2020 at 11:26 AM Ryan Blue  wrote:

> Hi everyone,
>
> I think that Iceberg is about ready to graduate from the Apache Incubator.
> We now have 2 releases — that include convenience binaries — and have added
> 2 committers/PPMC members and 2 PPMC members from the original set of
> committers. We are seeing a steady rate of contributions from a diverse
> group of people and companies interested in Iceberg. Thank you all for your
> contributions and for being part of this community!
>
> The next step is to agree as a community that we would like to graduate.
> If you have any concerns about graduation, please raise them.
>
> Below is the draft resolution for the board to create an Apache Iceberg
> TLP. This is mostly boilerplate, but I’ve added 2 things:
>
>1. I’d like to volunteer to be the PMC chair of the project so I’ve
>added myself to the draft. Others are welcome to volunteer as well and we
>can decide as a community.
>2. The project description I filled in is: software related to
>“managing huge analytic datasets using a standard at-rest table format that
>is designed for high performance and ease of use”.
>
> Establish the Apache Iceberg Project
>
> WHEREAS, the Board of Directors deems it to be in the best interests of
> the Foundation and consistent with the Foundation's purpose to establish
> a Project Management Committee charged with the creation and maintenance
> of open-source software, for distribution at no charge to the public,
> related to managing huge analytic datasets using a standard at-rest
> table format that is designed for high performance and ease of use.
>
> NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee
> (PMC), to be known as the "Apache Iceberg Project", be and hereby is
> established pursuant to Bylaws of the Foundation; and be it further
>
> RESOLVED, that the Apache Iceberg Project be and hereby is responsible
> for the creation and maintenance of software related to managing huge
> analytic datasets using a standard at-rest table format that is designed
> for high performance and ease of use; and be it further
>
> RESOLVED, that the office of "Vice President, Apache Iceberg" be and
> hereby is created, the person holding such office to serve at the
> direction of the Board of Directors as the chair of the Apache Iceberg
> Project, and to have primary responsibility for management of the
> projects within the scope of responsibility of the Apache Iceberg
> Project; and be it further
>
> RESOLVED, that the persons listed immediately below be and hereby are
> appointed to serve as the initial members of the Apache Iceberg Project:
>
>  * Anton Okolnychyi 
>  * Carl Steinbach   
>  * Daniel C. Weeks  
>  * James R. Taylor  
>  * Julien Le Dem
>  * Owen O'Malley
>  * Parth Brahmbhatt 
>  * Ratandeep Ratti  
>  * Ryan Blue
>
> NOW, THEREFORE, BE IT FURTHER RESOLVED, that Ryan Blue be appointed to
> the office of Vice President, Apache Iceberg, to serve in accordance
> with and subject to the direction of the Board of Directors and the
> Bylaws of the Foundation until death, resignation, retirement, removal
> or disqualification, or until a successor is appointed; and be it
> further
>
> RESOLVED, that the Apache Iceberg Project be and hereby is tasked with
> the migration and rationalization of the Apache Incubator Iceberg
> podling; and be it further
>
> RESOLVED, that all responsibilities pertaining to the Apache Incubator
> Iceberg podling encumbered upon the Apache Incubator PMC are hereafter
> discharged.
>
> --
> Ryan Blue
>


Re: [VOTE] Release Apache Iceberg 0.8.0-incubating RC2

2020-04-30 Thread Owen O'Malley
+1

   1. Checked signature and checksum
   2. Built and ran unit tests.
   3. Checked ORC version :)

On Monday, ORC released 1.6.3, so we should grab those fixes soon.

.. Owen

On Thu, Apr 30, 2020 at 12:34 PM Dongjoon Hyun 
wrote:

> +1.
>
> 1. Verified checksum, sig, and license
> 2. Built from source and ran UTs.
> 3. Ran some manual ORC write/read tests with Apache Spark 2.4.6-SNAPSHOT
> (as of today).
>
> Thank you, all!
>
> Bests,
> Dongjoon.
>
> On Thu, Apr 30, 2020 at 10:28 AM parth brahmbhatt <
> brahmbhatt.pa...@gmail.com> wrote:
>
>> +1. checks passed, did not observe the unit test failure.
>>
>> Thanks
>> Parth
>>
>> On Thu, Apr 30, 2020 at 9:13 AM Daniel Weeks  wrote:
>>
>>> +1 all checks passed
>>>
>>> On Thu, Apr 30, 2020 at 8:53 AM Anton Okolnychyi
>>>  wrote:
>>>
 That test uses many concurrent writes and I’ve seen cases when it led
 to deadlocks in our test HMS. I think HMS is capable of recovering on its
 own but that process can be slow in highly concurrent environments. There
 is a 2 min timeout in that test so it can potentially fail. I’ve seen a
 deadlock but 2 min was always enough for that test in my local env and
 internal/upstream build pipelines. If there is an environment that
 constantly or frequently hits this problem, it would be great to check
 debug logs.

 I am +1 on releasing RC2. I checked it locally.

 - Anton

 On 30 Apr 2020, at 02:52, Mass Dosage  wrote:

 The build for RC2 worked fine for me, I didn't get a failure on
 "TestHiveTableConcurrency". Perhaps there is some kind of race condition in
 the test? I have seen timeout errors like that when I ran tests on an
 overloaded machine, could that have been the case?

 On Thu, 30 Apr 2020 at 08:32, OpenInx  wrote:

> I checked the rc2, seems the TestHiveTableConcurrency is broken, may
> need to fix it.
>
> 1. Download the tarball and check the signature & checksum: OK
> 2. license checking: RAT checks passed.
> 3. Build and test the project (java8):
> org.apache.iceberg.hive.TestHiveTableConcurrency >
> testConcurrentConnections FAILED
> java.lang.AssertionError: Timeout
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at
> org.apache.iceberg.hive.TestHiveTableConcurrency.testConcurrentConnections(TestHiveTableConcurrency.java:106)
>
> On Thu, Apr 30, 2020 at 9:29 AM Ryan Blue  wrote:
>
>> Hi everyone,
>>
>> I propose the following candidate to be released as the official
>> Apache Iceberg 0.8.0-incubating release.
>>
>> The commit id is 8c05a2f5f1c8b111c049d43cf15cd8a51920dda1
>> * This corresponds to the tag: apache-iceberg-0.8.0-incubating-rc2
>> *
>> https://github.com/apache/incubator-iceberg/commits/apache-iceberg-0.8.0-incubating-rc2
>> * https://github.com/apache/incubator-iceberg/tree/8c05a2f5
>>
>> The release tarball, signature, and checksums are here:
>> *
>> https://dist.apache.org/repos/dist/dev/incubator/iceberg/apache-iceberg-0.8.0-incubating-rc2/
>>
>> You can find the KEYS file here:
>> * https://dist.apache.org/repos/dist/dev/incubator/iceberg/KEYS
>>
>> Convenience binary artifacts are staged in Nexus. The Maven
>> repository URL is:
>> *
>> https://repository.apache.org/content/repositories/orgapacheiceberg-1006/
>>
>> This release contains many bug fixes and several new features:
>> * Actions to remove orphaned files and to optimize metadata for query
>> performance
>> * Support for ORC data files
>> * Snapshot cherry-picking
>> * Incremental scan planning based on table history
>> * In and notIn expressions
>> * An InputFormat for writing MR jobs
>>
>> Please download, verify, and test.
>>
>> Please vote in the next 72 hours.
>>
>> [ ] +1 Release this as Apache Iceberg 0.8.0-incubating
>> [ ] +0
>> [ ] -1 Do not release this because...
>>
>> --
>> Ryan Blue
>>
>



Re: [DISCUSS] September report

2019-09-06 Thread Owen O'Malley
On Fri, Sep 6, 2019 at 12:19 AM Justin Mclean  wrote:

> So why does the project think it's ready to graduate? Mentors do you think
> the project is ready to graduate?
>

It has to make a release or two, but I agree with Ryan that it is approaching
graduation. The project entered Apache with five Apache members from
different companies. It has grown the community to include a few more
companies. I think it is doing great.

.. Owen


Re: [DISCUSS] September report

2019-09-06 Thread Owen O'Malley
On Wed, Sep 4, 2019 at 4:55 PM Ryan Blue  wrote:

> Hi everyone,
>
> Here's a draft of this month's report to the IPMC. Please reply with
> comments if you'd like to add anything!
>
> rb
>
> ## Iceberg
>
> Iceberg is a table format for large, slow-moving tabular data.
>
> Iceberg has been incubating since 2018-11-16.
>
> ### Three most important unfinished issues to address before graduating:
>
> 1. Make the first Apache release. (https://github.com/apache/incubator-
> iceberg/milestone/1)
> 2. Grow the Iceberg community
> 3. Add more committers and PPMC members
>
> ### Are there any issues that the IPMC or ASF Board need to be aware of?
>
> No issues.
>
> ### How has the community developed since the last report?
>
> The community continues to grow steadily. In the last month:
> * 59 pull requests have been merged
> * 17 people contributed the merged PRs
> * 18 issues have been closed, 22 issues were opened
>

Presentations were given at:
* Berlin Buzzwords (June 2019)
* ApacheCon NA (Sep 2019)

Iceberg is being used in production at Netflix on huge tables, up to 25
petabytes.


> For comparison, the last report had 74 pull requests merged over 3 months.
>
> ### How has the project developed since the last report?
>
> * License documentation has been completed for the Java project,
> unblocking the first release
> * Added more documentation to iceberg.apache.org
> * Started vectorized read branch with significantly better performance
> * Added metadata tables
> * Added configuration to control statistics and truncate long values
> * Improved Hive Metastore integration
> * A working python read path has been submitted in PRs
>
> ### How would you assess the podling's maturity?
>
>   - [ ] Initial setup
>   - [x] Working towards first release
>   - [x] Community building
>   - [x] Nearing graduation
>   - [ ] Other:
>
> ### Date of last release:
>
> * No release yet
>
> ### When were the last committers or PPMC members elected?
>
> * Anton Okolnychyi was added 30 August 2019
>
> ### Have your mentors been helpful and responsive?
>
> Yes
>
+1 to the report.

> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [DISCUSS] Implementation strategies for supporting Iceberg tables in Hive

2019-08-07 Thread Owen O'Malley



> On Jul 24, 2019, at 22:52, Adrien Guillo  
> wrote:
> 
> Hi Iceberg folks,
> 
> In the last few months, we (the data infrastructure team at Airbnb) have been 
> closely following the project. We are currently evaluating potential 
> strategies to migrate our data warehouse to Iceberg. However, we have a very 
> large Hive deployment, which means we can’t really do so without support for 
> Iceberg tables in Hive.
> 
> We have been thinking about implementation strategies. Here are some thoughts 
> that we would like to share with you:
> 
> – Implementing a new `RawStore`

My thought would be to use the embedded metastore with an iceberg rawstore. 
That enables the client to do the work rather than pushing it to an external 
metastore. 

I expect that some new users would be able to just use iceberg as their only 
metastore, but that others would want to have some databases in hive layout and 
others in iceberg. We could use a delegating raw store for that.
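A delegating store of the kind suggested here might dispatch on database name. The sketch below is a toy model in Python of that routing idea (the real `RawStore` interface is Java and far larger; the class and method names here are illustrative only):

```python
class DelegatingStore:
    """Route metastore calls to a Hive-layout store or an Iceberg-backed
    store per database -- a toy model of the delegating-RawStore idea."""

    def __init__(self, hive_store, iceberg_store, iceberg_dbs):
        self.hive = hive_store
        self.iceberg = iceberg_store
        self.iceberg_dbs = set(iceberg_dbs)

    def _store_for(self, db):
        # Databases opted in to Iceberg layout go to the Iceberg store;
        # everything else keeps the classic Hive behavior.
        return self.iceberg if db in self.iceberg_dbs else self.hive

    def get_table(self, db, name):
        return self._store_for(db).get_table(db, name)
```

The point of the design is that a deployment can migrate database by database instead of all at once.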

> This is something that has been mentioned several times on the mailing list 
> and seems to indicate that adding support for Iceberg tables in Hive could be 
> achieved without client-side modifications. Does that mean that the Metastore 
> is the only process manipulating Iceberg metadata (snapshots, manifests)? 
> Does that mean that for instance the `listPartition*` calls to the Metastore 
> return the DataFiles associated with each partition? Per our understanding, 
> it seems that supporting Iceberg tables in Hive with this strategy will most 
> likely require to update the RawStore interface AND will require at least 
> some client-side changes. In addition, with this strategy the Metastore bears 
> new responsibilities, which contradicts one of the Iceberg design goals: 
> offloading more work to jobs and removing the metastore as a bottleneck. In 
> the Iceberg world, not much is needed from the Metastore: it just keeps track 
> of the metadata location and provides a mechanism for atomically updating 
> this location (basically, what is done in the `HiveTableOperations` class). 
> We would like to design a solution that relies  as little as possible on the 
> Metastore so that in future we have the option to replace our fleet of 
> Metastores with a simpler system.
> 
> 
> – Implementing a new `HiveStorageHandler`

With an iceberg raw store, I suspect that you might not need a storage handler 
and could go straight to an input/output format. You would probably need an 
input and output format for each of the storage formats: 
Iceberg{Orc,Parquet,Avro}{Input,Output}Format.

.. Owen
> 
> We are working on implementing custom `InputFormat` and `OutputFormat` 
> classes for Iceberg (more on that in the next paragraph) and they would fit 
> in nicely with the `HiveStorageHandler` and `HiveStoragePredicateHandler` 
> interfaces. However, the `HiveMetaHook` interface does not seem rich enough 
> to accommodate all the workflows, for instance no hooks run on `ALTER ...`  
> or `INSERT...` commands.
> 
> 
> 
> – Proof of concept
> 
> We set out to write a proof of concept that would allow us to learn and 
> experiment. We based our work on the 2.3 branch. Here’s the state of the 
> project and the paths we explored:
> 
> DDL commands
> We support some commands such as `CREATE TABLE ...`, `DESC ...`, `SHOW 
> PARTITIONS`. They are all implemented in the client and mostly rely on the 
> `HiveCatalog` class to do the work.
> 
> Read path
> We are in the process of implementing a custom `FileInputFormat` that 
> receives an Iceberg table identifier and a serialized expression 
> `ExprNodeDesc` as input. This is similar in a lot of ways to what you can 
> find in the `PigParquetReader` class in the `iceberg-pig` package or in 
> `HiveHBaseTableInputFormat` class in Hive.
> 
> 
> Write path
> We have made less progress in that front but we see a path forward by 
> implementing a custom `OutputFormat` that would keep track of the files that 
> are being written and gather statistics. Then, each task can dump this 
> information on HDFS. From there, the final Hive `MoveTask` can merge those 
> “pre-manifest” files to create a new snapshot and commit the new version of a 
> table.
> 
> 
> We hope that our observations will start a healthy conversation about 
> supporting Iceberg tables in Hive :)
> 
> 
> Cheers,
> Adrien


Re: Sort Spec

2019-07-18 Thread Owen O'Malley
I agree that we need to manage changes to the sort order, just like we need
to handle changes to the schema. Neither one should require rewriting data
immediately, but when data is compacted or restated, it could be sorted to
the new order.

.. Owen
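The direction this thread converges on — sort orders stored as table metadata, with each data file annotated with a sort_order_id rather than sort columns — could be encoded roughly as follows. This is a sketch; the field names are illustrative, not the final spec:

```python
# A hypothetical table-metadata fragment: each sort order gets an id, and
# each data file records which order (if any) it was actually written with.
table_metadata = {
    "default-sort-order-id": 2,
    "sort-orders": [
        {"order-id": 1,
         "fields": [{"source-column": "ts", "transform": "identity",
                     "direction": "asc"}]},
        {"order-id": 2,
         "fields": [{"source-column": "id", "transform": "bucket[16]",
                     "direction": "asc"},
                    {"source-column": "ts", "transform": "identity",
                     "direction": "desc"}]},
    ],
}

def files_sorted_by(files, order_id):
    """Select files eligible for sort-based optimizations. Never assume the
    table's current order was applied; trust only each file's annotation."""
    return [f for f in files if f.get("sort-order-id") == order_id]
```

This captures both conclusions below: the order can evolve without rewriting data, and readers must check per-file annotations instead of assuming the table-level order holds everywhere.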

On Thu, Jul 18, 2019 at 10:01 AM Ryan Blue  wrote:

> > This one seems really problematic. Too many important optimizations
> depend on the file sort order. Can we have the writer verify the sort order
> as the files are written
>
> Even if we did, when the desired sort order changes we can't just rewrite
> all of the data in the table. I think that this will lead to cases where
> performance degrades without regular maintenance, but I don't see a way to
> handle that other than degrading performance. Forcing a rewrite to sort
> data when the desired order changes just isn't possible, right?
>
> On Thu, Jul 18, 2019 at 7:33 AM Owen O'Malley 
> wrote:
>
>>
>>
>> On Thu, Jul 18, 2019 at 5:30 AM Anton Okolnychyi
>>  wrote:
>>
>>> Let me summarize what we talked here and follow up with a PR.
>>>
>>> - Iceberg should allow users to define a sort oder in its metadata that
>>> applies to partitions.
>>> - We should never assume the sort order is actually applied to all files
>>> in the table.
>>>
>>
>> This one seems really problematic. Too many important optimizations
>> depend on the file sort order. Can we have the writer verify the sort order
>> as the files are written?
>>
>> - Sort orders might evolve and change over time. When this happens,
>>> existing files will not be rewritten. Query engines should follow the
>>> updated sort order during subsequent writes. As a result, files within a
>>> table or partition can be sorted differently at a given point in time.
>>> - We should be able to define a sort order even for unpartitioned
>>> tables, as opposed to current Spark tables that allow a sort order only for
>>> bucketed tables.
>>> - SortOrder is separate from PartitionSpec.
>>> - SortOrder will rely on transformations to define complex sort orders.
>>> - Files will be annotated with sort_order_id instead of sort_columns. We
>>> keep the question of file_ordinal open for now.
>>> - To begin with, we will support asc/desc natural sort orders (UTF8
>>> ordering for Strings).
>>>
>>> Thanks,
>>> Anton
>>>
>>> On 16 Jul 2019, at 23:56, Ryan Blue  wrote:
>>>
>>> I agree that Iceberg metadata should include a way to configure a
>>> desired sort order. But I want to note that I don’t think that we can ever
>>> assume that it has been applied. Table configuration will evolve as use
>>> changes. We don’t want to require rewrites when a configuration gets
>>> updated, so an assumption should be that data files might not be sorted.
>>>
>>> Files that are sorted should indicate how they are sorted, so that
>>> optimizations are applied if the file’s metadata indicates it can be safely
>>> applied. For example, if both deletes and data rows are sorted the same
>>> way, you can merge the two streams instead of using a hash set to check
>>> whether a record has been deleted. I think this should rely on the delete
>>> file’s sort order matching the data file it is applied to.
>>>
>>> Should Iceberg allow users to define a sort spec only if the table is
>>> bucketed?
>>>
>>> No. In Iceberg, bucketing is just another partition transform.
>>>
>>> However, I think we need to consider what a sort order will mean. Here
>>> are a few observations:
>>>
>>>- Each file can have a sort order for its rows (Spark’s
>>>sortWithinPartitions, which sorts each task’s data)
>>>- Sorting is also used to cluster values across files so it makes
>>>sense for a table sort order to be applied within partitions (ORDER
>>>BY)
>>>- Multiple writes to the same partition are not expected to rewrite
>>>existing data, so a partition may only be partially sorted or may have
>>>multiple sorted file sets
>>>- Partitioning is independent from sorting. Even when partitioning
>>>is orthogonal to a sort order (i.e., bucketing), partitioning must still
>>>take precedence.
>>>
>>> My conclusion is that a configured sort order applies to partitions, not
>>> data across partitions. Again, bucketing is just another type of partition.
>>>
>>> How should Iceberg encode sort specs?
>>>
>>> I don’t thin

Re: Updates/Deletes/Upserts in Iceberg

2019-07-03 Thread Owen O'Malley
It works for me too. 

.. Owen

> On Jul 3, 2019, at 11:27, Anton Okolnychyi  
> wrote:
> 
> Works for me too.
> 
>> On 3 Jul 2019, at 19:09, Erik Wright  wrote:
>> 
>> That works for me.
>> 
>> On Wed, Jul 3, 2019 at 2:01 PM Ryan Blue  wrote:
>>> How about 9AM PDT on Friday, 5 July then?
>>> 
>>>> On Wed, Jul 3, 2019 at 10:55 AM Owen O'Malley  
>>>> wrote:
>>>> I'd like to call in, but I'm out Thursday. Friday would work except 11am 
>>>> to 1pm pdt.
>>>> 
>>>> .. Owen
>>>> 
>>>>> On Wed, Jul 3, 2019 at 10:42 AM Ryan Blue  
>>>>> wrote:
>>>>> I'm available Thursday and Friday this week as well, but it's a holiday 
>>>>> in the US so some people may be out. If there are no objections from 
>>>>> anyone that would like to attend, then I'm up for that.
>>>>> 
>>>>>> On Wed, Jul 3, 2019 at 10:40 AM Anton Okolnychyi  
>>>>>> wrote:
>>>>>> I apologize for the delay on my side. I’ll still have to go through the 
>>>>>> last emails. I am available on Thursday/Friday this week and would be 
>>>>>> great to sync.
>>>>>> 
>>>>>> Thanks,
>>>>>> Anton
>>>>>> 
>>>>>>> On 3 Jul 2019, at 01:29, Ryan Blue  wrote:
>>>>>>> 
>>>>>>> Sorry I didn't get back to this thread last week. Let's try to have a 
>>>>>>> video call to sync up on this next week. What days would work for 
>>>>>>> everyone?
>>>>>>> 
>>>>>>> rb
>>>>>>> 
>>>>>>>> On Fri, Jun 21, 2019 at 9:06 AM Erik Wright  
>>>>>>>> wrote:
>>>>>>>> With regards to operation values. Currently they are:
>>>>>>>> append: data files were added and no files were removed.
>>>>>>>> replace: data files were rewritten with the same data; i.e., 
>>>>>>>> compaction, changing the data file format, or relocating data files.
>>>>>>>> overwrite: data files were deleted and added in a logical overwrite 
>>>>>>>> operation.
>>>>>>>> delete: data files were removed and their contents logically deleted.
>>>>>>>> If deletion files (with or without data files) are appended to the 
>>>>>>>> dataset, will we consider that an `append` operation? If so, if 
>>>>>>>> deletion and/or data files are appended, and whole files are also 
>>>>>>>> deleted, will we consider that an `overwrite`?
>>>>>>>> 
>>>>>>>> Given that the only apparent purpose of the operation field is to 
>>>>>>>> optimize snapshot expiration, the above seems to meet its needs. An 
>>>>>>>> incremental reader can also skip `replace` snapshots but no others. 
>>>>>>>> Once it decides to read a snapshot I don't think there's any 
>>>>>>>> difference in how it processes the data for append/overwrite/delete 
>>>>>>>> cases.
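The incremental-reader behavior described here — skip `replace` snapshots (same data, rewritten files) but process the others — reduces to a simple filter. A sketch, with snapshot records modeled as plain dicts:

```python
def snapshots_for_incremental_read(snapshots):
    """Incremental readers can skip snapshots whose operation is 'replace'
    because they rewrite existing data without logically changing it;
    append/overwrite/delete snapshots must still be processed."""
    return [s for s in snapshots if s["operation"] != "replace"]
```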
>>>>>>>> 
>>>>>>>>> On Thu, Jun 20, 2019 at 8:55 PM Ryan Blue  wrote:
>>>>>>>>> I don’t see that we need [sequence numbers] for file/offset-deletes, 
>>>>>>>>> since they apply to a specific file. They’re not harmful, but they 
>>>>>>>>> don’t seem relevant.
>>>>>>>>> 
>>>>>>>>> These delete files will probably contain a path and an offset and 
>>>>>>>>> could contain deletes for multiple files. In that case, the sequence 
>>>>>>>>> number can be used to eliminate delete files that don’t need to be 
>>>>>>>>> applied to a particular data file, just like the column equality 
>>>>>>>>> deletes. Likewise, it can be used to drop the delete files when there 
>>>>>>>>> are no data files with an older sequence number.
>>>>>>>>> 
>>>>>>>>> I don’t understand the purpose of the min sequence number, nor what 
>>>>>>>>> the “min data seq” is.
>>>>>>>>> 
>>>>>>>>> Min sequence number would be used for pruning delete files without 
>>>>&
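The pruning rule being debated in this thread — skip delete files that cannot apply to a given data file, and drop delete files no live data file is older than — comes down to sequence-number comparisons. A sketch of the idea only; the exact >= versus > boundary was still under discussion here, so the comparisons below are an assumption:

```python
def candidate_deletes(data_file_seq, delete_files):
    """Keep only delete files that might apply to a data file: a delete
    written at sequence number s cannot affect data added after it."""
    return [d for d in delete_files if d["seq"] >= data_file_seq]

def droppable(delete_file_seq, data_files):
    """A delete file can be dropped once every live data file is strictly
    newer than it (no older data remains for it to act on)."""
    return all(f["seq"] > delete_file_seq for f in data_files)
```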

Re: IPMC report draft for July 2019

2019-07-03 Thread Owen O'Malley
+1 from me too. 

.. Owen

> On Jul 2, 2019, at 17:32, Ryan Blue  wrote:
> 
> Hi everyone,
> 
> Here's a draft report for this quarter. Please comment and I'll post the 
> final version tomorrow. Thanks!
> 
> rb
> 
> 
> Iceberg
> 
> Iceberg is a table format for large, slow-moving tabular data.
> 
> Iceberg has been incubating since 2018-11-16.
> 
> Three most important issues to address in the move towards graduation:
> 
>   1. Update build for Apache release, add LICENSE/NOTICE to Jars.
>   2. Make the first Apache release.
> (https://github.com/apache/incubator-iceberg/milestone/1)
>   3. Grow the Iceberg community
> 
> Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to be
> aware of?
> 
>   * No issues that require attention.
> 
> How has the community developed since the last report?
> 
>   * Community growth has continued with several new contributors and reviewers
>   * Community has decided on style and added checking to CI for most modules
>   * Community has started work on extending the spec for new use cases
> 
> How has the project developed since the last report?
> 
>   * Much more content on iceberg.apache.org has been added
>   * 74 pull requests have been merged, many reviewed by new community members
>   * Work has begun to add row-level deletes and upserts to the format
>   * Added support for Spark streaming, a catalog API, and numerous bug fixes
>   * Contributors are reviewing code, submitting substantial features, and 
> improving dev practices
> 
> How would you assess the podling's maturity?
> Please feel free to add your own commentary.
> 
>   [ ] Initial setup (name clearance approval pending)
>   [X] Working towards first release
>   [X] Community building
>   [ ] Nearing graduation
>   [ ] Other:
> 
> Date of last release:
> 
>   None yet
> 
> When were the last committers or PPMC members elected?
> 
>   None yet
> 
> Have your mentors been helpful and responsive or are things falling
> through the cracks? In the latter case, please list any open issues
> that need to be addressed.
> 
>   Yes.
> 
> Signed-off-by:
> 
>   [X](iceberg) Ryan Blue
>  Comments: I wrote the first pass of the report.
>   [ ](iceberg) Julien Le Dem
>  Comments:
>   [ ](iceberg) Owen O'Malley
>  Comments:
>   [ ](iceberg) James Taylor
>  Comments:
>   [ ](iceberg) Carl Steinbach
>  Comments:
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: Updates/Deletes/Upserts in Iceberg

2019-06-12 Thread Owen O'Malley


> On May 21, 2019, at 1:31 PM, Jacques Nadeau  wrote:
> 
> The main thing I'm talking about is how you target a deletion across time. If 
> you have a file A, and you want to delete record X in A, you define delete 
> A.X. At the same time, another process may be compacting A into A'. In so 
> doing, the position of A.X in A' is something other than X.

I would argue that this is backwards. This argues that compactions need a lock 
so that the delete either happens before or after the compaction. If it happens 
before, the delete is incorporated into the new version of the file. If it 
happens afterwards, the delete is using the new version of the file.

> At this point, the deletion needs to be rerun against A' so that we can 
> ensure that the deletion is propagated forward. If the only thing you have is 
> A.X, you need to have way from of getting to the same location in A'. You 
> should be able to take the delta file that lists the delete of A.2 and apply 
> it directly to A' without having to also consult A. If you didn't need to 
> solve this number, then you could simply use A.X as opposed to the key of A.X 
> in your delta files.

I’d much prefer using file/row# as the reference for the synthetic key for the 
deletes. Thus, it would be file 1: rows 100, 200, and 300. That makes it clear 
that the delta can only be applied to a given version of the file. This has the 
added advantage that you know how many rows are left in the file.

.. Owen

> 
> Synthetic seems relative. If the synthetic key is client-supplied, in what 
> way is it relevant to Iceberg whether it is synthetic vs. natural? By calling 
> it synthetic within Iceberg there is a strong implication that it is the 
> implementation that generates it (the filename/position key suggests that). 
> If it's the client that supplies it, it _may_ be synthetic (from the point of 
> view of the overall data model; i.e. a customer key in a database vs. a 
> customer ID that shows up on a bill) but from Iceberg's case that doesn't 
> matter. Only the unicity constraint does.
> 
> I agree with the main statement here: the only real requirement is that keys 
> need to be unique across all existing snapshots. There could be two 
> generators: one that uses Iceberg-internal behavior to generate keys and one 
> that is user definable. While there could be a third that uses an existing 
> field (or set of fields) to define the key, I think we should probably avoid 
> implementing it, as it has a whole other set of problems that are best left 
> outside of Iceberg's area of concern.
> 



Re: Updates/Deletes/Upserts in Iceberg

2019-06-12 Thread Owen O'Malley


> On May 15, 2019, at 12:54 PM, Ryan Blue  wrote:
> 
> 2. Iceberg diff files should use synthetic keys
> 
> A lot of the discussion on the doc is about whether natural keys are 
> practical or what assumptions we can make or trade about them. In my opinion, 
> Iceberg tables will absolutely need natural keys for reasonable use cases. 
> And those natural keys will need to be unique. And Iceberg will need to rely 
> on engines to enforce that uniqueness.
> 
Agreed. One restriction that we should probably adopt is not allowing mutations 
of the natural keys or the partition/bucketing. Allowing such mutations while 
enforcing uniqueness would typically be expensive. Without those mutations, 
uniqueness is relatively cheap to enforce.

> But, there is a difference between table behavior and implementation. We can 
> use synthetic keys to implement the requirements of natural keys. Each row 
> should be identified by its file and position in a file. When deleting by a 
> natural key, we just need to find out what the synthetic key is and encode 
> that in the delete diff.
> 
+1

> With the physical representation using synthetic keys, we should also define 
> how to communicate a natural key constraint for a table. That way, writers 
> can fail if a write may violate the key constraints of a table.
> 
> 3. Synthetic keys should be based on filename and position
> 
> I think identifying the file in a synthetic key makes a lot of sense. This 
> would allow for delta file reuse as individual files are rewritten by a 
> “major” compaction and provides nice flexibility that fits with the format. 
> We will need to think through all the impacts, like how file relocation works 
> (e.g., move between regions) and the requirements for rewrites (must apply 
> the delta when rewriting).
> 
I’d recommend using the file and row number within the file. I believe Avro, 
ORC, and Parquet all track row numbers, so they provide a basically free 
synthetic id within each file. The critical feature is that each input split 
needs to know how many rows are above it in the file, so that delete files can 
be read efficiently.
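A sketch of that requirement (illustrative only, not Iceberg or ORC code): if 
each split knows its starting row ordinal within the file, it can cheaply pick 
out just the deleted positions that fall inside its own range.

```python
# Illustrative sketch: per-split row offsets let each split select the
# deleted row ordinals that land inside it.

def split_offsets(rows_per_split):
    """Starting row ordinal of each split, given per-split row counts."""
    offsets, total = [], 0
    for n in rows_per_split:
        offsets.append(total)
        total += n
    return offsets

def deletes_for_split(deleted_positions, start, num_rows):
    """Deleted ordinals in [start, start + num_rows), rebased to the split."""
    return [p - start for p in deleted_positions
            if start <= p < start + num_rows]

# A file read as three splits of 100, 150, and 50 rows.
offsets = split_offsets([100, 150, 50])      # [0, 100, 250]
deletes = [5, 120, 260, 275]                 # file-wide deleted ordinals
# The second split (rows 100-249) only needs its local delete at 20.
local = deletes_for_split(deletes, offsets[1], 150)   # [20]
```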

.. Owen

> Open questions
> 
> There are also quite a few remaining questions for a design:
> 
> Should Iceberg use insert diff files? (My initial answer is no)
> Should Iceberg require diff compaction? Iceberg could require one delete diff 
> per partition, for example. (My answer: no)
> Should data files store synthetic key position? If so, why?
> Should there be a dense format for deletes, or just a sparse format?
> What is the scope of a delete diff? At a minimum, partition. But does it make 
> sense to build ways to restrict scope further?
> 
> On Fri, May 10, 2019 at 11:27 AM Anton Okolnychyi 
>  wrote:
> We did take a look at Hudi. The overall design seems to be pretty complicated 
> and, unfortunately, I didn’t have time to explore every detail.
> 
> Here is my understanding (correct me if I am wrong):
> 
> - Hudi has RECORD_KEY, which is expected to be unique.
> - Hudi has PRECOMBINED_KEY, which is used to pick only one row in the 
> incoming batch if there are multiple rows with the same key. As I understand, 
> this isn't used on reads. It is used on writes to deduplicate rows with 
> identical keys within one incoming batch. For example, if we are inserting 10 
> records and two rows have the same key, PRECOMBINED_KEY will be used to pick 
> up only one row.
> - Once Hudi ensures the uniqueness of RECORD_KEY within the incoming batch, 
> it loads the Bloom filter index from all existing Parquet files in the 
> involved partitions (meaning, the partitions spanned by the input batch) and 
> tags each record as either an update or an insert by mapping the incoming 
> keys to existing files for updates. At this point, it seems to rely on a join.
> 
> Is my understanding correct? If so, do we want to consider joins on write? We 
> mentioned this technique as one way to ensure the uniqueness of natural keys 
> but we were concerned about the performance. Also, does Hudi support 
> record-level updates? 
> 
> Thanks,
> Anton
> 
>> On 10 May 2019, at 18:22, Erik Wright wrote:
>> 
>> Thanks for putting this forward.
>> 
>> Another term for the "lazy" approach would be "merge on read".
>> 
>> My team has built something internally that uses merge-on-read internally 
>> but uses an "Eager" materialization for publication to Presto. Roughly, we 
>> maintain a table metadata file that looks a bit like Iceberg's and tracks 
>> the "live" version of each partition as it is updated over time. We are 
>> looking into a solution that will allow us to push the merge-on-read all the 
>> way to Presto (and other consumers), and adding Merge-On-Read to Iceberg is 
>> one of the approaches we are considering.
>> 
>> It's worth noting that Hudi does have support for upserts/deletes as well, 
>> so that's another model to consider.
>> 
>> On Fri, May 10, 2019 at 8:30 AM Miguel Miranda 

Re: Approaching Vectorized Reading in Iceberg ..

2019-05-28 Thread Owen O'Malley
On Fri, May 24, 2019 at 8:28 PM Ryan Blue  wrote:

> if Iceberg Reader was to wrap Arrow or ColumnarBatch behind an
> Iterator[InternalRow] interface, it would still not work right? Coz it
> seems to me there is a lot more going on upstream in the operator execution
> path that would be needed to be done here.
>
> There’s already a wrapper to adapt Arrow to ColumnarBatch, as well as an
> iterator to read a ColumnarBatch as a sequence of InternalRow. That’s what
> we want to take advantage of. You’re right that the first thing that Spark
> does is to get each row as InternalRow. But we still get a benefit from
> vectorizing the data materialization to Arrow itself. Spark execution is
> not vectorized, but that can be updated in Spark later (I think there’s a
> proposal).
>
From a performance viewpoint, this isn't a great solution. The row-by-row
approach will substantially hurt performance compared to the vectorized
reader. I've seen a 30% or more speed-up when removing row-by-row access. So
putting a row-by-row adapter in the middle of two vectorized
representations is pretty costly.
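A toy model of the two code paths (not Spark or Arrow APIs) shows where the 
adapter cost comes from: the columnar path operates on whole column vectors, 
while the row iterator in the middle forces a per-row object to be 
materialized even though both ends are columnar.

```python
# Toy model: a "batch" is a dict mapping column name -> list of values.

def scan_vectorized(batches, column):
    """Operate directly on the column vectors."""
    return sum(sum(batch[column]) for batch in batches)

def rows_from_batch(batch):
    """Row-by-row adapter: materializes one tuple per row."""
    cols = list(batch.values())
    for row in zip(*cols):
        yield row

def scan_row_by_row(batches, column_index):
    """Same answer, but every row becomes a tuple first."""
    return sum(row[column_index]
               for batch in batches
               for row in rows_from_batch(batch))

batches = [{"id": [1, 2, 3], "v": [10, 20, 30]},
           {"id": [4, 5], "v": [40, 50]}]
assert scan_vectorized(batches, "v") == scan_row_by_row(batches, 1) == 150
```

Both paths return the same result; the row-by-row version just pays for a 
tuple per row, which is the overhead Owen is describing.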

.. Owen


Re: [DISCUSS] Draft report for January 2019

2019-01-07 Thread Owen O'Malley
+1

On Mon, Jan 7, 2019 at 11:55 AM Ryan Blue  wrote:

> Dev list,
>
> We missed the initial report deadline, but I went ahead and drafted this
> anyway. Mentors can still sign off on this until end of day tomorrow, here:
> https://wiki.apache.org/incubator/January2019. Please have a look.
>
> rb
>
>
>
> Iceberg is a table format for large, slow-moving tabular data.
>
> Iceberg has been incubating since 2018-11-16.
>
> Three most important issues to address in the move towards graduation:
>
>   1. Finish the name clearance and trademark agreement.
>   2. Make the first Apache release. (
> https://github.com/apache/incubator-iceberg/milestone/1)
>   3. Grow the Iceberg community
>
> Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to be
> aware of?
>
>   * Gitbox traffic is now going to issues@. The community was losing dev@
> subscribers
> because of the high volume of traffic from Gitbox. However, now all
> updates are sent
> to issues@. It would be nice to have emails from creation go to dev@,
> while updates
> and resolutions would go to issues@.
>   * The trademark agreement proposed by Netflix was not acceptable to the
> ASF. It
> would be helpful if the ASF published the terms that the ASF requires
> to avoid trial
> and error. Netflix is drafting a new agreement.
>
> How has the community developed since the last report?
>
>   * Moved gitbox notifications to avoid loss of dev@ subscribers
> (self-reported leaving dev@).
>   * New contributor activity: 3 new issues opened, 4 PRs submitted
>   * 5 PRs from non-committers merged
>   * 2 contributors started reviewing PRs
>   * New design doc proposed by a community contributor
>   * Moved issues from Netflix repository to Apache repository
>
> How has the project developed since the last report?
>
>   * Planned blockers for first release, 0.1.0, in milestone 1
>   * Partial python implementation submitted
>   * Manifest listing file added to the spec and implementation committed
> (blocker for initial release). Resulted in a significant improvement in
> query planning time for large tables.
>   * Abstracted file IO API to support community use cases
>   * Reviewing community proposal for external plugins to support file-level
> encryption
>   * Added doc strings to schemas
>
> How would you assess the podling's maturity?
> Please feel free to add your own commentary.
>
>   [X] Initial setup (name clearance pending)
>   [X] Working towards first release
>   [ ] Community building
>   [ ] Nearing graduation
>   [ ] Other:
>
> Date of last release:
>
>   None yet
>
> When were the last committers or PPMC members elected?
>
>   None yet
>
> Have your mentors been helpful and responsive or are things falling
> through the cracks? In the latter case, please list any open issues
> that need to be addressed.
>
>   Last month was December, so traffic has been low and both PPMC members
> and
>   mentors were slow to respond. This is not abnormal, but the PPMC missed
> the
>   deadline to file this report. We will ensure this doesn't recur.
>
> Signed-off-by:
>
>   [X](iceberg) Ryan Blue
>  Comments: I wrote the first pass of the report, but after the
> deadline.
>   [ ](iceberg) Julien Le Dem
>  Comments:
>   [ ](iceberg) Owen O'Malley
>  Comments:
>   [ ](iceberg) James Taylor
>  Comments:
>   [ ](iceberg) Carl Steinbach
>  Comments:
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: project report

2018-12-04 Thread Owen O'Malley
Go ahead and edit the wiki

https://wiki.apache.org/incubator/December2018

I'd suggest that we do a source-only release and after the release passes,
push the binaries into Maven central.

On Tue, Dec 4, 2018 at 4:43 PM Ryan Blue  wrote:

> Looks good to me!
>
> We might want to note that the codebase has been updated with Apache
> license headers and now complies with ASF guidelines for a source release.
>
> We still need to cut over to org.apache package names instead of
> com.netflix, which I think we should do before the first release. We should
> also decide whether to do a source-only first release or to go through the
> pain of publishing convenience binaries with their own LICENSE and NOTICE
> content.
>
> rb
>
> On Tue, Dec 4, 2018 at 3:37 PM Owen O'Malley 
> wrote:
>
> > I wrote a first pass of the report for the Apache board.
> >
> > Iceberg
> > >
> > > Iceberg is a table format for large, slow-moving tabular data.
> > >
> > > Iceberg has been incubating since 2018-11-16.
> > >
> > > Three most important issues to address in the move towards graduation:
> > >
> > >   1. Get the SGA accepted.
> > >   2. Finish the name clearance.
> > >   3. Make the first Apache release.
> > >
> > > Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to be
> > > aware of?
> > >
> > >   * Gitbox integration has helped a lot, although it is frustrating
> that
> > > the team members are not allowed to configure the project and must
> go
> > > through infra for every change.
> > >   * The traffic on the dev list from Github pull requests and issues is
> > > pretty heavy. It would be nice to have emails from creation go to
> > dev@
> > > ,
> > > while updates and resolutions would go to issues@.
> > >
> > > How has the community developed since the last report?
> > >
> > >   This is the first report.
> > >
> > > How has the project developed since the last report?
> > >
> > >   This is the first report.
> > >
> > > How would you assess the podling's maturity?
> > > Please feel free to add your own commentary.
> > >
> > >   [X] Initial setup
> > >   [ ] Working towards first release
> > >   [ ] Community building
> > >   [ ] Nearing graduation
> > >   [ ] Other:
> > >
> > > Date of last release:
> > >
> > >   None yet
> > >
> > > When were the last committers or PPMC members elected?
> > >
> > >   None yet
> > >
> > > Have your mentors been helpful and responsive or are things falling
> > > through the cracks? In the latter case, please list any open issues
> > > that need to be addressed.
> > >
> > >   We're working through the issues as they come up.
> > >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: merge-on-read?

2018-11-28 Thread Owen O'Malley
For Hive's ACID, we started with deltas that had three options per row:
insert, delete, edit. Since that didn't enable predicate push-down in the
common case where there are a large number of inserts, we went to the model
of just using inserts and deletes in separate files. Queries that modify
tables delete the old row and insert a new one. That allowed us to get good
performance for read, where it is most critical. There are some important
optimizations: for example, with a small number of deletes, you can read all
of the deletes into memory and close that file.
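A minimal sketch of that small-deletes optimization (hypothetical structures, 
not Hive's actual ACID reader): the delete delta is read fully into an 
in-memory set once, its file is closed, and base rows are filtered against the 
set as they stream by.

```python
# Sketch of merge-on-read with inserts and deletes in separate files.
# Rows carry a (file_id, row_id) identity; all names are illustrative.

def merge_on_read(base_rows, delete_keys):
    """Stream base rows, skipping any whose key is in the delete set."""
    deleted = set(delete_keys)  # small delete file, read fully into memory
    for key, row in base_rows:
        if key not in deleted:
            yield row

base = [((1, 0), "alice"), ((1, 1), "bob"), ((1, 2), "carol")]
deletes = [(1, 1)]            # the delete file: keys of deleted rows
assert list(merge_on_read(base, deletes)) == ["alice", "carol"]
```

An update under this model would appear as a delete of the old key plus an 
insert of the new row, exactly as described above.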

.. Owen

On Wed, Nov 28, 2018 at 1:14 PM Erik Wright 
wrote:

> Those are both really neat use cases, but the one I had in mind was what
> Ryan mentioned. It's something that Hoodie apparently supports or is
> building support for, and it's an important use case for the systems that
> my colleagues and I are building.
>
> There are three scenarios:
>
>- An Extract system that is receiving updates/deletes from a source
>system. We wish to capture them as quickly as possible and make them
>available to users without having to restate the affected data files.
> The
>update patterns are not anything that can be addressed with
> partitioning.
>- A Transform platform that is running a graph of jobs. For some jobs
>that are rebuilt from scratch, we would like to compress the output
> without
>losing the history.
>- A Transform / Load system that is building tables on GCS and
>registering them in Hive for querying by Presto. This system is
>incrementally updating views, and while some of those views are
>event-oriented (with most updates clustered in recent history) some of
> them
>are not and in those cases there is not partitioning algorithm that will
>prevent us from updating virtually all partitions in every update.
>
> We have one example of an internal solution but would prefer something less
> bespoke. That system works as follows:
>
>1. For each dataset, unique key columns are defined.
>2. Datasets are partitioned (not necessarily by anything in the key).
>3. Upserts/deletes are captured in a mutation set.
>4. The mutation set is used to update affected partitions:
>   1. Identify the previous/new partition for each upserted/deleted row.
>   2. Open the affected partitions, drop all rows matching an
>   upserted/deleted key.
>   3. Append all upserts.
>   4. Write out the result.
>5. We maintain an index (effectively an Iceberg snapshot) that says
>which partitions come from where (we keep the ones that are unaffected
> from
>the previous dataset version and add in the updated ones).
>
> This data is loaded into Presto and our current plan is to update it by
> registering a view in Presto that applies recent mutation sets to the
> latest merged version on the fly.
>
> So to build this in Iceberg we would likely need to extend the Table spec
> with:
>
>- An optional unique key specification, possibly composite, naming one
>or more columns for which there is expected to be at most one row per
>unique value.
>- The ability to indicate in the snapshot that a certain set of
>manifests are "base" data while other manifests are "diffs".
>- The ability in a "diff" manifest to indicate files that contain
>"deleted" keys (or else the ability in a given row to have a special
> column
>that indicates that the row is a "delete" and not an "upsert")
>- "diff" manifests would need to be ordered in the snapshot (as multiple
>"diff" manifests could affect a single row and only the latest of those
>takes effect).
>
> Obviously readers would need to be updated to correctly interpret this
> data. And there is all kinds of supporting work that would be required in
> order to maintain these (periodically collapsing diffs into the base,
> etc.).
>
> Is this something for which PRs would be accepted, assuming all of the
> necessary steps to make sure the direction is compatible with Iceberg's
> other use-cases?
>
> On Wed, Nov 28, 2018 at 1:14 PM Owen O'Malley 
> wrote:
>
> > I’m not sure what use case Erik is looking for, but I’ve had users that
> > want to do the equivalent of HBase’s column families. They want some of the
> > columns to be stored separately and then merged together on read. The
> > requirements would be that there is a 1:1 mapping between rows in the
> > matching files and stripes.
> >
> > It would look like:
> >
> > file1.orc: struct file2.orc:
> > struct
> >
> > It would let them leave the stable information and only re-write the
> >

Re: merge-on-read?

2018-11-28 Thread Owen O'Malley
I’m not sure what use case Erik is looking for, but I’ve had users that want to 
do the equivalent of HBase’s column families. They want some of the columns to 
be stored separately and then merged together on read. The requirement would be 
that there is a 1:1 mapping between rows in the matching files and stripes.

It would look like:

file1.orc: struct file2.orc: 
struct

It would let them leave the stable information and only re-write the second 
column family when the information in the mutable column family changes. It 
would also support use cases where you add data enrichment columns after the 
data has been ingested. 

From there it is easy to imagine having a replace operator where file2’s 
version of a column replaces file1’s version. 

.. Owen

> On Nov 28, 2018, at 9:44 AM, Ryan Blue  wrote:
> 
> What do you mean by merge on read?
> 
> A few people I've talked to are interested in building delete and upsert
> features. Those would create files that track the changes, which would be
> merged at read time to apply them. Is that what you mean?
> 
> rb
> 
> On Tue, Nov 27, 2018 at 12:26 PM Erik Wright
>  wrote:
> 
>> Has any consideration been given to the possibility of eventual
>> merge-on-read support in the Iceberg table spec?
>> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix



Issue list?

2018-11-27 Thread Owen O'Malley
All,
   As we move over to Apache infrastructure, we need to decide what works for 
the community. The dev list is getting a lot of traffic and is probably 
intimidating to newcomers. Currently the notifications are:

Pull Requests and issue creation/comment/close -> dev@
Git commit -> commits@

One pattern that works for Hive is to do:

PR and issue creation -> dev@
PR and issue comments/close -> issues@
Git commit -> commits@

I don’t know what options Apache Infra gives us for notifications, but we could 
ask. What do people think?

.. Owen