Thanks for the detailed report! One more thing: we've now made a lot of progress integrating Alibaba Cloud (https://www.aliyun.com/). Please see https://github.com/apache/iceberg/projects/21 (thanks @xingbowu - https://github.com/xingbowu).
On Thu, Oct 21, 2021 at 11:30 PM Sam Redai <s...@tabular.io> wrote:

> Good Morning Everyone,
>
> Here are the minutes from our Iceberg Sync that took place on October
> 20th, 9am-10am PT. Please remember that anyone can join the discussion, so
> feel free to share the Iceberg-Sync
> <https://groups.google.com/g/iceberg-sync> google group with anyone who
> is seeking an invite. As usual, the notes and the agenda are posted in the
> live doc
> <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web>
> that's also attached to the meeting invitation.
>
> We covered a lot of topics... here we go!
>
> Top of the Meeting Highlights
>
> - Sort-based compaction: This is finished, reviewed, and merged. When
>   you compact data files, you can now also have Spark re-sort them, either
>   by the table's sort order or the sort order given when you create the
>   compaction job.
> - Spark build refactor: Thank you to Jack for getting us started on the
>   Spark build refactor, and thanks to Anton for reviewing and helping get
>   these changes in. We've gone with a variant of option 3 from our last
>   discussions, where we include all of the Spark modules in our build but
>   make it easy to turn them off. This way we can get the CI to run the
>   Spark, Hive, and Flink tests separately and only if necessary.
> - Delete files implementation for ORC: Thanks to Peter for adding
>   builders to store deletes in ORC (previously we could only store deletes
>   in Parquet or Avro). This means we now have support for all 3 formats for
>   this feature.
> - Flink update: We've updated Flink to 1.13, so we're back on a supported
>   version. 1.14 is out this week, so we can aim to move to that at some
>   point.
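To illustrate the sort-based compaction highlight above: compaction collects many small data files and rewrites them as fewer files, and the new feature additionally re-sorts the rows by a chosen sort order. This is only a toy Python model of that idea (all names here are illustrative; the real feature is Iceberg's Spark rewrite-data-files action with a sort strategy):

```python
# Toy model of sort-based compaction: rows from several small "files" are
# combined and rewritten in sorted order. Illustrative only - not Iceberg's API.
from typing import Any


def compact_sorted(files: list[list[dict[str, Any]]],
                   sort_key: str,
                   descending: bool = False) -> list[dict[str, Any]]:
    """Merge the rows of several small files and re-sort them by sort_key."""
    rows = [row for f in files for row in f]
    return sorted(rows, key=lambda r: r[sort_key], reverse=descending)


file_a = [{"id": 3}, {"id": 1}]
file_b = [{"id": 2}]
print(compact_sorted([file_a, file_b], sort_key="id"))
# → [{'id': 1}, {'id': 2}, {'id': 3}]
```

The `descending` flag stands in for the "sort order given when you create the compaction job" in the minutes; omitting it corresponds to using the table's own sort order.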
> Iceberg 0.12.1 Upcoming Patch Release (milestone
> <https://github.com/apache/iceberg/milestone/15?closed=1>)
>
> - Fix for the Parquet map projection bug
> - Fix for the Flink CDC bug
> - A few other fixes that we also want to get out to the community, so
>   we're going to start a release candidate as soon as possible
> - Kyle will start a thread in the general Slack channel, so please feel
>   free to mention any additional fixes that you want to see in this patch
>   release
>
> Snapshot Releases
>
> - Eduard will tackle adding snapshot releases
> - Our deploy.gradle file is set up to deploy to the snapshot repository
> - This may require certain credentials, so we may need to reach out to
>   the ASF infrastructure team
>
> Iceberg 0.13.0 Upcoming Release
>
> - There's agreement to switch to a time-based release schedule, so the
>   next release is roughly mid-November
> - Jack will cut a branch close to that time, and any features that aren't
>   in yet will be pushed to the next release
> - We agreed not to hold up releases to squeeze features in, and prefer
>   instead to aim for releasing sooner the next time
>
> Adding v3.2 to the Spark Build Refactoring
>
> - Russell and Anton will coordinate on dropping in a Spark 3.2 module
> - We currently have 3.1 in the `spark3` module. We'll move that out to
>   its own module and mirror what we do with the 3.2 module. (This will
>   enable cleaning up some mixed 3.0/3.1 code.)
>
> Merge on Read
>
> - Anton has a bunch of PRs ready to queue up to contribute their
>   internal implementation.
>   (Russell will work with him.)
> - This feature will allow for much lower write amplification
> - The expectation is that in Spark 3.3 we can rely on Spark's internal
>   merge on read
>
> Snapshot Tagging (design doc
> <https://docs.google.com/document/d/1PvxK_0ebEoX3s7nS6-LOJJZdBYr_olTWH9oepNUfJ-A/edit>)
> (PR #3104 <https://github.com/apache/iceberg/pull/3104>)
>
> - We just had a meeting on Monday about this and reached some conclusions
>   and design decisions, so anyone who is interested please take a look
> - Next steps are to add the feature to the stack; Jack already has a WIP
>   implementation in the table metadata class
>
> Delete Compaction (design doc
> <https://docs.google.com/document/d/1-EyKSfwd_W9iI5jrzAvomVw3w1mb_kayVNT7f2I-SUg>)
>
> - Discussion happening on 10/21 at 5pm ET / 5-6pm PT for anyone
>   interested (meeting link <https://meet.google.com/nxx-nnvj-omx>)
> - Some more discussion is needed to home in on a final design choice.
>   There are a few options that each have their own pros and cons.
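The merge-on-read and delete-compaction items above both revolve around position deletes: instead of rewriting a data file to remove rows, a delete file records (data file path, row position) pairs, and readers filter those rows out at scan time. A minimal sketch of that read-time merge, with illustrative structures rather than Iceberg's actual classes:

```python
# Toy sketch of merge-on-read with position deletes: deleted rows are recorded
# as (data_file_path, row_position) pairs and skipped when the table is read,
# so the data files themselves never have to be rewritten on delete.

def read_with_deletes(data_files: dict[str, list],
                      position_deletes: set) -> list:
    """Return all rows except those matched by a (file, position) delete."""
    live_rows = []
    for path, rows in data_files.items():
        for pos, row in enumerate(rows):
            if (path, pos) not in position_deletes:
                live_rows.append(row)
    return live_rows


data = {"f1.parquet": ["a", "b", "c"], "f2.parquet": ["d"]}
deletes = {("f1.parquet", 1)}            # deletes row "b" without touching f1
print(read_with_deletes(data, deletes))  # → ['a', 'c', 'd']
```

This is also why write amplification drops: a delete touches only a small delete file, and the cost of merging is paid at read time (which is what delete compaction then tries to keep bounded).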
> The New Source Interface for Flink (FLIP-27
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface>)
>
> - Eventually everything will move to this new source interface (Kafka is
>   already using it, and it will be the default in Flink 1.14)
> - A few PRs for Iceberg are out there pending review and merge (they may
>   not make the deadline for the next release, but that's ok)
>
> Encryption MVP
>
> - Just had a recent sync on this; we're currently waiting on a few
>   updates to the design:
>   - Flesh out how the new pushdown of encryption into ORC and Parquet
>     will work
>   - Need some people to review the stream-based encryption, particularly
>     around splittability
> - A few offline discussions are currently happening, and for the
>   interface we are expecting a few additional PRs separate from the main
>   encryption MVP PR
>
> Python Library Development
>
> - The high-level design discussions have concluded recently
> - We'll delay the top-level API discussions until some of the core is
>   implemented
> - We have a collection of issues created and a handful of engineers
>   working on them
>
> Iceberg Docsite Refactoring
>
> - A large refactoring is coming for the Iceberg docsite:
>   - Versioned docs (in the future we need to decide how to represent the
>     Python versions)
>   - Organized more by the persona of the visitor (Data Engineer, Systems
>     Engineer, etc.)
>   - Searchable
> - Expect a PR from Sam, ready for review by the end of this week or
>   early next week
>
> Row-Level Support in the Vectorized Reader (PR #3141
> <https://github.com/apache/iceberg/issues/3141>)
>
> - Yufei is working on this, and it's part of the effort for merge on read
> - PR #3287 <https://github.com/apache/iceberg/pull/3287> covers only
>   position deletes in Parquet
> - We should have something ready to add by next week
>
> View Spec (PR #3188 <https://github.com/apache/iceberg/pull/3188>)
>
> - There was a discussion on whether we should store just the SQL text
>   exactly as it was passed to the engine, or also include the parsed and
>   analyzed plan (which includes column resolution). In theory, the
>   resolved SQL text should be very useful, but its usefulness may be
>   limited to certain edge cases.
> - The broader discussion here is: should we allow multiple dialects
>   (Trino, Spark, etc.)?
>   - This adds complexity
>   - Time travel needs to be considered. What does time traveling a view
>     mean? If the underlying table is an Iceberg table we may be able to,
>     but even that would require "as of" time travel to allow time travel
>     across multiple tables.
>   - Time traveling schemas needs to be added
> - Agreement that we should not try to solve everything at once but break
>   this into smaller problems
> - Let's keep an eye on upcoming engine features to see if this will be
>   solved implicitly, and let's also refrain from over-engineering this
>
> That's it! Thanks everyone for the high level of participation, and
> enjoy the rest of your week!
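The "as of" time travel that comes up in the view discussion boils down to resolving a timestamp to the latest snapshot committed at or before that time. A minimal sketch of that lookup (names and structures are illustrative, not Iceberg's API; the real metadata keeps a snapshot log of commit timestamps and snapshot IDs):

```python
# Sketch of "as of" time travel: a query timestamp resolves to the most
# recent snapshot committed at or before that time. Illustrative only.
import bisect


def snapshot_as_of(history: list, ts: int) -> int:
    """history is [(commit_ts_ms, snapshot_id), ...] sorted by commit time."""
    times = [t for t, _ in history]
    i = bisect.bisect_right(times, ts)  # first commit strictly after ts
    if i == 0:
        raise ValueError("timestamp precedes the table's first snapshot")
    return history[i - 1][1]


history = [(1000, 101), (2000, 102), (3000, 103)]
print(snapshot_as_of(history, 2500))  # → 102
```

Time traveling a view across multiple underlying tables would mean applying this same resolution consistently to every table the view references, which is part of why the group chose to break the problem into smaller pieces.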