Re: Iceberg community sync - 2020-03-25

OpenInx Sat, 28 Mar 2020 01:28:16 -0700

> Ryan has concerns about blogs in docs - why not link to blogs on other
> platforms? We don’t want content to get stale or have the community
> “reviewing” content.
I meant that we could first create a page to collect all the design doc
links. Stale content is indeed a problem unless we update the doc for each
related change. I don't have a strong opinion about the reviewing
concern :-)


> Ryan: we’ll need reviewers because I’m not qualified. Will reach out to
> Steven Wu (Netflix sink author) and other people interested in Flink.
Steven did a great job; he's the perfect reviewer if he has the bandwidth.
There are also some Flink committers and PMC members on our Flink team; we
could ping them as well.

> Openinx brought up concerns about minimizing end-to-end latency
Agreed that we could implement the file/pos deletes and equality deletes
first. The off-line optimization seems reasonable. We have also had an
internal discussion about the e2e latency and have some ideas to minimize
it; maybe I could provide a simple doc to describe them. In any case, we
could push the file/pos and equality deletes forward first.

On Sat, Mar 28, 2020 at 8:54 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi everyone,
>
> Here are my notes from the discussion. These are based mainly on my
> memory, so feel free to correct or expand if you think it can be improved.
> Thanks!
>
> *Agenda*
>
>    - Cadence for syncs - every 2-4 weeks?
>    - 0.8.0 Java release
>    - Community building
>    - Flink source and sink status
>    - MR formats and Hive support status
>    - Security (authorization, data values in metadata)
>    - Row-level deletes (main discussion)
>
> *Discussion*:
>
>    - Sync cadence
>       - Ryan: with syncs alternating time zones, 4 weeks is too long, but
>       2 weeks is a lot for those of us attending all of them. How about 3
>       weeks?
>       - Consensus was every 3 weeks
>    - 0.8.0 Java release
>       - When should we target for the release? Consensus was for
>       Mid-April (3 weeks)
>       - What do we want in the release? Main outstanding features are ORC
>       support, Parquet vectorized reads, Spark/Hive changes
>       - Ideally will include ORC support, since it is close
>       - Hive version is 2.3 and should not block Hive work
>       - Vectorized reads are nice-to-have but should not block a release
>       - Can we disable consistent versions for Spark 2.4 and Spark 3.0
>       support in the same repo? Ryan will dig up build script with baseline
>       applied to only some modules, maybe we can disable it
>    - Community building
>       - Saisai suggested a Powered By page where we can post who is using
>       Iceberg in production. Great idea!
>       - Openinx suggested a blog section of the docs site
>       - Ryan has concerns about blogs in docs - why not link to blogs on
>       other platforms? We don’t want content to get stale or have the
>       community “reviewing” content.
>       - Owen: some blogs break links
>    - Flink source and sink status
>       - Tencent data lake team posted a sink based on Netflix skunkworks,
>       but needs to remove Netflix-specific features/dependencies
>       - Issues opened for work to get sink in
>       - Ryan: we’ll need reviewers because I’m not qualified. Will reach
>       out to Steven Wu (Netflix sink author) and other people interested in
>       Flink.
>       - Ryan: the Spark source is coming along, but the hardest part is
>       getting a stream of files to process from table state. Is that
>       something we want to share between Spark and Flink implementations?
>       - Probably want to share, if possible
>    - Skipped MR/Hive status and security (will start dev list thread) to
>    get to row-level deletes
>    - Row-level deletes roadmap:
>       - Ryan will be working on this more, with a doc for Spark MERGE
>       INTO interfaces coming soon
>       - This has been moving slowly because some parts, like sequence
>       numbers, require forward-breaking/v2 changes
>       - Owen suggested building two parallel write paths to be able to
>       write v1. Everyone agreed with this
>       - There are several projects that can be done by anyone and do not
>       require forward-breaking/v2 changes: delete file format readers,
>       writers, record iterator implementations to merge deletes (set-based,
>       merge-based), and specs for these once they are built
>       - Junjie offered to work on file/position delete files
>       - Equality delete merges are blocked on sort order addition to the
>       format
>       - Main blocking decision point is how to track delete files in
>       manifests, Ryan will start a dev list thread
>       - Openinx brought up concerns about minimizing end-to-end latency
>       for a use case with high write volume for equality deletes
>       - Ryan’s response was that this will likely require off-line
>       optimization: write equality deletes from Flink but rewrite in a more
>       efficient format (sorted, translated to file/position, etc.) in a
>       separate service. Enabling these services is the role of Iceberg,
>       which is an at-rest format. Other approaches put this complexity into
>       the writer, but it has to be done somewhere.
>       - Gautam: what about GDPR deletes?
>       - Ryan: GDPR deletes are a simpler case, where volume is much
>       lower. That brings us back to the roadmap: let’s focus on simpler
>       end-to-end use cases and get those done. Then we can work on scaling
>       them. First things are to get the formats defined and documented, get
>       a set-based delete filter implementation for equality deletes and a
>       merge-based one for file/position deletes, and to add sequence numbers.
>    - Thanks to everyone that attended! Will schedule the next sync for 3
>    weeks from now.
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
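
To make the two delete-filter strategies named in the notes a bit more concrete, here is a rough Python sketch. It is only an illustration under simplifying assumptions, not Iceberg code: rows are plain in-memory dicts from a single data file, the function names filter_equality_deletes and filter_position_deletes are hypothetical, and position deletes are assumed to arrive sorted in ascending order.

```python
from typing import Dict, Iterable, Iterator, Sequence, Tuple

Row = Dict[str, object]


def filter_equality_deletes(
    rows: Iterable[Row],
    deleted_keys: Iterable[Tuple[object, ...]],
    key_columns: Sequence[str],
) -> Iterator[Row]:
    """Set-based filter: hold all delete keys in a hash set and probe per row.

    Hypothetical helper; works when the delete keys fit in memory.
    """
    deleted = set(deleted_keys)
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in deleted:
            yield row


def filter_position_deletes(
    rows: Iterable[Row],
    deleted_positions: Iterable[int],
) -> Iterator[Row]:
    """Merge-based filter: walk rows and sorted delete positions in lockstep.

    Hypothetical helper; assumes deleted_positions is sorted ascending and
    refers to 0-based row positions within this one file.
    """
    deletes = iter(deleted_positions)
    next_delete = next(deletes, None)
    for pos, row in enumerate(rows):
        # Advance past delete positions already behind us (e.g. duplicates).
        while next_delete is not None and next_delete < pos:
            next_delete = next(deletes, None)
        if next_delete == pos:
            continue  # this row is deleted
        yield row
```

The set-based filter needs no ordering but keeps every delete key in memory, while the merge-based filter uses constant extra memory but depends on both inputs being ordered by position.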
