> Ryan has concerns about blogs in docs - why not link to blogs on other
> platforms? We don’t want content to get stale or have the community
> “reviewing” content.

I mean we could first create a page that collects all the design doc links. Stale content is indeed a problem unless we update the doc for each related change. I don't have a strong opinion about the reviewing comments :-)

> Ryan: we’ll need reviewers because I’m not qualified. Will reach out to
> Steven Wu (Netflix sink author) and other people interested in Flink.

Steven did a great job; he's the perfect reviewer if he has the bandwidth. There are also some Flink committers and PMC members on our Flink team whom we could ping.

> Openinx brought up concerns about minimizing end-to-end latency

Agreed that we could implement the file/pos deletes and equality deletes first. The off-line optimization seems reasonable. We have also had an internal discussion about the e2e latency and have some ideas to minimize it; maybe I could provide a simple doc to describe the idea. Anyway, we could push the file/pos and equality deletes forward first.

On Sat, Mar 28, 2020 at 8:54 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi everyone,
>
> Here are my notes from the discussion. These are based mainly on my
> memory, so feel free to correct or expand if you think it can be improved.
> Thanks!
>
> *Agenda*
>
>    - Cadence for syncs - every 2-4 weeks?
>    - 0.8.0 Java release
>    - Community building
>    - Flink source and sink status
>    - MR formats and Hive support status
>    - Security (authorization, data values in metadata)
>    - Row-level deletes (main discussion)
>
> *Discussion*:
>
>    - Sync cadence
>       - Ryan: with syncs alternating time zones, 4 weeks is too long, but
>         2 weeks is a lot for those of us attending all of them. How about
>         3 weeks?
>       - Consensus was every 3 weeks
>    - 0.8.0 Java release
>       - When should we target for the release? Consensus was for
>         Mid-April (3 weeks)
>       - What do we want in the release? Main outstanding features are ORC
>         support, Parquet vectorized reads, Spark/Hive changes
>       - Ideally will include ORC support, since it is close
>       - Hive version is 2.3 and should not block Hive work
>       - Vectorized reads are nice-to-have but should not block a release
>       - Can we disable consistent versions for Spark 2.4 and Spark 3.0
>         support in the same repo?
>         Ryan will dig up build script with baseline applied to only some
>         modules, maybe we can disable it
>    - Community building
>       - Saisai suggested a Powered By page where we can post who is using
>         Iceberg in production. Great idea!
>       - Openinx suggested a blog section of the docs site
>       - Ryan has concerns about blogs in docs - why not link to blogs on
>         other platforms? We don’t want content to get stale or have the
>         community “reviewing” content.
>       - Owen: some blogs break links
>    - Flink source and sinks status
>       - Tencent data lake team posted a sink based on Netflix skunkworks,
>         but needs to remove Netflix-specific features/dependencies
>       - Issues opened for work to get sink in
>       - Ryan: we’ll need reviewers because I’m not qualified. Will reach
>         out to Steven Wu (Netflix sink author) and other people interested
>         in Flink.
>       - Ryan: the Spark source is coming along, but the hardest part is
>         getting a stream of files to process from table state. Is that
>         something we want to share between Spark and Flink
>         implementations?
>       - Probably want to share, if possible
>    - Skipped MR/Hive status and security (will start dev list thread) to
>      get to row-level deletes
>    - Row-level deletes roadmap:
>       - Ryan will be working on this more, with a doc for Spark MERGE
>         INTO interfaces coming soon
>       - This has been moving slowly because some parts, like sequence
>         numbers, require forward-breaking/v2 changes
>       - Owen suggested building two parallel write paths to be able to
>         write v1.
>         Everyone agreed with this
>       - There are several projects that can be done by anyone and do not
>         require forward-breaking/v2 changes: delete file format readers,
>         writers, record iterator implementations to merge deletes
>         (set-based, merge-based), and specs for these once they are built
>       - Junjie offered to work on file/position delete files
>       - Equality delete merges are blocked on sort order addition to the
>         format
>       - Main blocking decision point is how to track delete files in
>         manifests, Ryan will start a dev list thread
>       - Openinx brought up concerns about minimizing end-to-end latency
>         for a use case with high write volume for equality deletes
>       - Ryan’s response was that this will likely require off-line
>         optimization: write equality deletes from Flink but rewrite in a
>         more efficient format (sorted, translated to file/position, etc.)
>         in a separate service. Enabling these services is the role of
>         Iceberg, which is an at-rest format. Other approaches put this
>         complexity into the writer, but it has to be done somewhere.
>       - Gautam: what about GDPR deletes?
>       - Ryan: GDPR deletes are a simpler case, where volume is much
>         lower. That brings us back to the roadmap: let’s focus on simpler
>         end-to-end use cases and get those done. Then we can work on
>         scaling them. First things are to get the formats defined and
>         documented, get a set-based delete filter implementation for
>         equality deletes and a merge-based one for file/position deletes,
>         and to add sequence numbers.
>    - Thanks to everyone that attended! Will schedule the next sync for 3
>      weeks from now.
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
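
For readers unfamiliar with the two delete-filter styles named in the notes (a set-based filter for equality deletes and a merge-based one for file/position deletes), here is a rough sketch of the general idea. It is not Iceberg's API: the function names, the dict-based row shape, and the in-memory inputs are all assumptions made for illustration, and it ignores sequence numbers entirely.

```python
from typing import Dict, Iterable, Iterator, Sequence, Tuple

Row = Dict[str, object]  # hypothetical row shape, for illustration only


def filter_equality_deletes(
    rows: Iterable[Row],
    deleted_keys: Iterable[Tuple[object, ...]],
    key_columns: Sequence[str],
) -> Iterator[Row]:
    """Set-based filter: hold every deleted key in a hash set and drop
    any row whose key columns match one of them."""
    deleted = set(deleted_keys)
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key not in deleted:
            yield row


def filter_position_deletes(
    rows: Iterable[Row],
    deleted_positions: Iterable[int],
) -> Iterator[Row]:
    """Merge-based filter: rows arrive in file order and deleted_positions
    is sorted ascending, so both streams are walked once, in step."""
    deletes = iter(deleted_positions)
    next_delete = next(deletes, None)
    for pos, row in enumerate(rows):
        # Skip delete entries that are behind the current row (e.g. duplicates).
        while next_delete is not None and next_delete < pos:
            next_delete = next(deletes, None)
        if next_delete == pos:
            continue  # this row position is deleted
        yield row


rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
kept_by_key = list(filter_equality_deletes(rows, [(2,)], ["id"]))
kept_by_pos = list(filter_position_deletes(rows, [0, 0, 2]))
```

The trade-off this is meant to show: the set-based filter needs all delete keys in memory but accepts rows in any order, while the merge-based filter needs only one pending delete position at a time but relies on sorted input.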