Thanks for writing this up. The views feature in the Netflix branch is great. Are there any plans to port it to Apache Iceberg?
On Fri, Apr 17, 2020 at 5:31 AM Ryan Blue <[email protected]> wrote:
> Here are my notes from yesterday's sync. As usual, feel free to add to
> this if I missed something.
>
> There were a couple of questions raised during the sync that we'd like to
> open up to anyone who wasn't able to attend:
>
> - Should we wait for the parallel metadata rewrite action before cutting
>   0.8.0 candidates?
> - Should we wait for ORC metrics before cutting 0.8.0 candidates?
>
> In the sync, we thought that it would be good to wait and get these in.
> Please reply to this if you agree or disagree.
>
> Thanks!
>
> *Attendees*:
>
> - Ryan Blue
> - Dan Weeks
> - Anjali Norwood
> - Jun Ma
> - Ratandeep Ratti
> - Pavan
> - Christine Mathiesen
> - Gautam Kowshik
> - Mass Dosage
> - Filip
> - Ryan Murray
>
> *Topics*:
>
> - 0.8.0 release blockers: actions, ORC metrics
> - Row-level delete update
> - Parquet vectorized read update
> - InputFormats and Hive support
> - Netflix branch
>
> *Discussion*:
>
> - 0.8.0 release
>   - Ryan: we planned to get a candidate out this week, but I think we may
>     want to wait on 2 things that are about ready
>   - First: Anton is contributing an action to rewrite manifests in
>     parallel that is close. Anyone interested? (Gautam is interested)
>   - Second: ORC is passing correctness tests, but doesn't have
>     column-level metrics. Should we wait for this?
>   - Ratandeep: ORC also lacks predicate push-down support
>   - Ryan: I think metrics are more important than PPD because PPD is task
>     side and metrics help reduce the number of tasks. If we wait on one,
>     I'd prefer to wait on metrics
>   - Ratandeep will look into whether he or Shardul can work on this
>   - General consensus was to wait for these features before getting a
>     candidate out
> - Row-level deletes
>   - Good progress in several PRs on adding the parallel v2 write path, as
>     Owen suggested last sync
>   - Junjie contributed an update to the spec for file/position delete
>     files
> - Parquet vectorized read
>   - Dan: flat schema reads are primarily waiting on reviews
>   - Dan: is anyone interested in complex type support?
>   - Gautam needs struct and map support. Arrow 0.14.0 doesn't support
>     maps
>   - Ryan (Murray): Arrow 0.17.0 will have lists, structs, and maps, but
>     not maps of structs
>   - Ryan (Blue): Because we have a translation layer in Iceberg to pass
>     off to Spark, we don't actually need support in Arrow. We are
>     currently stuck on 0.14.0 because of changes that prevent us from
>     avoiding a null check (see this comment
>     <https://github.com/apache/incubator-iceberg/pull/723/files#r367667500>)
> - InputFormat and Hive support
>   - Ratandeep: the generic (mapreduce) InputFormat is in, with hooks for
>     Pig and Hive; need to start working on the serde side and on building
>     a Hive StorageHandler; DDL support is missing
>   - Ryan: What DDL support?
>   - Ratandeep: Statements like ADD PARTITION
>   - Ryan: How would all of this work in Hive? It isn't clear what
>     components are needed right now: StorageHandler? RawStore?
>     HiveMetaHook?
>   - Ratandeep: Currently working on only the read path, which requires a
>     StorageHandler. The write path would be more difficult.
>   - Mass Dosage: Working on a (mapred) InputFormat for Hive in
>     iceberg-mr; started working on a serde in iceberg-hive. Interested in
>     writes, but not in the short or medium term
>   - Mass Dosage: The main problem is dependency conflicts between Hive
>     and Iceberg, mainly Guava
>   - Ryan: Anyone know a good replacement for Guava collections?
>   - Ryan: In Avro, we have a module that shades Guava
>     <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/pom.xml>
>     and has a class with references
>     <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/src/main/java/org/apache/avro/GuavaClasses.java>.
>     Then shading can minimize the shaded classes. We could do that here.
>   - Ryan: Is Jackson also a problem?
>   - Mass Dosage: Yes, and Calcite
>   - Ryan: Calcite probably isn't referenced directly, so we can hopefully
>     avoid the consistent-versions problem by excluding it
> - Netflix branch of Iceberg (with non-Iceberg additions)
>   - Ryan: We've published a copy of our current Iceberg 0.7.0-based
>     branch <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4>
>     for Spark 2.4 with DSv2 backported
>     <https://github.com/Netflix/spark>
>   - The purpose of this is to share non-Iceberg work that we use to
>     complement Iceberg, like views, catalogs, and DSv2 tables
>   - Views are SQL views
>     <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view>
>     that are stored and versioned like Iceberg metadata. This is how we
>     are tracking views for Presto and Spark (Coral integration would be
>     nice!). We are contributing the Spark DSv2 ViewCatalog to upstream
>     Spark
>   - Metacat is an open metastore project from Netflix. The metacat
>     package contains our metastore integration
>     <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/metacat>
>     for it.
>   - The batch package contains Spark and Hive table implementations for
>     Spark's DSv2
>     <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/batch>,
>     which we use for multi-catalog support.
>   - Gautam: how will migration to Iceberg's v2 format work for those of
>     us in production using v1?
>   - Ryan: Tables are explicitly updated to v2, and both v1 and v2 will be
>     supported in parallel. Using v1 until everything is updated with v2
>     support takes care of forward-compatibility issues. This can be done
>     on a per-table basis
>   - Gautam: Does migration require rewriting metadata?
>   - Ryan: No, the v2 format is backward compatible with v1, so the
>     update is metadata-only until the writers start using new metadata
>     that v1 would ignore (deletes) and would incorrectly modify if it
>     were to write to v2.
>   - Ryan: Also, Iceberg already has a forward-compatibility check that
>     will prevent v1 readers from loading a v2 table.
>
> --
> Ryan Blue
> Software Engineer
> Netflix
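The "class with references" shading trick Ryan mentions from Avro can be sketched roughly as below. This is an illustrative stand-in, not Avro's or Iceberg's actual code: the class name is invented, and java.util types substitute for the real Guava classes so the sketch compiles without Guava on the classpath. The idea is that a minimizing shade/shadow step only keeps classes it can see a reference to, so one class that statically lists every needed type is enough to retain them all.

```java
// Illustrative sketch of the "class with references" shading trick
// (modeled on Avro's GuavaClasses.java, but with invented names and
// java.util stand-ins in place of the real Guava types).
public class ShadedClassRefs {

  // Referencing the Class objects is enough for a minimizing shader
  // to see a use and keep these classes in the shaded jar.
  static final Class<?>[] RETAINED = {
      java.util.ArrayList.class, // stand-in for com.google.common.collect.Lists
      java.util.HashMap.class,   // stand-in for com.google.common.collect.Maps
      java.util.Optional.class   // stand-in for com.google.common.base.Optional
  };

  public static void main(String[] args) {
    for (Class<?> c : RETAINED) {
      System.out.println("retained: " + c.getName());
    }
  }
}
```

In a build, this class would live in the module that shades the dependency, and the shader's minimize option would then strip everything not reachable from it.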
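The forward-compatibility check described at the end, where a v1 reader refuses to load a table whose metadata declares a newer format version, can be sketched as follows. This is a hypothetical stand-alone illustration of the behavior from the notes, not Iceberg's actual implementation; the class and constant names are invented.

```java
// Hypothetical sketch of a format-version forward-compatibility check:
// a reader that only understands format v1 rejects metadata declaring a
// newer version, as described in the notes. Not Iceberg's actual code.
public class FormatVersionCheck {

  // Highest table format version this (hypothetical) reader understands.
  static final int SUPPORTED_FORMAT_VERSION = 1;

  static void checkFormatVersion(int tableFormatVersion) {
    if (tableFormatVersion > SUPPORTED_FORMAT_VERSION) {
      throw new UnsupportedOperationException(
          "Cannot read table metadata: format version " + tableFormatVersion
              + " is newer than supported version " + SUPPORTED_FORMAT_VERSION);
    }
  }

  public static void main(String[] args) {
    checkFormatVersion(1); // a v1 table loads fine
    try {
      checkFormatVersion(2); // a v2 table is rejected by a v1 reader
    } catch (UnsupportedOperationException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

This is why the update to v2 can be metadata-only: old readers fail fast on the version field instead of silently misreading new metadata.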
