Thanks for the correction, Adrian. I've filed the GitHub ticket here: https://github.com/apache/incubator-iceberg/issues/934 . There are two approaches mentioned there with pros/cons. It would be good to get the community's feedback on how to proceed.
-best,
R.

On Fri, Apr 17, 2020 at 6:28 AM Mass Dosage <massdos...@gmail.com> wrote:

> Thanks for the detailed notes Ryan. My thoughts on a few of the topics...
>
> 0.8.0 release - my general preference is to release early and release often. If features aren't ready, why wait? Why not go with a 0.8.0 release now and then a 0.9.0 (or whatever) a couple of weeks later with the other features? I know that with Apache projects this can sometimes be a challenge with all the ceremony around a release, getting votes, etc., but I don't think that's such a problem in the incubating stage?
>
> A clarification on the InputFormats - I think the DDL Ratandeep was referring to was more like "SHOW PARTITIONS" rather than "ADD PARTITIONS", i.e. the "read" path but for statements other than "SELECT" etc. Also, to be clear, the `mapreduce` InputFormat that was contributed - it sounds like that works for Pig, but I don't think it will work for Hive 1 or 2, since they use the `mapred` API for InputFormats. This is what we have attempted to cover in our InputFormat. I raised a WIP PR for it yesterday at https://github.com/apache/incubator-iceberg/pull/933 and would appreciate feedback from anyone interested in it.
>
> Thanks for sharing the Avro hack for shading and relocating Guava. Should I create a ticket on GitHub to capture this work? We'll then have a go at implementing it.
>
> Thanks,
>
> Adrian
>
> On Fri, 17 Apr 2020 at 04:07, OpenInx <open...@gmail.com> wrote:
>
>> Thanks for the write-up.
>> The views from the Netflix branch would be a great feature - is there any plan to port them to Apache Iceberg?
>>
>> On Fri, Apr 17, 2020 at 5:31 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>>> Here are my notes from yesterday's sync. As usual, feel free to add to this if I missed something.
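For reference on the "Avro hack" Adrian mentions above: Avro's `guava` module pairs the Maven shade plugin's `minimizeJar` option with a reference class (`GuavaClasses.java`) that pins the Guava classes to keep, so minimization strips everything unused. A hypothetical sketch of that pattern for Iceberg follows - the module coordinates and relocation prefix are illustrative assumptions, not actual project settings:

```xml
<!-- Hypothetical sketch modeled on Avro's guava shading module; -->
<!-- coordinates and the relocation prefix are illustrative only. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <!-- Drop shaded classes nothing references; a
             GuavaClasses-style file pins the ones to keep. -->
        <minimizeJar>true</minimizeJar>
        <artifactSet>
          <includes>
            <include>com.google.guava:guava</include>
          </includes>
        </artifactSet>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>org.apache.iceberg.shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

The relocation keeps Hive's own Guava off the shaded classpath, and `minimizeJar` keeps the shaded jar small, which is the point of the reference-class trick.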
>>>
>>> There were a couple of questions raised during the sync that we'd like to open up to anyone who wasn't able to attend:
>>>
>>> - Should we wait for the parallel metadata rewrite action before cutting 0.8.0 candidates?
>>> - Should we wait for ORC metrics before cutting 0.8.0 candidates?
>>>
>>> In the sync, we thought that it would be good to wait and get these in. Please reply to this if you agree or disagree.
>>>
>>> Thanks!
>>>
>>> *Attendees*:
>>>
>>> - Ryan Blue
>>> - Dan Weeks
>>> - Anjali Norwood
>>> - Jun Ma
>>> - Ratandeep Ratti
>>> - Pavan
>>> - Christine Mathiesen
>>> - Gautam Kowshik
>>> - Mass Dosage
>>> - Filip
>>> - Ryan Murray
>>>
>>> *Topics*:
>>>
>>> - 0.8.0 release blockers: actions, ORC metrics
>>> - Row-level delete update
>>> - Parquet vectorized read update
>>> - InputFormats and Hive support
>>> - Netflix branch
>>>
>>> *Discussion*:
>>>
>>> - 0.8.0 release
>>>   - Ryan: we planned to get a candidate out this week, but I think we may want to wait on two things that are about ready
>>>   - First: Anton is contributing an action to rewrite manifests in parallel that is close. Anyone interested? (Gautam is interested)
>>>   - Second: ORC is passing correctness tests, but doesn't have column-level metrics. Should we wait for this?
>>>   - Ratandeep: ORC also lacks predicate push-down support
>>>   - Ryan: I think metrics are more important than PPD because PPD is task-side and metrics help reduce the number of tasks. If we wait on one, I'd prefer to wait on metrics
>>>   - Ratandeep will look into whether he or Shardul can work on this
>>>   - General consensus was to wait for these features before getting a candidate out
>>> - Row-level deletes
>>>   - Good progress in several PRs on adding the parallel v2 write path, as Owen suggested last sync
>>>   - Junjie contributed an update to the spec for file/position delete files
>>> - Parquet vectorized read
>>>   - Dan: flat schema reads are primarily waiting on reviews
>>>   - Dan: is anyone interested in complex type support?
>>>   - Gautam needs struct and map support. Arrow 0.14.0 doesn't support maps
>>>   - Ryan (Murray): Arrow 0.17.0 will have lists, structs, and maps, but not maps of structs
>>>   - Ryan (Blue): Because we have a translation layer in Iceberg to pass off to Spark, we don't actually need support in Arrow. We are currently stuck on 0.14.0 because of changes that prevent us from avoiding a null check (see this comment <https://github.com/apache/incubator-iceberg/pull/723/files#r367667500>)
>>> - InputFormat and Hive support
>>>   - Ratandeep: The generic (mapreduce) InputFormat is in, with hooks for Pig and Hive; we need to start working on the serde side and building a Hive StorageHandler; DDL support is missing
>>>   - Ryan: What DDL support?
>>>   - Ratandeep: Statements like ADD PARTITION
>>>   - Ryan: How would all of this work in Hive? It isn't clear what components are needed right now: StorageHandler? RawStore? HiveMetaHook?
>>>   - Ratandeep: Currently working on only the read path, which requires a StorageHandler. The write path would be more difficult.
>>>   - Mass Dosage: Working on a (mapred) InputFormat for Hive in iceberg-mr, started working on a serde in iceberg-hive. Interested in writes, but not in the short or medium term
>>>   - Mass Dosage: The main problem is dependency conflicts between Hive and Iceberg, mainly Guava
>>>   - Ryan: Anyone know a good replacement for Guava collections?
>>>   - Ryan: In Avro, we have a module that shades Guava <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/pom.xml> and has a class with references <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/src/main/java/org/apache/avro/GuavaClasses.java>. Then shading can minimize the shaded classes. We could do that here.
>>>   - Ryan: Is Jackson also a problem?
>>>   - Mass Dosage: Yes, and Calcite
>>>   - Ryan: Calcite probably isn't referenced directly, so we can hopefully avoid the consistent-versions problem by excluding it
>>> - Netflix branch of Iceberg (with non-Iceberg additions)
>>>   - Ryan: We've published a copy of our current Iceberg 0.7.0-based branch <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4> for Spark 2.4 with DSv2 backported <https://github.com/Netflix/spark>
>>>   - The purpose of this is to share non-Iceberg work that we use to complement Iceberg, like views, catalogs, and DSv2 tables
>>>   - Views are SQL views <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view> that are stored and versioned like Iceberg metadata. This is how we are tracking views for Presto and Spark (Coral integration would be nice!). We are contributing the Spark DSv2 ViewCatalog to upstream Spark
>>>   - Metacat is an open metastore project from Netflix. The metacat package contains our metastore integration <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/metacat> for it.
>>>   - The batch package contains Spark and Hive table implementations for Spark's DSv2 <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/batch>, which we use for multi-catalog support.
>>>   - Gautam: how will migration to Iceberg's v2 format work for those of us in production using v1?
>>>   - Ryan: Tables are explicitly updated to v2, and both v1 and v2 will be supported in parallel. Using v1 until everything is updated with v2 support takes care of forward-compatibility issues. This can be done on a per-table basis
>>>   - Gautam: Does migration require rewriting metadata?
>>>   - Ryan: No, the format is backward compatible with v1, so the update is metadata-only until the writers start using new metadata that v1 would ignore (deletes) and would incorrectly modify if it were to write to v2.
>>>   - Ryan: Also, Iceberg already has a forward-compatibility check that will prevent v1 readers from loading a v2 table.
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>
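The forward-compatibility check mentioned at the end of the notes can be illustrated with a small sketch. This is not Iceberg's actual implementation - the class and constant names are hypothetical - it only shows the idea that a reader refuses to load any table whose format version is newer than the highest version it understands:

```java
// Hypothetical sketch of a forward-compatibility check; names are
// illustrative, not Iceberg's real classes.
public class FormatVersionCheck {
    // Highest table format version this reader understands (a v1 reader).
    static final int SUPPORTED_VERSION = 1;

    // Reject any table written with a newer format version.
    static void checkCompatibility(int tableFormatVersion) {
        if (tableFormatVersion > SUPPORTED_VERSION) {
            throw new UnsupportedOperationException(
                "Cannot read table with format version " + tableFormatVersion
                + "; this reader supports up to version " + SUPPORTED_VERSION);
        }
    }

    public static void main(String[] args) {
        checkCompatibility(1);     // v1 table: loads fine
        try {
            checkCompatibility(2); // v2 table: rejected by a v1 reader
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

This is why the upgrade can be metadata-only and per-table: v1 readers that haven't been updated simply fail fast on v2 tables instead of silently misreading (or corrupting) new metadata such as delete files.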