Thanks for writing this up. The views feature in the Netflix branch is great. Are there any plans to port it to Apache Iceberg?
On Fri, Apr 17, 2020 at 5:31 AM Ryan Blue <[email protected]> wrote:
> Here are my notes from yesterday's sync. As usual, feel free to add to
> this if I missed something.
>
> There were a couple of questions raised during the sync that we'd like to
> open up to anyone who wasn't able to attend:
>
> - Should we wait for the parallel metadata rewrite action before cutting
>   0.8.0 candidates?
> - Should we wait for ORC metrics before cutting 0.8.0 candidates?
>
> In the sync, we thought that it would be good to wait and get these in.
> Please reply to this if you agree or disagree.
>
> Thanks!
>
> *Attendees*:
>
> - Ryan Blue
> - Dan Weeks
> - Anjali Norwood
> - Jun Ma
> - Ratandeep Ratti
> - Pavan
> - Christine Mathiesen
> - Gautam Kowshik
> - Mass Dosage
> - Filip
> - Ryan Murray
>
> *Topics*:
>
> - 0.8.0 release blockers: actions, ORC metrics
> - Row-level delete update
> - Parquet vectorized read update
> - InputFormats and Hive support
> - Netflix branch
>
> *Discussion*:
>
> - 0.8.0 release
>   - Ryan: we planned to get a candidate out this week, but I think we may
>     want to wait on 2 things that are about ready
>   - First: Anton is contributing an action to rewrite manifests in
>     parallel that is close. Anyone interested? (Gautam is interested)
>   - Second: ORC is passing correctness tests, but doesn't have
>     column-level metrics. Should we wait for this?
>   - Ratandeep: ORC also lacks predicate push-down support
>   - Ryan: I think metrics are more important than PPD because PPD is task
>     side and metrics help reduce the number of tasks. If we wait on one,
>     I'd prefer to wait on metrics
>   - Ratandeep will look into whether he or Shardul can work on this
>   - General consensus was to wait for these features before getting a
>     candidate out
> - Row-level deletes
>   - Good progress in several PRs on adding the parallel v2 write path, as
>     Owen suggested last sync
>   - Junjie contributed an update to the spec for file/position delete
>     files
> - Parquet vectorized read
>   - Dan: flat schema reads are primarily waiting on reviews
>   - Dan: is anyone interested in complex type support?
>   - Gautam needs struct and map support. Arrow 0.14.0 doesn't support
>     maps
>   - Ryan (Murray): Arrow 0.17.0 will have lists, structs, and maps, but
>     not maps of structs
>   - Ryan (Blue): Because we have a translation layer in Iceberg to pass
>     off to Spark, we don't actually need support in Arrow. We are
>     currently stuck on 0.14.0 because of changes that prevent us from
>     avoiding a null check (see this comment
>     <https://github.com/apache/incubator-iceberg/pull/723/files#r367667500>)
> - InputFormat and Hive support
>   - Ratandeep: the generic (mapreduce) InputFormat is in, with hooks for
>     Pig and Hive; need to start working on the serde side and on building
>     a Hive StorageHandler; DDL support is missing
>   - Ryan: What DDL support?
>   - Ratandeep: Statements like ADD PARTITION
>   - Ryan: How would all of this work in Hive? It isn't clear what
>     components are needed right now: StorageHandler? RawStore?
>     HiveMetaHook?
>   - Ratandeep: Currently working on only the read path, which requires a
>     StorageHandler. The write path would be more difficult.
>   - Mass Dosage: Working on a (mapred) InputFormat for Hive in
>     iceberg-mr; started working on a serde in iceberg-hive. Interested in
>     writes, but not in the short or medium term
>   - Mass Dosage: The main problem is dependency conflicts between Hive
>     and Iceberg, mainly Guava
>   - Ryan: Anyone know a good replacement for Guava collections?
>   - Ryan: In Avro, we have a module that shades Guava
>     <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/pom.xml>
>     and has a class with references
>     <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/src/main/java/org/apache/avro/GuavaClasses.java>.
>     Then shading can minimize the shaded classes. We could do that here.
>   - Ryan: Is Jackson also a problem?
>   - Mass Dosage: Yes, and Calcite
>   - Ryan: Calcite probably isn't referenced directly, so we can hopefully
>     avoid the consistent-versions problem by excluding it
> - Netflix branch of Iceberg (with non-Iceberg additions)
>   - Ryan: We've published a copy of our current Iceberg 0.7.0-based
>     branch <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4>
>     for Spark 2.4 with DSv2 backported
>     <https://github.com/Netflix/spark>
>   - The purpose of this is to share non-Iceberg work that we use to
>     complement Iceberg, like views, catalogs, and DSv2 tables
>   - Views are SQL views
>     <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view>
>     that are stored and versioned like Iceberg metadata. This is how we
>     are tracking views for Presto and Spark (Coral integration would be
>     nice!). We are contributing the Spark DSv2 ViewCatalog to upstream
>     Spark
>   - Metacat is an open metastore project from Netflix. The metacat
>     package contains our metastore integration
>     <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/metacat>
>     for it.
>   - The batch package contains Spark and Hive table implementations for
>     Spark's DSv2
>     <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/batch>,
>     which we use for multi-catalog support.
>   - Gautam: how will migration to Iceberg's v2 format work for those of
>     us in production using v1?
>   - Ryan: Tables are explicitly updated to v2, and both v1 and v2 will be
>     supported in parallel. Using v1 until everything is updated with v2
>     support takes care of forward-compatibility issues. This can be done
>     on a per-table basis
>   - Gautam: Does migration require rewriting metadata?
>   - Ryan: No, the v2 format is backward compatible with v1, so the
>     update is metadata-only until the writers start using new metadata
>     that v1 would ignore (deletes) and would incorrectly modify if it
>     were to write to v2.
>   - Ryan: Also, Iceberg already has a forward-compatibility check that
>     will prevent v1 readers from loading a v2 table.
>
> --
> Ryan Blue
> Software Engineer
> Netflix
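The "class with references" shading trick Ryan mentions from Avro can be sketched roughly as below. This is an illustrative stand-in, not Avro's or Iceberg's actual code: the class name is invented, and java.util types substitute for the real Guava classes so the sketch compiles without Guava on the classpath. The idea is that a minimizing shade/shadow step only keeps classes it can see a reference to, so one class that statically lists every needed type is enough to retain them all.

```java
// Illustrative sketch of the "class with references" shading trick
// (modeled on Avro's GuavaClasses.java, but with invented names and
// java.util stand-ins in place of the real Guava types).
public class ShadedClassRefs {

  // Referencing the Class objects is enough for a minimizing shader
  // to see a use and keep these classes in the shaded jar.
  static final Class<?>[] RETAINED = {
      java.util.ArrayList.class, // stand-in for com.google.common.collect.Lists
      java.util.HashMap.class,   // stand-in for com.google.common.collect.Maps
      java.util.Optional.class   // stand-in for com.google.common.base.Optional
  };

  public static void main(String[] args) {
    for (Class<?> c : RETAINED) {
      System.out.println("retained: " + c.getName());
    }
  }
}
```

In a build, this class would live in the module that shades the dependency, and the shader's minimize option would then strip everything not reachable from it.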
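The forward-compatibility check described at the end, where a v1 reader refuses to load a table whose metadata declares a newer format version, can be sketched as follows. This is a hypothetical stand-alone illustration of the behavior from the notes, not Iceberg's actual implementation; the class and constant names are invented.

```java
// Hypothetical sketch of a format-version forward-compatibility check:
// a reader that only understands format v1 rejects metadata declaring a
// newer version, as described in the notes. Not Iceberg's actual code.
public class FormatVersionCheck {

  // Highest table format version this (hypothetical) reader understands.
  static final int SUPPORTED_FORMAT_VERSION = 1;

  static void checkFormatVersion(int tableFormatVersion) {
    if (tableFormatVersion > SUPPORTED_FORMAT_VERSION) {
      throw new UnsupportedOperationException(
          "Cannot read table metadata: format version " + tableFormatVersion
              + " is newer than supported version " + SUPPORTED_FORMAT_VERSION);
    }
  }

  public static void main(String[] args) {
    checkFormatVersion(1); // a v1 table loads fine
    try {
      checkFormatVersion(2); // a v2 table is rejected by a v1 reader
    } catch (UnsupportedOperationException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

This is why the update to v2 can be metadata-only: old readers fail fast on the version field instead of silently misreading new metadata.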
