Thanks for the detailed notes, Ryan. My thoughts on a few of the topics...

0.8.0 release - my general preference is to release early and release
often. If some features aren't ready, why wait for them? Why not go with
a 0.8.0 release now and then a 0.9.0 (or whatever) a couple of weeks
later with the other features? I know that with Apache projects this can
sometimes be a challenge with all the ceremony around a release, getting
votes, etc., but I don't think that's such a problem in the incubating
stage?

A clarification on the InputFormats - I think the DDL Ratandeep was
referring to was more like "SHOW PARTITIONS" rather than "ADD PARTITION",
i.e. the "read" path, but for statements other than "SELECT". Also, to be
clear about the `mapreduce` InputFormat that was contributed: it sounds
like it works for Pig, but I don't think it will work for Hive 1 or 2,
since they use the `mapred` API for InputFormats. This is what we have
attempted to cover in our InputFormat. I raised a WIP PR for it yesterday
at https://github.com/apache/incubator-iceberg/pull/933 and would
appreciate feedback from anyone interested in it.
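
For anyone not steeped in the Hadoop APIs, here's a rough illustration of
why the two don't line up. The class and package names below are just
illustrative, not what's in the PR; the point is that Hive 1/2 instantiate
the old `org.apache.hadoop.mapred.InputFormat` interface, whose shape is
different from the `mapreduce` one:

    // Illustrative sketch only - names do not reflect the actual PR.
    package com.example.iceberg.mr;

    import java.io.IOException;
    import org.apache.hadoop.mapred.InputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class MapredIcebergInputFormat<K, V> implements InputFormat<K, V> {

      @Override
      public InputSplit[] getSplits(JobConf conf, int numSplits) throws IOException {
        // mapred hands us a JobConf and expects an InputSplit[]; the newer
        // mapreduce API returns List<InputSplit> from a JobContext instead,
        // so a mapreduce-based InputFormat can't be handed to Hive 1/2
        // directly.
        throw new UnsupportedOperationException("sketch only");
      }

      @Override
      public RecordReader<K, V> getRecordReader(InputSplit split, JobConf conf,
                                                Reporter reporter) throws IOException {
        // Similarly, mapreduce's createRecordReader(split, context) has no
        // direct equivalent in this older interface.
        throw new UnsupportedOperationException("sketch only");
      }
    }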

Thanks for sharing the Avro hack for shading and relocating Guava. Should I
create a ticket on GitHub to capture this work? We'll then have a go at
implementing it.
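
In case it's useful for the ticket, here's my understanding of the
pattern as a minimal sketch - the package name and the set of referenced
classes below are placeholders, not a real module. The idea is a class in
the shaded module that references the Guava classes we actually use, so
the shade plugin's minimizeJar step keeps them in the relocated jar:

    // Sketch of the Avro GuavaClasses pattern; class list is illustrative.
    package org.apache.iceberg.shaded;

    import com.google.common.base.Preconditions;
    import com.google.common.collect.ImmutableList;
    import com.google.common.collect.ImmutableMap;
    import com.google.common.collect.Lists;
    import com.google.common.collect.Maps;
    import com.google.common.collect.Sets;

    public class GuavaClasses {
      // Reference one class per Guava package Iceberg depends on, so that
      // minimizeJar does not strip them from the shaded, relocated jar.
      static {
        Preconditions.class.getName();
        ImmutableList.class.getName();
        ImmutableMap.class.getName();
        Lists.class.getName();
        Maps.class.getName();
        Sets.class.getName();
      }
    }

If I've misunderstood the pattern, corrections welcome.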

Thanks,

Adrian


On Fri, 17 Apr 2020 at 04:07, OpenInx <open...@gmail.com> wrote:

> Thanks for the write-up.
> The views in the Netflix branch are a great feature. Is there any plan to
> port them to Apache Iceberg?
>
> On Fri, Apr 17, 2020 at 5:31 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Here are my notes from yesterday’s sync. As usual, feel free to add to
>> this if I missed something.
>>
>> There were a couple of questions raised during the sync that we’d like to
>> open up to anyone who wasn’t able to attend:
>>
>>    - Should we wait for the parallel metadata rewrite action before
>>    cutting 0.8.0 candidates?
>>    - Should we wait for ORC metrics before cutting 0.8.0 candidates?
>>
>> In the sync, we thought that it would be good to wait and get these in.
>> Please reply to this if you agree or disagree.
>>
>> Thanks!
>>
>> *Attendees*:
>>
>>    - Ryan Blue
>>    - Dan Weeks
>>    - Anjali Norwood
>>    - Jun Ma
>>    - Ratandeep Ratti
>>    - Pavan
>>    - Christine Mathiesen
>>    - Gautam Kowshik
>>    - Mass Dosage
>>    - Filip
>>    - Ryan Murray
>>
>> *Topics*:
>>
>>    - 0.8.0 release blockers: actions, ORC metrics
>>    - Row-level delete update
>>    - Parquet vectorized read update
>>    - InputFormats and Hive support
>>    - Netflix branch
>>
>> *Discussion*:
>>
>>    - 0.8.0 release
>>       - Ryan: we planned to get a candidate out this week, but I think
>>       we may want to wait on 2 things that are about ready
>>       - First: Anton is contributing an action to rewrite manifests in
>>       parallel that is close. Anyone interested? (Gautam is interested)
>>       - Second: ORC is passing correctness tests, but doesn’t have
>>       column-level metrics. Should we wait for this?
>>       - Ratandeep: ORC also lacks predicate push-down support
>>       - Ryan: I think metrics are more important than PPD because PPD is
>>       task side and metrics help reduce the number of tasks. If we wait on
>>       one, I’d prefer to wait on metrics
>>       - Ratandeep will look into whether he or Shardul can work on this
>>       - General consensus was to wait for these features before getting
>>       a candidate out
>>    - Row-level deletes
>>       - Good progress in several PRs on adding the parallel v2 write
>>       path, as Owen suggested last sync
>>       - Junjie contributed an update to the spec for file/position
>>       delete files
>>    - Parquet vectorized read
>>       - Dan: flat schema reads are primarily waiting on reviews
>>       - Dan: is anyone interested in complex type support?
>>       - Gautam needs struct and map support. Arrow 0.14.0 doesn’t support
>>       maps
>>       - Ryan (Murray): Arrow 0.17.0 will have lists, structs, and maps, but
>>       not maps of structs
>>       - Ryan (Blue): Because we have a translation layer in Iceberg to
>>       pass off to Spark, we don’t actually need support in Arrow. We are
>>       currently stuck on Arrow 0.14.0 because of changes that prevent us
>>       from avoiding a null check (see this comment
>>       <https://github.com/apache/incubator-iceberg/pull/723/files#r367667500>)
>>    - InputFormat and Hive support
>>       - Ratandeep: Generic (mapreduce) InputFormat is in with hooks for Pig
>>       and Hive; need to start working on the serde side and building a Hive
>>       StorageHandler; missing DDL support
>>       - Ryan: What DDL support?
>>       - Ratandeep: Statements like ADD PARTITION
>>       - Ryan: How would all of this work in Hive? It isn’t clear what
>>       components are needed right now: StorageHandler? RawStore?
>>       HiveMetaHook?
>>       - Ratandeep: Currently working on only the read path, which
>>       requires a StorageHandler. The write path would be more difficult.
>>       - Mass Dosage: Working on a (mapred) InputFormat for Hive in
>>       iceberg-mr, started working on a serde in iceberg-hive. Interested in
>>       writes, but not in the short or medium term
>>       - Mass Dosage: The main problem is dependency conflicts between
>>       Hive and Iceberg, mainly Guava
>>       - Ryan: Anyone know a good replacement for Guava collections?
>>       - Ryan: In Avro, we have a module that shades Guava
>>       <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/pom.xml>
>>       and has a class with references
>>       <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/src/main/java/org/apache/avro/GuavaClasses.java>.
>>       Then shading can minimize the shaded classes. We could do that here.
>>       - Ryan: Is Jackson also a problem?
>>       - Mass Dosage: Yes, and Calcite
>>       - Ryan: Calcite probably isn’t referenced directly so we can
>>       hopefully avoid the consistent versions problem by excluding it
>>    - Netflix branch of Iceberg (with non-Iceberg additions)
>>       - Ryan: We’ve published a copy of our current Iceberg 0.7.0-based
>>       branch <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4>
>>       for Spark 2.4 with DSv2 backported
>>       <https://github.com/Netflix/spark>
>>       - The purpose of this is to share non-Iceberg work that we use to
>>       complement Iceberg, like views, catalogs, and DSv2 tables
>>       - Views are SQL views
>>       <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view>
>>       that are stored and versioned like Iceberg metadata. This is how we
>>       are tracking views for Presto and Spark (Coral integration would be
>>       nice!). We are contributing the Spark DSv2 ViewCatalog to upstream
>>       Spark
>>       - Metacat is an open metastore project from Netflix. The metacat
>>       package contains our metastore integration
>>       <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/metacat>
>>       for it.
>>       - The batch package contains Spark and Hive table implementations
>>       for Spark’s DSv2
>>       <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/batch>,
>>       which we use for multi-catalog support.
>>    - Gautam: how will migration to Iceberg’s v2 format work for those of
>>    us in production using v1?
>>       - Ryan: Tables are explicitly updated to v2, and both v1 and v2
>>       will be supported in parallel. Using v1 until everything is updated
>>       with v2 support takes care of forward-compatibility issues. This can
>>       be done on a per-table basis
>>       - Gautam: Does migration require rewriting metadata?
>>       - Ryan: No, the format is backward compatible with v1, so the
>>       update is metadata-only until the writers start using new metadata
>>       that v1 would ignore (deletes) and would incorrectly modify if it
>>       were to write to v2.
>>       - Ryan: Also, Iceberg already has a forward-compatibility check
>>       that will prevent v1 readers from loading a v2 table.
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
