Thanks for the correction, Adrian. I've filed the GitHub ticket here: https://github.com/apache/incubator-iceberg/issues/934 . There are two approaches mentioned there with pros/cons. It would be good to get the community's feedback on how to proceed.
-best,
R.

On Fri, Apr 17, 2020 at 6:28 AM Mass Dosage <massdos...@gmail.com> wrote:

> Thanks for the detailed notes Ryan. My thoughts on a few of the topics...
>
> 0.8.0 release - my general preference is to release early and release often. If features aren't ready, why wait? Why not go with a 0.8.0 release now and then a 0.9.0 (or whatever) a couple of weeks later with the other features? I know that with Apache projects this can sometimes be a challenge with all the ceremony around a release, getting votes, etc., but I don't think that's such a problem in the incubating stage?
>
> A clarification on the InputFormats - I think the DDL Ratandeep was referring to was more like "SHOW PARTITIONS" rather than "ADD PARTITIONS", i.e. the "read" path but for statements other than "SELECT" etc. Also, to be clear, the `mapreduce` InputFormat that was contributed - it sounds like that works for Pig, but I don't think it will work for Hive 1 or 2, since they use the `mapred` API for InputFormats. This is what we have attempted to cover in our InputFormat. I raised a WIP PR for it yesterday at https://github.com/apache/incubator-iceberg/pull/933 and would appreciate feedback from anyone interested in it.
>
> Thanks for sharing the Avro hack for shading and relocating Guava. Should I create a ticket on GitHub to capture this work? We'll then have a go at implementing it.
>
> Thanks,
>
> Adrian
>
> On Fri, 17 Apr 2020 at 04:07, OpenInx <open...@gmail.com> wrote:
>
>> Thanks for the write-up.
>> The views from the Netflix branch would be a great feature - is there any plan to port them to Apache Iceberg?
>>
>> On Fri, Apr 17, 2020 at 5:31 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>>> Here are my notes from yesterday's sync. As usual, feel free to add to this if I missed something.
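For reference on the "Avro hack" Adrian mentions above: Avro's `guava` module pairs the Maven shade plugin's `minimizeJar` option with a reference class (`GuavaClasses.java`) that pins the Guava classes to keep, so minimization strips everything unused. A hypothetical sketch of that pattern for Iceberg follows - the module coordinates and relocation prefix are illustrative assumptions, not actual project settings:

```xml
<!-- Hypothetical sketch modeled on Avro's guava shading module; -->
<!-- coordinates and the relocation prefix are illustrative only. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <!-- Drop shaded classes nothing references; a
             GuavaClasses-style file pins the ones to keep. -->
        <minimizeJar>true</minimizeJar>
        <artifactSet>
          <includes>
            <include>com.google.guava:guava</include>
          </includes>
        </artifactSet>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>org.apache.iceberg.shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

The relocation keeps Hive's own Guava off the shaded classpath, and `minimizeJar` keeps the shaded jar small, which is the point of the reference-class trick.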
>>>
>>> There were a couple of questions raised during the sync that we'd like to open up to anyone who wasn't able to attend:
>>>
>>> - Should we wait for the parallel metadata rewrite action before cutting 0.8.0 candidates?
>>> - Should we wait for ORC metrics before cutting 0.8.0 candidates?
>>>
>>> In the sync, we thought that it would be good to wait and get these in. Please reply to this if you agree or disagree.
>>>
>>> Thanks!
>>>
>>> *Attendees*:
>>>
>>> - Ryan Blue
>>> - Dan Weeks
>>> - Anjali Norwood
>>> - Jun Ma
>>> - Ratandeep Ratti
>>> - Pavan
>>> - Christine Mathiesen
>>> - Gautam Kowshik
>>> - Mass Dosage
>>> - Filip
>>> - Ryan Murray
>>>
>>> *Topics*:
>>>
>>> - 0.8.0 release blockers: actions, ORC metrics
>>> - Row-level delete update
>>> - Parquet vectorized read update
>>> - InputFormats and Hive support
>>> - Netflix branch
>>>
>>> *Discussion*:
>>>
>>> - 0.8.0 release
>>>   - Ryan: we planned to get a candidate out this week, but I think we may want to wait on two things that are about ready
>>>   - First: Anton is contributing an action to rewrite manifests in parallel that is close. Anyone interested? (Gautam is interested)
>>>   - Second: ORC is passing correctness tests, but doesn't have column-level metrics. Should we wait for this?
>>>   - Ratandeep: ORC also lacks predicate push-down support
>>>   - Ryan: I think metrics are more important than PPD because PPD is task-side and metrics help reduce the number of tasks. If we wait on one, I'd prefer to wait on metrics
>>>   - Ratandeep will look into whether he or Shardul can work on this
>>>   - General consensus was to wait for these features before getting a candidate out
>>> - Row-level deletes
>>>   - Good progress in several PRs on adding the parallel v2 write path, as Owen suggested last sync
>>>   - Junjie contributed an update to the spec for file/position delete files
>>> - Parquet vectorized read
>>>   - Dan: flat schema reads are primarily waiting on reviews
>>>   - Dan: is anyone interested in complex type support?
>>>   - Gautam needs struct and map support. Arrow 0.14.0 doesn't support maps
>>>   - Ryan (Murray): Arrow 0.17.0 will have lists, structs, and maps, but not maps of structs
>>>   - Ryan (Blue): Because we have a translation layer in Iceberg to pass off to Spark, we don't actually need support in Arrow. We are currently stuck on 0.14.0 because of changes that prevent us from avoiding a null check (see this comment <https://github.com/apache/incubator-iceberg/pull/723/files#r367667500>)
>>> - InputFormat and Hive support
>>>   - Ratandeep: The generic (mapreduce) InputFormat is in, with hooks for Pig and Hive; we need to start working on the serde side and building a Hive StorageHandler; DDL support is missing
>>>   - Ryan: What DDL support?
>>>   - Ratandeep: Statements like ADD PARTITION
>>>   - Ryan: How would all of this work in Hive? It isn't clear what components are needed right now: StorageHandler? RawStore? HiveMetaHook?
>>>   - Ratandeep: Currently working on only the read path, which requires a StorageHandler. The write path would be more difficult.
>>>   - Mass Dosage: Working on a (mapred) InputFormat for Hive in iceberg-mr, started working on a serde in iceberg-hive. Interested in writes, but not in the short or medium term
>>>   - Mass Dosage: The main problem is dependency conflicts between Hive and Iceberg, mainly Guava
>>>   - Ryan: Anyone know a good replacement for Guava collections?
>>>   - Ryan: In Avro, we have a module that shades Guava <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/pom.xml> and has a class with references <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/src/main/java/org/apache/avro/GuavaClasses.java>. Then shading can minimize the shaded classes. We could do that here.
>>>   - Ryan: Is Jackson also a problem?
>>>   - Mass Dosage: Yes, and Calcite
>>>   - Ryan: Calcite probably isn't referenced directly, so we can hopefully avoid the consistent-versions problem by excluding it
>>> - Netflix branch of Iceberg (with non-Iceberg additions)
>>>   - Ryan: We've published a copy of our current Iceberg 0.7.0-based branch <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4> for Spark 2.4 with DSv2 backported <https://github.com/Netflix/spark>
>>>   - The purpose of this is to share non-Iceberg work that we use to complement Iceberg, like views, catalogs, and DSv2 tables
>>>   - Views are SQL views <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view> that are stored and versioned like Iceberg metadata. This is how we are tracking views for Presto and Spark (Coral integration would be nice!). We are contributing the Spark DSv2 ViewCatalog to upstream Spark
>>>   - Metacat is an open metastore project from Netflix. The metacat package contains our metastore integration <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/metacat> for it.
>>>   - The batch package contains Spark and Hive table implementations for Spark's DSv2 <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/batch>, which we use for multi-catalog support.
>>>   - Gautam: how will migration to Iceberg's v2 format work for those of us in production using v1?
>>>   - Ryan: Tables are explicitly updated to v2, and both v1 and v2 will be supported in parallel. Using v1 until everything is updated with v2 support takes care of forward-compatibility issues. This can be done on a per-table basis
>>>   - Gautam: Does migration require rewriting metadata?
>>>   - Ryan: No, the format is backward compatible with v1, so the update is metadata-only until the writers start using new metadata that v1 would ignore (deletes) and would incorrectly modify if it were to write to v2.
>>>   - Ryan: Also, Iceberg already has a forward-compatibility check that will prevent v1 readers from loading a v2 table.
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>
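The forward-compatibility check mentioned at the end of the notes can be illustrated with a small sketch. This is not Iceberg's actual implementation - the class and constant names are hypothetical - it only shows the idea that a reader refuses to load any table whose format version is newer than the highest version it understands:

```java
// Hypothetical sketch of a forward-compatibility check; names are
// illustrative, not Iceberg's real classes.
public class FormatVersionCheck {
    // Highest table format version this reader understands (a v1 reader).
    static final int SUPPORTED_VERSION = 1;

    // Reject any table written with a newer format version.
    static void checkCompatibility(int tableFormatVersion) {
        if (tableFormatVersion > SUPPORTED_VERSION) {
            throw new UnsupportedOperationException(
                "Cannot read table with format version " + tableFormatVersion
                + "; this reader supports up to version " + SUPPORTED_VERSION);
        }
    }

    public static void main(String[] args) {
        checkCompatibility(1);     // v1 table: loads fine
        try {
            checkCompatibility(2); // v2 table: rejected by a v1 reader
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

This is why the upgrade can be metadata-only and per-table: v1 readers that haven't been updated simply fail fast on v2 tables instead of silently misreading (or corrupting) new metadata such as delete files.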