Hey Anton,

#1) Part of the reason Iceberg has a custom reader is to help resolve some of the Iceberg-specific aspects of how parquet files are read (e.g. column resolution by ID, Iceberg expressions). Also, it's been a struggle to get agreement on a good vectorized API. I don't believe the objective of building a read path in Iceberg was strictly about performance, but more about being able to iterate quickly and prototype new approaches that we hope to ultimately feed back to the parquet-mr project.
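To make the column-resolution-by-ID point a bit more concrete, here is a minimal sketch (not the actual Iceberg reader code; the class and method names below are made up for illustration) of resolving requested Iceberg field IDs against a Parquet file schema using the IDs stored in the Parquet field metadata rather than column names, which is what keeps renames and schema evolution working:

    // Sketch only -- not the real Iceberg reader. FieldIdResolution and
    // resolveById are hypothetical names used for illustration.
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.Type;

    public class FieldIdResolution {

      // Maps each requested Iceberg field ID to the matching top-level Parquet
      // column in this particular file, using the field IDs stored in the
      // Parquet schema instead of column names.
      public static Map<Integer, Type> resolveById(MessageType fileSchema,
                                                   List<Integer> requestedIds) {
        Map<Integer, Type> columnsById = new HashMap<>();
        for (Type field : fileSchema.getFields()) {
          if (field.getId() != null) {
            columnsById.put(field.getId().intValue(), field);
          }
        }

        Map<Integer, Type> resolved = new HashMap<>();
        for (Integer id : requestedIds) {
          // A missing ID means the column did not exist when this file was
          // written (or was dropped); the reader fills it with nulls rather
          // than failing or accidentally matching a renamed column by name.
          resolved.put(id, columnsById.get(id));
        }
        return resolved;
      }
    }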
#2) Iceberg isn't using the same path, so row group and dictionary filtering are handled by Iceberg (though record-level filtering is not). I don't believe this is any specific problem with parquet-mr other than where it is implemented in the read path and that Iceberg has its own expression implementation. (A rough sketch of this kind of stats-based row group filtering is included after the quoted thread below.)

#3) There is active work going on related to page skipping (PARQUET-1201 <https://issues.apache.org/jira/browse/PARQUET-1201>). I believe this may be what you are referring to.

#4) Ideally we would be able to contribute the read/write path implementations back to parquet-mr and get an updated API that can be used for future development.

#5) We do intend to build a native Parquet-to-Arrow read path. Initially this will likely be specific to Iceberg as we iterate on the implementation, but we hope it will ultimately be generally usable across multiple engines. For example, both Spark and Presto have custom read paths for Parquet and their own columnar memory format. We hope that we can build an Iceberg read path that can be used across both and leverage Arrow natively through columnar API abstractions.

-Dan

On Tue, May 28, 2019 at 8:03 AM Wes McKinney <[email protected]> wrote:

> hi Anton,
>
> On point #5, I would suggest doing the work either in Apache Arrow or in the Parquet Java project -- we are developing both Parquet C++ and Rust codebases within the apache/arrow repository so I think you would find an active community there. I know that there has been a lot of interest in decoupling from Hadoop-related Java dependencies, so you might also think about how to do that at the same time.
>
> - Wes
>
> On Tue, May 28, 2019 at 9:53 AM Anton Okolnychyi <[email protected]> wrote:
> >
> > Hi,
> >
> > I see more and more questions around the Iceberg Parquet reader. I think it would be useful to have a thread that clarifies all open questions and explains the long-term plan.
> >
> > 1. Am I correct that performance is the main reason to have a custom reader in Iceberg? Are there any other purposes? A common question I get is why not improve parquet-mr instead of writing a new reader? I know that almost every system that cares about performance has its own reader, but why so?
> >
> > 2. Iceberg filters out row groups based on stats and dictionary pages on its own whereas the Spark reader simply sets filters and relies on parquet-mr to do the filtering. My assumption is that there is a problem in parquet-mr. Is it correct? Is it somehow related to record materialization?
> >
> > 3. At some point, Julien Le Dem gave a talk about supporting page skipping in Parquet. His primary example was SELECT a, b FROM t WHERE c = 'smth'. Basically, filtering data in columns based on predicates on other columns. It is a highly anticipated feature on our end. Can somebody clarify if it will be part of parquet-mr or we will have to implement this in Iceberg?
> >
> > 4. What is the long-term vision for the Parquet reader in Iceberg? Are there any plans to submit parts of it to parquet-mr? Will the Iceberg reader be mostly independent of parquet-mr?
> >
> > 5. We are considering reading Parquet data into Arrow. Will it be something specific to Iceberg or generally available? I believe it is quite a common use case.
> >
> > Thanks,
> > Anton
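For reference on #2, here is the rough sketch mentioned above: a simplified stand-in (not Iceberg's actual filter classes; the class and method names are made up) for evaluating a single column = 'value' predicate against the row group min/max statistics in the Parquet footer, so non-matching row groups can be skipped before any records are materialized. It assumes a string column so the stats can be compared as Binary values:

    // Sketch only -- RowGroupSkipSketch and matchingRowGroups are hypothetical
    // names; Iceberg's real filters use its own expression trees and also
    // check dictionary pages.
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.column.statistics.Statistics;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.util.HadoopInputFile;
    import org.apache.parquet.io.api.Binary;

    public class RowGroupSkipSketch {

      // Returns only the row groups whose footer min/max stats for the given
      // string column could contain the target value; everything else can be
      // skipped without reading any data pages.
      public static List<BlockMetaData> matchingRowGroups(Path file, String column,
                                                          String value) throws Exception {
        Binary target = Binary.fromString(value);
        List<BlockMetaData> keep = new ArrayList<>();

        try (ParquetFileReader reader = ParquetFileReader.open(
            HadoopInputFile.fromPath(file, new Configuration()))) {
          for (BlockMetaData rowGroup : reader.getFooter().getBlocks()) {
            for (ColumnChunkMetaData chunk : rowGroup.getColumns()) {
              if (!chunk.getPath().toDotString().equals(column)) {
                continue;
              }
              Statistics stats = chunk.getStatistics();
              if (stats == null || stats.isEmpty()) {
                // No stats for this chunk: we can't prove the group doesn't
                // match, so keep it.
                keep.add(rowGroup);
              } else {
                Binary min = (Binary) stats.genericGetMin();
                Binary max = (Binary) stats.genericGetMax();
                if (min.compareTo(target) <= 0 && max.compareTo(target) >= 0) {
                  keep.add(rowGroup);
                }
              }
            }
          }
        }
        return keep;
      }
    }

The point is that nothing here depends on parquet-mr's record assembly; it only needs the footer metadata, which is why Iceberg can do this step itself with its own expressions before handing the surviving row groups to the reader.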
