Hey Anton,

#1) Part of the reason Iceberg has a custom reader is to help resolve some of the Iceberg-specific aspects of how parquet files are read (e.g. column resolution by ID, Iceberg expressions). Also, it's been a struggle to get agreement on a good vectorized API. I don't believe the objective of building a read path in Iceberg was strictly about performance, but more about being able to iterate quickly and prototype new approaches that we hope to ultimately feed back to the parquet-mr project.
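To make the column-resolution-by-ID point a bit more concrete, here is a minimal sketch (not the actual Iceberg reader code; the class and method names below are made up for illustration) of resolving requested Iceberg field IDs against a Parquet file schema using the IDs stored in the Parquet field metadata rather than column names, which is what keeps renames and schema evolution working:

    // Sketch only -- not the real Iceberg reader. FieldIdResolution and
    // resolveById are hypothetical names used for illustration.
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.Type;

    public class FieldIdResolution {

      // Maps each requested Iceberg field ID to the matching top-level Parquet
      // column in this particular file, using the field IDs stored in the
      // Parquet schema instead of column names.
      public static Map<Integer, Type> resolveById(MessageType fileSchema,
                                                   List<Integer> requestedIds) {
        Map<Integer, Type> columnsById = new HashMap<>();
        for (Type field : fileSchema.getFields()) {
          if (field.getId() != null) {
            columnsById.put(field.getId().intValue(), field);
          }
        }

        Map<Integer, Type> resolved = new HashMap<>();
        for (Integer id : requestedIds) {
          // A missing ID means the column did not exist when this file was
          // written (or was dropped); the reader fills it with nulls rather
          // than failing or accidentally matching a renamed column by name.
          resolved.put(id, columnsById.get(id));
        }
        return resolved;
      }
    }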
#2) Iceberg isn't using the same path, so row group and dictionary filtering are handled by Iceberg (though record-level filtering is not). I don't believe this is any specific problem with parquet-mr other than where it is implemented in the read path and that Iceberg has its own expression implementation. (A rough sketch of this kind of stats-based row group filtering is included after the quoted thread below.)

#3) There is active work going on related to page skipping (PARQUET-1201 <https://issues.apache.org/jira/browse/PARQUET-1201>). I believe this may be what you are referring to.

#4) Ideally we would be able to contribute the read/write path implementations back to parquet-mr and get an updated API that can be used for future development.

#5) We do intend to build a native Parquet-to-Arrow read path. Initially this will likely be specific to Iceberg as we iterate on the implementation, but we hope it will ultimately be generally usable across multiple engines. For example, both Spark and Presto have custom read paths for Parquet and their own columnar memory format. We hope that we can build an Iceberg read path that can be used across both and leverage Arrow natively through columnar API abstractions.

-Dan

On Tue, May 28, 2019 at 8:03 AM Wes McKinney <[email protected]> wrote:

> hi Anton,
>
> On point #5, I would suggest doing the work either in Apache Arrow or in the Parquet Java project -- we are developing both Parquet C++ and Rust codebases within the apache/arrow repository so I think you would find an active community there. I know that there has been a lot of interest in decoupling from Hadoop-related Java dependencies, so you might also think about how to do that at the same time.
>
> - Wes
>
> On Tue, May 28, 2019 at 9:53 AM Anton Okolnychyi <[email protected]> wrote:
> >
> > Hi,
> >
> > I see more and more questions around the Iceberg Parquet reader. I think it would be useful to have a thread that clarifies all open questions and explains the long-term plan.
> >
> > 1. Am I correct that performance is the main reason to have a custom reader in Iceberg? Are there any other purposes? A common question I get is why not improve parquet-mr instead of writing a new reader? I know that almost every system that cares about performance has its own reader, but why so?
> >
> > 2. Iceberg filters out row groups based on stats and dictionary pages on its own whereas the Spark reader simply sets filters and relies on parquet-mr to do the filtering. My assumption is that there is a problem in parquet-mr. Is it correct? Is it somehow related to record materialization?
> >
> > 3. At some point, Julien Le Dem gave a talk about supporting page skipping in Parquet. His primary example was SELECT a, b FROM t WHERE c = 'smth'. Basically, filtering data in columns based on predicates on other columns. It is a highly anticipated feature on our end. Can somebody clarify if it will be part of parquet-mr or we will have to implement this in Iceberg?
> >
> > 4. What is the long-term vision for the Parquet reader in Iceberg? Are there any plans to submit parts of it to parquet-mr? Will the Iceberg reader be mostly independent of parquet-mr?
> >
> > 5. We are considering reading Parquet data into Arrow. Will it be something specific to Iceberg or generally available? I believe it is quite a common use case.
> >
> > Thanks,
> > Anton
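For reference on #2, here is the rough sketch mentioned above: a simplified stand-in (not Iceberg's actual filter classes; the class and method names are made up) for evaluating a single column = 'value' predicate against the row group min/max statistics in the Parquet footer, so non-matching row groups can be skipped before any records are materialized. It assumes a string column so the stats can be compared as Binary values:

    // Sketch only -- RowGroupSkipSketch and matchingRowGroups are hypothetical
    // names; Iceberg's real filters use its own expression trees and also
    // check dictionary pages.
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.column.statistics.Statistics;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.util.HadoopInputFile;
    import org.apache.parquet.io.api.Binary;

    public class RowGroupSkipSketch {

      // Returns only the row groups whose footer min/max stats for the given
      // string column could contain the target value; everything else can be
      // skipped without reading any data pages.
      public static List<BlockMetaData> matchingRowGroups(Path file, String column,
                                                          String value) throws Exception {
        Binary target = Binary.fromString(value);
        List<BlockMetaData> keep = new ArrayList<>();

        try (ParquetFileReader reader = ParquetFileReader.open(
            HadoopInputFile.fromPath(file, new Configuration()))) {
          for (BlockMetaData rowGroup : reader.getFooter().getBlocks()) {
            for (ColumnChunkMetaData chunk : rowGroup.getColumns()) {
              if (!chunk.getPath().toDotString().equals(column)) {
                continue;
              }
              Statistics stats = chunk.getStatistics();
              if (stats == null || stats.isEmpty()) {
                // No stats for this chunk: we can't prove the group doesn't
                // match, so keep it.
                keep.add(rowGroup);
              } else {
                Binary min = (Binary) stats.genericGetMin();
                Binary max = (Binary) stats.genericGetMax();
                if (min.compareTo(target) <= 0 && max.compareTo(target) >= 0) {
                  keep.add(rowGroup);
                }
              }
            }
          }
        }
        return keep;
      }
    }

The point is that nothing here depends on parquet-mr's record assembly; it only needs the footer metadata, which is why Iceberg can do this step itself with its own expressions before handing the surviving row groups to the reader.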
