On Tue, May 28, 2019 at 11:19 AM Daniel Weeks <[email protected]> wrote:
>
> Hey Anton,
>
> #1) Part of the reason Iceberg has a custom reader is to help resolve some of the Iceberg-specific aspects of how Parquet files are read (e.g. column resolution by id, Iceberg expressions). Also, it's been a struggle to get agreement on a good vectorized API. I don't believe the objective of building a read path in Iceberg was strictly about performance; it was more about being able to iterate quickly and prototype new approaches that we hope to ultimately feed back to the parquet-mr project.
>
> #2) Iceberg isn't using the same path, so row-group and dictionary filtering are handled by Iceberg (though record-level filtering is not). I don't believe this reflects any specific problem with parquet-mr, other than where the filtering is implemented in the read path and the fact that Iceberg has its own expression implementation.
>
> #3) There is active work going on related to page skipping (PARQUET-1201). I believe this may be what you are referring to.
>
> #4) Ideally we would be able to contribute the read/write path implementations back to parquet-mr and get an updated API that can be used for future development.
>
> #5) We do intend to build a native Parquet-to-Arrow read path. Initially this will likely be specific to Iceberg as we iterate on the implementation, but we hope it will be generally usable across multiple engines. For example, both Spark and Presto have custom read paths for Parquet and their own columnar memory formats. We hope that we can build an Iceberg read path that can be used across both and leverage Arrow natively through columnar API abstractions.
>
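[Dan's point #5 is easier to picture with the target format made concrete: a native Parquet-to-Arrow read path would decode Parquet column chunks directly into Arrow vectors instead of materializing rows. Below is a minimal sketch using only the Arrow Java API; the class name, schema, and hand-written fill are illustrative, and the fill merely stands in for the decoding step such a reader would implement.

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ArrowBatchSketch {
  public static void main(String[] args) {
    // Hypothetical two-column schema standing in for a Parquet file's schema.
    Schema schema = new Schema(Arrays.asList(
        Field.nullable("id", new ArrowType.Int(32, true)),
        Field.nullable("name", new ArrowType.Utf8())));

    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
      IntVector id = (IntVector) root.getVector("id");
      VarCharVector name = (VarCharVector) root.getVector("name");

      // A Parquet-to-Arrow reader would decode column chunks straight into
      // these vectors; we fill them by hand here to show the target shape.
      id.allocateNew(2);
      name.allocateNew();
      id.setSafe(0, 1);
      id.setSafe(1, 2);
      name.setSafe(0, "a".getBytes(StandardCharsets.UTF_8));
      name.setSafe(1, "b".getBytes(StandardCharsets.UTF_8));
      root.setRowCount(2);

      System.out.println(root.contentToTSVString());
    }
  }
}

The appeal of this shape is that an engine-agnostic reader can hand a VectorSchemaRoot to Spark, Presto, or anything else able to wrap Arrow buffers in its own columnar abstractions, which is the "columnar API abstractions" idea in #5.]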
Presto and Spark aren't exactly analogous because they have bespoke in-memory formats, so it wouldn't make sense to develop their Parquet serialization/deserialization anywhere else. It would be unfortunate if Parquet-to-Arrow conversion in Java, at the granularity of a single file, were hidden behind Iceberg business logic, so I would encourage you to make the lower-level single-file interface as accessible to general Arrow users as possible.

> -Dan
>
> On Tue, May 28, 2019 at 8:03 AM Wes McKinney <[email protected]> wrote:
>>
>> hi Anton,
>>
>> On point #5, I would suggest doing the work either in Apache Arrow or in the Parquet Java project -- we are developing both the Parquet C++ and Rust codebases within the apache/arrow repository, so I think you would find an active community there. I know that there has been a lot of interest in decoupling from Hadoop-related Java dependencies, so you might also think about how to do that at the same time.
>>
>> - Wes
>>
>> On Tue, May 28, 2019 at 9:53 AM Anton Okolnychyi <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > I see more and more questions around the Iceberg Parquet reader. I think it would be useful to have a thread that clarifies all open questions and explains the long-term plan.
>> >
>> > 1. Am I correct that performance is the main reason to have a custom reader in Iceberg? Are there any other purposes? A common question I get is why not improve parquet-mr instead of writing a new reader. I know that almost every system that cares about performance has its own reader, but why is that?
>> >
>> > 2. Iceberg filters out row groups based on stats and dictionary pages on its own, whereas the Spark reader simply sets filters and relies on parquet-mr to do the filtering. My assumption is that there is a problem in parquet-mr. Is that correct? Is it somehow related to record materialization?
>> >
>> > 3. At some point, Julien Le Dem gave a talk about supporting page skipping in Parquet. His primary example was SELECT a, b FROM t WHERE c = 'smth' -- that is, filtering data in selected columns based on predicates on other columns. It is a highly anticipated feature on our end. Can somebody clarify whether it will be part of parquet-mr or whether we will have to implement it in Iceberg?
>> >
>> > 4. What is the long-term vision for the Parquet reader in Iceberg? Are there any plans to submit parts of it to parquet-mr? Will the Iceberg reader be mostly independent of parquet-mr?
>> >
>> > 5. We are considering reading Parquet data into Arrow. Will it be something specific to Iceberg or generally available? I believe it is a quite common use case.
>> >
>> > Thanks,
>> > Anton
>>
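[To make Anton's question #2 concrete: the Spark path builds a parquet-mr FilterPredicate and hands it to the reader, which then prunes row groups (and, once the PARQUET-1201 column indexes land, pages) on its own, while the Iceberg path reads the footer itself and decides per row group from the column statistics. A minimal sketch of the latter against parquet-mr's public footer API follows; the file name, column name, and skip rule are illustrative, and a real implementation also has to handle nulls, dictionary pages, and the full type/predicate matrix.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.statistics.IntStatistics;
import org.apache.parquet.column.statistics.Statistics;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

import java.io.IOException;

public class RowGroupPruneSketch {
  // True if the row group's min/max stats prove no value of `column`
  // can equal `target`, so the whole group can be skipped unread.
  static boolean canSkip(BlockMetaData block, String column, int target) {
    for (ColumnChunkMetaData chunk : block.getColumns()) {
      if (chunk.getPath().toDotString().equals(column)) {
        Statistics<?> stats = chunk.getStatistics();
        if (stats instanceof IntStatistics && !stats.isEmpty()) {
          IntStatistics s = (IntStatistics) stats;
          return target < s.getMin() || target > s.getMax();
        }
      }
    }
    return false; // no usable stats: must read the row group
  }

  public static void main(String[] args) throws IOException {
    // "data.parquet", column "c", and the value 42 are illustrative.
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path("data.parquet"), new Configuration()))) {
      for (BlockMetaData block : reader.getFooter().getBlocks()) {
        System.out.printf("rows=%d skip=%b%n",
            block.getRowCount(), canSkip(block, "c", 42));
      }
    }
  }
}

The push-down alternative Anton describes for Spark is a one-liner by comparison: build the same predicate with FilterApi.eq(...) and pass FilterCompat.get(predicate) to ParquetReader.Builder#withFilter, leaving the pruning to parquet-mr. Iceberg keeps the logic on its side mainly because, as Dan notes above, it evaluates its own expression implementation.]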
