Hi Anton,

On point #5, I would suggest doing the work either in Apache Arrow or in the Parquet Java project. We are developing both the Parquet C++ and Rust codebases within the apache/arrow repository, so I think you would find an active community there. I know that there has been a lot of interest in decoupling from Hadoop-related Java dependencies, so you might also think about how to do that at the same time.

- Wes

On Tue, May 28, 2019 at 9:53 AM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
>
> Hi,
>
> I see more and more questions around the Iceberg Parquet reader. I think it
> would be useful to have a thread that clarifies all open questions and
> explains the long-term plan.
>
> 1. Am I correct that performance is the main reason to have a custom reader
> in Iceberg? Are there any other purposes? A common question I get is why not
> improve parquet-mr instead of writing a new reader? I know that almost every
> system that cares about performance has its own reader, but why so?
>
> 2. Iceberg filters out row groups based on stats and dictionary pages on its
> own whereas the Spark reader simply sets filters and relies on parquet-mr to
> do the filtering. My assumption is that there is a problem in parquet-mr. Is
> that correct? Is it somehow related to record materialization?
>
> 3. At some point, Julien Le Dem gave a talk about supporting page skipping in
> Parquet. His primary example was SELECT a, b FROM t WHERE c = 'smth'.
> Basically, filtering data in columns based on predicates on other columns. It
> is a highly anticipated feature on our end. Can somebody clarify whether it
> will be part of parquet-mr or whether we will have to implement it in Iceberg?
>
> 4. What is the long-term vision for the Parquet reader in Iceberg? Are there
> any plans to submit parts of it to parquet-mr? Will the Iceberg reader be
> mostly independent of parquet-mr?
>
> 5. We are considering reading Parquet data into Arrow. Will it be something
> specific to Iceberg or generally available? I believe it is quite a common
> use case.
>
> Thanks,
> Anton
>
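
For readers following the thread, the stats-based row-group filtering mentioned in question 2 can be illustrated with a minimal sketch. This is not Iceberg's or parquet-mr's actual code: `RowGroupStats`, `may_contain`, and `prune_row_groups` are hypothetical names, and the min/max values are made up for illustration. It only shows the general idea of skipping row groups whose column bounds cannot match an equality predicate such as `c = 'smth'`.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group statistics: column name -> (min, max)."""
    index: int
    min_max: Dict[str, Tuple[str, str]]


def may_contain(stats: RowGroupStats, column: str, value: str) -> bool:
    """Return False only when the stats prove the value cannot be present."""
    bounds = stats.min_max.get(column)
    if bounds is None:
        # No statistics for this column: the row group cannot be ruled out.
        return True
    lo, hi = bounds
    return lo <= value <= hi


def prune_row_groups(groups: List[RowGroupStats], column: str, value: str) -> List[int]:
    """Keep the indexes of row groups that might satisfy `column = value`."""
    return [g.index for g in groups if may_contain(g, column, value)]


# Illustrative (made-up) statistics for three row groups.
groups = [
    RowGroupStats(0, {"c": ("aaa", "mmm")}),
    RowGroupStats(1, {"c": ("nnn", "zzz")}),
    RowGroupStats(2, {}),  # stats missing, so it must be read
]

selected = prune_row_groups(groups, "c", "smth")
print(selected)
```

Only the selected row groups would then need to be read and filtered row by row. Page skipping, as raised in question 3, applies the same kind of bounds check at page granularity rather than row-group granularity.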