Hi Anton,

On point #5, I would suggest doing the work either in Apache Arrow or
in the Parquet Java project -- we are developing both Parquet C++ and
Rust codebases within the apache/arrow repository so I think you would
find an active community there. I know that there has been a lot of
interest in decoupling from Hadoop-related Java dependencies, so you
might also think about how to do that at the same time.

- Wes

On Tue, May 28, 2019 at 9:53 AM Anton Okolnychyi
<aokolnyc...@apple.com.invalid> wrote:
>
> Hi,
>
> I see more and more questions around Iceberg Parquet reader. I think it would 
> be useful to have a thread that clarifies all open questions and explains the 
> long-term plan.
>
> 1. Am I correct that performance is the main reason to have a custom reader 
> in Iceberg? Are there any other purposes? A common question I get is why not 
> improve parquet-mr instead of writing a new reader? I know that almost every 
> system that cares about performance has its own reader, but why so?
>
> 2. Iceberg filters out row groups based on stats and dictionary pages on its
> own, whereas the Spark reader simply sets filters and relies on parquet-mr to
> do the filtering. My assumption is that there is a problem in parquet-mr. Is
> that correct? Is it somehow related to record materialization?
>
> 3. At some point, Julien Le Dem gave a talk about supporting page skipping in 
> Parquet. His primary example was SELECT a, b FROM t WHERE c = 'smth'. 
> Basically, filtering data in columns based on predicates on other columns. It 
> is a highly anticipated feature on our end. Can somebody clarify whether it
> will be part of parquet-mr or whether we will have to implement it in Iceberg?
>
> 4. What is the long-term vision for the Parquet reader in Iceberg? Are there
> any plans to submit parts of it to parquet-mr? Will the Iceberg reader be
> mostly independent of parquet-mr?
>
> 5. We are considering reading Parquet data into Arrow. Will it be something
> specific to Iceberg or generally available? I believe it is quite a common
> use case.
>
> Thanks,
> Anton
>