On Tue, May 28, 2019 at 11:19 AM Daniel Weeks <[email protected]> wrote:
>
> Hey Anton,
>
> #1) Part of the reason Iceberg has a custom reader is to help resolve some of the Iceberg-specific aspects of how Parquet files are read (e.g. column resolution by id, Iceberg expressions). Also, it's been a struggle to get agreement on a good vectorized API. I don't believe the objective of building a read path in Iceberg was strictly about performance; it was more about being able to iterate quickly and prototype new approaches that we hope to ultimately feed back to the parquet-mr project.
>
> #2) Iceberg isn't using the same path, so row-group and dictionary filtering are handled by Iceberg (though record-level filtering is not). I don't believe this reflects any specific problem with parquet-mr, other than where the filtering is implemented in the read path and the fact that Iceberg has its own expression implementation.
>
> #3) There is active work going on related to page skipping (PARQUET-1201). I believe this may be what you are referring to.
>
> #4) Ideally we would be able to contribute the read/write path implementations back to parquet-mr and get an updated API that can be used for future development.
>
> #5) We do intend to build a native Parquet-to-Arrow read path. Initially this will likely be specific to Iceberg as we iterate on the implementation, but we hope it will be generally usable across multiple engines. For example, both Spark and Presto have custom read paths for Parquet and their own columnar memory formats. We hope that we can build an Iceberg read path that can be used across both and leverage Arrow natively through columnar API abstractions.
>
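[Dan's point #5 is easier to picture with the target format made concrete: a native Parquet-to-Arrow read path would decode Parquet column chunks directly into Arrow vectors instead of materializing rows. Below is a minimal sketch using only the Arrow Java API; the class name, schema, and hand-written fill are illustrative, and the fill merely stands in for the decoding step such a reader would implement.

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ArrowBatchSketch {
  public static void main(String[] args) {
    // Hypothetical two-column schema standing in for a Parquet file's schema.
    Schema schema = new Schema(Arrays.asList(
        Field.nullable("id", new ArrowType.Int(32, true)),
        Field.nullable("name", new ArrowType.Utf8())));

    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
      IntVector id = (IntVector) root.getVector("id");
      VarCharVector name = (VarCharVector) root.getVector("name");

      // A Parquet-to-Arrow reader would decode column chunks straight into
      // these vectors; we fill them by hand here to show the target shape.
      id.allocateNew(2);
      name.allocateNew();
      id.setSafe(0, 1);
      id.setSafe(1, 2);
      name.setSafe(0, "a".getBytes(StandardCharsets.UTF_8));
      name.setSafe(1, "b".getBytes(StandardCharsets.UTF_8));
      root.setRowCount(2);

      System.out.println(root.contentToTSVString());
    }
  }
}

The appeal of this shape is that an engine-agnostic reader can hand a VectorSchemaRoot to Spark, Presto, or anything else able to wrap Arrow buffers in its own columnar abstractions, which is the "columnar API abstractions" idea in #5.]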
Presto and Spark aren't exactly analogous because they have bespoke in-memory formats, so it wouldn't make sense to develop their Parquet serialization/deserialization anywhere else. It would be unfortunate if Parquet-to-Arrow conversion in Java, at the granularity of a single file, were hidden behind Iceberg business logic, so I would encourage you to make the lower-level single-file interface as accessible to general Arrow users as possible.

> -Dan
>
> On Tue, May 28, 2019 at 8:03 AM Wes McKinney <[email protected]> wrote:
>>
>> hi Anton,
>>
>> On point #5, I would suggest doing the work either in Apache Arrow or in the Parquet Java project -- we are developing both the Parquet C++ and Rust codebases within the apache/arrow repository, so I think you would find an active community there. I know that there has been a lot of interest in decoupling from Hadoop-related Java dependencies, so you might also think about how to do that at the same time.
>>
>> - Wes
>>
>> On Tue, May 28, 2019 at 9:53 AM Anton Okolnychyi <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > I see more and more questions around the Iceberg Parquet reader. I think it would be useful to have a thread that clarifies all open questions and explains the long-term plan.
>> >
>> > 1. Am I correct that performance is the main reason to have a custom reader in Iceberg? Are there any other purposes? A common question I get is why not improve parquet-mr instead of writing a new reader. I know that almost every system that cares about performance has its own reader, but why is that?
>> >
>> > 2. Iceberg filters out row groups based on stats and dictionary pages on its own, whereas the Spark reader simply sets filters and relies on parquet-mr to do the filtering. My assumption is that there is a problem in parquet-mr. Is that correct? Is it somehow related to record materialization?
>> >
>> > 3. At some point, Julien Le Dem gave a talk about supporting page skipping in Parquet. His primary example was SELECT a, b FROM t WHERE c = 'smth' -- that is, filtering data in selected columns based on predicates on other columns. It is a highly anticipated feature on our end. Can somebody clarify whether it will be part of parquet-mr or whether we will have to implement it in Iceberg?
>> >
>> > 4. What is the long-term vision for the Parquet reader in Iceberg? Are there any plans to submit parts of it to parquet-mr? Will the Iceberg reader be mostly independent of parquet-mr?
>> >
>> > 5. We are considering reading Parquet data into Arrow. Will it be something specific to Iceberg or generally available? I believe it is a quite common use case.
>> >
>> > Thanks,
>> > Anton
>>
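[To make Anton's question #2 concrete: the Spark path builds a parquet-mr FilterPredicate and hands it to the reader, which then prunes row groups (and, once the PARQUET-1201 column indexes land, pages) on its own, while the Iceberg path reads the footer itself and decides per row group from the column statistics. A minimal sketch of the latter against parquet-mr's public footer API follows; the file name, column name, and skip rule are illustrative, and a real implementation also has to handle nulls, dictionary pages, and the full type/predicate matrix.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.statistics.IntStatistics;
import org.apache.parquet.column.statistics.Statistics;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

import java.io.IOException;

public class RowGroupPruneSketch {
  // True if the row group's min/max stats prove no value of `column`
  // can equal `target`, so the whole group can be skipped unread.
  static boolean canSkip(BlockMetaData block, String column, int target) {
    for (ColumnChunkMetaData chunk : block.getColumns()) {
      if (chunk.getPath().toDotString().equals(column)) {
        Statistics<?> stats = chunk.getStatistics();
        if (stats instanceof IntStatistics && !stats.isEmpty()) {
          IntStatistics s = (IntStatistics) stats;
          return target < s.getMin() || target > s.getMax();
        }
      }
    }
    return false; // no usable stats: must read the row group
  }

  public static void main(String[] args) throws IOException {
    // "data.parquet", column "c", and the value 42 are illustrative.
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path("data.parquet"), new Configuration()))) {
      for (BlockMetaData block : reader.getFooter().getBlocks()) {
        System.out.printf("rows=%d skip=%b%n",
            block.getRowCount(), canSkip(block, "c", 42));
      }
    }
  }
}

The push-down alternative Anton describes for Spark is a one-liner by comparison: build the same predicate with FilterApi.eq(...) and pass FilterCompat.get(predicate) to ParquetReader.Builder#withFilter, leaving the pruning to parquet-mr. Iceberg keeps the logic on its side mainly because, as Dan notes above, it evaluates its own expression implementation.]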
