[ https://issues.apache.org/jira/browse/PARQUET-139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233755#comment-14233755 ]
Ryan Blue commented on PARQUET-139:
-----------------------------------
I've added a work-in-progress branch as a preliminary pull request,
[#91|https://github.com/apache/incubator-parquet-mr/pull/91]. Comments on the
approach are welcome.
> Avoid reading file footers in parquet-avro InputFormat
> ------------------------------------------------------
>
> Key: PARQUET-139
> URL: https://issues.apache.org/jira/browse/PARQUET-139
> Project: Parquet
> Issue Type: Task
> Reporter: Ryan Blue
>
> The AvroParquetInputFormat currently relies on the ParquetInputFormat, which
> reads the footers for all of the files that will be processed. It does this
> for two reasons:
> 1. To plan splits (if using client-side splits)
> 2. To get a merged schema for all of the files
> Reading all of the footers is a bottleneck when working with a large number
> of files and can significantly delay a job because only one machine, the
> client, is working. This should be done in parallel on the task side.
> PARQUET-84 added the ability to avoid reading footers on the client for
> split planning, so the remaining difficulty is avoiding footer reads to
> merge the Parquet schema.
> To avoid merging the Parquet schema, the AvroParquetInputFormat should either
> use whatever schema a file contains or should reconcile the projection schema
> with the file schema on the task side.
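
The second option in the description, reconciling the projection schema with the file schema on the task side, could look roughly like the sketch below. This is only an illustration under stated assumptions: schemas are modeled as plain maps from field name to type name, and the class SchemaReconcileSketch and its reconcile method are hypothetical, not part of parquet-mr or parquet-avro. Each task would run this against the one file it reads, so no merged schema is needed on the client.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: schemas are modeled as ordered maps from field name to
// type name. Real Parquet and Avro schema classes are deliberately not used.
public class SchemaReconcileSketch {

    // Keep the projected fields that this file actually contains. Fields that
    // are missing from the file are skipped, on the assumption that the reader
    // fills them with defaults later. A type conflict is reported as an error.
    public static Map<String, String> reconcile(Map<String, String> projection,
                                                Map<String, String> fileSchema) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : projection.entrySet()) {
            String fileType = fileSchema.get(field.getKey());
            if (fileType == null) {
                continue; // field is not present in this file
            }
            if (!fileType.equals(field.getValue())) {
                throw new IllegalArgumentException(
                    "Type mismatch for field " + field.getKey()
                    + ": projection wants " + field.getValue()
                    + ", file has " + fileType);
            }
            result.put(field.getKey(), fileType);
        }
        return result;
    }

    public static void main(String[] args) {
        // Projection requested by the job (assumed example fields).
        Map<String, String> projection = new LinkedHashMap<>();
        projection.put("id", "int64");
        projection.put("name", "binary");
        projection.put("email", "binary");

        // Schema found in the footer of the single file this task reads.
        Map<String, String> fileSchema = new LinkedHashMap<>();
        fileSchema.put("id", "int64");
        fileSchema.put("name", "binary");
        fileSchema.put("age", "int32");

        // prints {id=int64, name=binary}
        System.out.println(reconcile(projection, fileSchema));
    }
}
```

Whether a missing field should be skipped, filled with a default, or treated as an error is a design choice this sketch does not settle.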