[ 
https://issues.apache.org/jira/browse/PARQUET-139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233755#comment-14233755
 ] 

Ryan Blue commented on PARQUET-139:
-----------------------------------

I've added a work-in-progress branch as a preliminary pull request, 
[#91|https://github.com/apache/incubator-parquet-mr/pull/91]. Comments on the 
approach are welcome.

> Avoid reading file footers in parquet-avro InputFormat
> ------------------------------------------------------
>
>                 Key: PARQUET-139
>                 URL: https://issues.apache.org/jira/browse/PARQUET-139
>             Project: Parquet
>          Issue Type: Task
>            Reporter: Ryan Blue
>
> The AvroParquetInputFormat currently relies on the ParquetInputFormat that 
> reads the footers for all of the files that will be processed. This is for 
> two reasons:
> 1. To plan splits (if using client-side splits)
> 2. To get a merged schema for all of the files
> Reading all of the footers is a bottleneck when working with a large number 
> of files and can significantly delay a job because only one machine is 
> working. The footers should instead be read in parallel on the task side. 
> PARQUET-84 added the ability to avoid reading footers on the client for 
> split planning, so the remaining difficulty is to avoid reading footers to 
> merge the Parquet schema.
> To avoid merging the Parquet schema, the AvroParquetInputFormat should either 
> use whatever schema a file contains or should reconcile the projection schema 
> with the file schema on the task side.
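
The two options in the description (use whatever schema the file contains, or
reconcile the projection with the file schema on the task side) can be
illustrated with a minimal sketch. This is not the parquet-mr API and not the
approach taken in the pull request: the Schema type (a plain mapping of field
name to type name) and the reconcile_projection function are hypothetical
stand-ins, and skipping fields that a file lacks is an assumption made only
for illustration.

```python
from typing import Dict, Optional

# Hypothetical, simplified stand-in for a schema: field name -> type name.
# A real Parquet or Avro schema is nested and carries more metadata.
Schema = Dict[str, str]


def reconcile_projection(file_schema: Schema,
                         projection: Optional[Schema]) -> Schema:
    """Decide which columns one task reads from one file.

    Runs per file on the task side, so no client-side merge of every
    file's footer is needed.
    """
    if projection is None:
        # Option 1: no projection requested, use whatever the file contains.
        return dict(file_schema)

    # Option 2: reconcile the requested projection with this file's schema.
    reconciled: Schema = {}
    for name, wanted_type in projection.items():
        if name not in file_schema:
            # Assumption: a field missing from this file is simply skipped.
            continue
        if file_schema[name] != wanted_type:
            raise ValueError(
                f"field {name!r}: file has {file_schema[name]}, "
                f"projection wants {wanted_type}"
            )
        reconciled[name] = wanted_type
    return reconciled
```

Because each call needs only one file's schema, the work is spread across
tasks instead of being done by a single client before the job starts.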



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
