[ 
https://issues.apache.org/jira/browse/PARQUET-139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-139.
-------------------------------
       Resolution: Fixed
    Fix Version/s: parquet-mr_1.6.0

Issue resolved by pull request 91
[https://github.com/apache/incubator-parquet-mr/pull/91]

> Avoid reading file footers in parquet-avro InputFormat
> ------------------------------------------------------
>
>                 Key: PARQUET-139
>                 URL: https://issues.apache.org/jira/browse/PARQUET-139
>             Project: Parquet
>          Issue Type: Task
>            Reporter: Ryan Blue
>            Assignee: Ryan Blue
>             Fix For: parquet-mr_1.6.0
>
>
> The AvroParquetInputFormat currently relies on the ParquetInputFormat that 
> reads the footers for all of the files that will be processed. This is for 
> two reasons:
> 1. To plan splits (if using client-side splits)
> 2. To get a merged schema for all of the files
> Reading all of the footers is a bottleneck when working with a large number 
> of files and can significantly delay a job because only one machine is 
> working. This should be done in parallel on the task side. PARQUET-84 added 
> the ability to avoid reading footers on the client for split planning, so the 
> difficult task is to avoid reading footers to merge the Parquet schema.
> To avoid merging the Parquet schema, the AvroParquetInputFormat should either 
> use whatever schema a file contains or should reconcile the projection schema 
> with the file schema on the task side.
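The second option in the description, reconciling the projection schema with the file schema on the task side, can be illustrated with a small language-neutral sketch. This is not the parquet-mr API and not necessarily what pull request 91 does: schemas are modeled here as plain dictionaries from field name to type name, and `reconcile_projection` is a hypothetical helper invented for the illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical model of a schema: field name -> type name.
Schema = Dict[str, str]


def reconcile_projection(file_schema: Schema, projection: Schema) -> Tuple[Schema, List[str]]:
    """Reconcile a requested projection with the schema of a single file.

    Returns the columns to read from this file and the projected fields
    that the file does not contain (the reader would have to fill those
    with defaults or nulls).
    """
    to_read: Schema = {}
    missing: List[str] = []
    for name, wanted_type in projection.items():
        if name not in file_schema:
            missing.append(name)
            continue
        actual_type = file_schema[name]
        if actual_type != wanted_type:
            raise ValueError(
                f"field {name!r}: projection wants {wanted_type}, file has {actual_type}"
            )
        to_read[name] = actual_type
    return to_read, missing


# Each task looks only at the footer of the file it is reading,
# so no merged schema has to be computed on the client.
file_schema = {"id": "int64", "name": "binary", "ts": "int64"}
projection = {"id": "int64", "email": "binary"}
columns, missing = reconcile_projection(file_schema, projection)
```

Because every task runs this check against its own file, the footer reads are spread across the cluster instead of being serialized on the machine that submits the job.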



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
