[ 
https://issues.apache.org/jira/browse/PARQUET-188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343916#comment-14343916
 ] 

Ryan Blue commented on PARQUET-188:
-----------------------------------

The problem here is actually that the column chunk metadata within a row 
group's metadata is out of order, not that the column chunks themselves are 
out of order. Impala's implementation correctly uses the row group offset, but 
it decides which offset to use based on the order of the column metadata 
rather than on column names. The format spec doesn't require an order for 
columns in the metadata, so we should either add one or update Impala to not 
assume there is an order. I strongly prefer the former: let's add to the spec 
that the order of the column metadata should match the schema.
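
To illustrate the alternative (a reader that does not assume an order), here is 
a minimal sketch that resolves column chunk offsets by column path instead of 
by position in the metadata list. ChunkMeta, ColumnChunkLookup and 
offsetsInSchemaOrder are hypothetical names made up for this example; this is 
not Impala's or parquet-mr's actual code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one column chunk's metadata entry in a row group.
// This is NOT the parquet-mr metadata class, just an illustration.
final class ChunkMeta {
    final String path;      // dotted column path, e.g. "a.b"
    final long fileOffset;  // where the column chunk starts in the file

    ChunkMeta(String path, long fileOffset) {
        this.path = path;
        this.fileOffset = fileOffset;
    }
}

final class ColumnChunkLookup {
    // Returns the chunk offsets in schema order by looking each column up
    // by path, so the order of the metadata entries does not matter.
    static List<Long> offsetsInSchemaOrder(List<String> schemaPaths,
                                           List<ChunkMeta> chunks) {
        Map<String, ChunkMeta> byPath = new HashMap<>();
        for (ChunkMeta chunk : chunks) {
            byPath.put(chunk.path, chunk);
        }
        List<Long> offsets = new ArrayList<>();
        for (String path : schemaPaths) {
            ChunkMeta chunk = byPath.get(path);
            if (chunk == null) {
                throw new IllegalArgumentException("no column chunk for " + path);
            }
            offsets.add(chunk.fileOffset);
        }
        return offsets;
    }
}
```

A reader that indexes the metadata list positionally would pick the wrong 
offset whenever the writer emits the entries in a different order than the 
schema, which is the behavior described above.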

The change seems to have happened here: 
https://github.com/apache/incubator-parquet-mr/commit/ccc29e4dde24584118211f27c71bb01bacc39326#diff-e07dbce51b4235a8510fc220b9bf4b48R216

I think we just need to patch that and the ordering will be back to normal. 
PageFormatV2 has not shipped in any formal release or release candidate (RC4 
was cut in November; the commit above landed in December), so we should be 
able to fix this without a problem.

> Parquet writes columns out of order (compared to the schema)
> ------------------------------------------------------------
>
>                 Key: PARQUET-188
>                 URL: https://issues.apache.org/jira/browse/PARQUET-188
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>            Reporter: Colin Marc
>
> When building from master, Parquet seems to write row groups with the columns 
> in arbitrary order, not in the same order as the schema. This appears to 
> happen regardless of the OutputFormat or WriteSupport used.
> This breaks implementations that assume the columns will be in a specific 
> order, in particular Impala.



