[ 
https://issues.apache.org/jira/browse/PARQUET-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030160#comment-17030160
 ] 

Deepak Majeti commented on PARQUET-1781:
----------------------------------------

Even though the 1.3 writer wrote "min_value" and "max_value" along with the 
old "min" and "max", the new statistics are not valid, because the column 
order is not set as the Parquet spec requires. In a way, it is a bug in the 
1.3 reader that it returns the new stats without verifying the column order. 
The reader in 1.4 does the right thing.
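
To make the difference between the two readers concrete, here is a minimal 
sketch of the two checks. The types and function names below are simplified, 
hypothetical stand-ins, not the real parquet-cpp classes; only the two 
conditions themselves come from the issue description.

```cpp
// Simplified stand-ins for illustration only; these are NOT the real
// parquet-cpp types.
enum class SortOrderFlag { kUndefined, kTypeDefinedOrder };

struct StatsFields {
  bool has_min_value = false;  // mirrors statistics.__isset.min_value
  bool has_max_value = false;  // mirrors statistics.__isset.max_value
};

// 1.3.* style check: trust the new stats whenever either field is present.
inline bool TrustsNewStatsByPresence(const StatsFields& s) {
  return s.has_max_value || s.has_min_value;
}

// 1.4.0+ style check: trust the new stats only when the column order is
// explicitly set to the type-defined order.
inline bool TrustsNewStatsByColumnOrder(SortOrderFlag order) {
  return order == SortOrderFlag::kTypeDefinedOrder;
}
```

For a file written by 1.3.* (both fields present, column order not set), the 
first check accepts the new stats and the second one rejects them.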

> [C++] 1.4.0+ reader ignore stats created by 1.3.* writer
> --------------------------------------------------------
>
>                 Key: PARQUET-1781
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1781
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>    Affects Versions: cpp-1.4.0, cpp-1.5.0
>            Reporter: Milos Sukovic
>            Priority: Major
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> [https://github.com/apache/arrow/commit/d257a88ed612301c0411894dfa783fcbff1bc867]
> In the referenced commit, a change to metadata.cc altered how the reader 
> checks whether the new stats (min_value/max_value) are used.
> From
> if (metadata.statistics.__isset.max_value || 
> metadata.statistics.__isset.min_value)
> to
> if (descr->column_order().get_order() == ColumnOrder::TYPE_DEFINED_ORDER)
>
> This change breaks backward compatibility: all files that contain the new 
> stats (min_value/max_value) and were created before this change are valid, 
> but they do not set the column order flag.
> After this change, those stats are ignored because the column order flag is 
> checked.
> A possible fix would be something like:
> if (descr->column_order().get_order() == ColumnOrder::TYPE_DEFINED_ORDER || 
> (version == parquetcpp 1.3.* && (metadata.statistics.__isset.max_value || 
> metadata.statistics.__isset.min_value)))
> I checked parquet-mr, and it seems that columnOrder was introduced there as 
> part of the same change as min_value and max_value, so the issue shouldn't 
> occur for files created by the Java code. However, the parquet-mr reader 
> probably also ignores stats in files created by parquet-cpp 1.3.*.
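
The fallback proposed in the description could be sketched roughly as 
follows. This is only an illustration: the helper names are hypothetical, and 
it assumes the writer identifies itself with a created_by string of the form 
"parquet-cpp version 1.3.x", which should be verified against real files. 
Whether such a fallback is desirable at all is the open question in this 
issue.

```cpp
#include <string>

// Simplified stand-in for the column order flag; not the real parquet-cpp type.
enum class Order { kUndefined, kTypeDefinedOrder };

// Hypothetical helper. ASSUMPTION: created_by starts with
// "parquet-cpp version 1.3." for files written by parquet-cpp 1.3.*.
inline bool WrittenByParquetCpp13(const std::string& created_by) {
  const std::string prefix = "parquet-cpp version 1.3.";
  return created_by.compare(0, prefix.size(), prefix) == 0;
}

// Mirrors the condition proposed in the issue: accept the new stats when the
// column order is type-defined, or when a 1.3.* writer produced the file and
// at least one of min_value/max_value is present.
inline bool ShouldUseNewStats(Order order, const std::string& created_by,
                              bool has_min_value, bool has_max_value) {
  if (order == Order::kTypeDefinedOrder) return true;
  return WrittenByParquetCpp13(created_by) && (has_min_value || has_max_value);
}
```

With this condition, a file from a 1.3.* writer that carries min_value or 
max_value would have its new stats used even though the column order is 
unset, while files from other writers would still need the column order flag.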


