[ https://issues.apache.org/jira/browse/PARQUET-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138299#comment-15138299 ]
Deepak Majeti edited comment on PARQUET-505 at 2/9/16 3:30 PM:
---------------------------------------------------------------

[~wesmckinn] I followed the Impala code. I am adding a unit test as well.

--
regards,
Deepak Majeti

was (Author: mdeepak):
[~wesm] I followed the Impala code. I am adding a unit test as well.

--
regards,
Deepak Majeti

> Column reader: automatically handle large data pages
> ----------------------------------------------------
>
>                 Key: PARQUET-505
>                 URL: https://issues.apache.org/jira/browse/PARQUET-505
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Wes McKinney
>            Assignee: Deepak Majeti
>
> Currently, we only support data pages whose headers are 64K or less (see
> {{parquet/column/serialized-page.cc}}). Since page headers can essentially
> be arbitrarily large (in pathological cases) because of the page
> statistics, if deserializing the page header fails, we should attempt to
> read a progressively larger amount of file data in an effort to find the
> end of the page header.
>
> As part of this (and to make testing easier!), the maximum data page header
> size should be configurable. We can write test cases by defining
> appropriate Statistics structs to yield serialized page headers of whatever
> desired size.
>
> On malformed files we may run past the end of the file; in such cases we
> should raise a reasonable exception.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)