[ https://issues.apache.org/jira/browse/PARQUET-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149557#comment-15149557 ]
Wes McKinney commented on PARQUET-531:
--------------------------------------
[~mdeepak] I just tried reading this file after PARQUET-515 and PARQUET-523
were applied, and it appears the bug lies in the Scanner, so we can leave this
open until we have a test case that reproduces it.
> Can't read past first page in a column
> --------------------------------------
>
> Key: PARQUET-531
> URL: https://issues.apache.org/jira/browse/PARQUET-531
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Environment: Ubuntu Linux 14.04 (no obvious platform dependence),
> Parquet file created by Apache Spark 1.5.0 on the same platform.
> Reporter: Spiro Michaylov
> Attachments:
> part-r-00031-e5d9a4ef-d73e-406c-8c2f-9ad1f20ebf8e.gz.parquet
>
>
> Building the code as of 2/14/2015 and adding the obvious three lines of code
> to serialized-page.cc to enable the newly added CompressionCodec::GZIP:
> {code}
> case parquet::CompressionCodec::GZIP:
> decompressor_.reset(new GZipCodec());
> break;
> {code}
> I try to run the parquet_reader example on the column I'm about to attach,
> which was created by Apache Spark 1.5.0. It works surprisingly well until it
> hits the end of the first page, where it dies with
> {quote}
> Parquet error: Value was non-null, but has not been buffered
> {quote}
> I realize you may be reluctant to look at this because (a) the GZip support
> is new and (b) I had to modify the code to enable it, but actually things
> seem to decompress just fine (congratulations: this is awesome!): looking at
> the problem in the debugger and tracing through a bit it seems to me like the
> buffering is a bit screwed up in general -- some kind of confusion between
> the buffering at the Scanner and Reader levels. I can reproduce the problem
> by reading through just a single column too.
> It fails after 128 rows, which is suspicious given this line in
> column/scanner.h:
> {code}
> DEFAULT_SCANNER_BATCH_SIZE = 128;
> {code}
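
To make the suspected failure mode concrete, here is a minimal, self-contained sketch of a scanner that buffers a fixed-size batch from a reader and has to refill once that batch is used up. It is not the parquet-cpp implementation: ToyReader, ToyScanner, kToyBatchSize and CountScanned are hypothetical names invented for this illustration, and only the batch size of 128 comes from the report above. If the refill step were skipped or mishandled, a scan would stop after exactly 128 values, which matches the reported symptom.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical batch size, mirroring the value 128 quoted in the report.
constexpr std::size_t kToyBatchSize = 128;

// Hypothetical stand-in for a column reader: hands out values in batches.
class ToyReader {
 public:
  explicit ToyReader(std::vector<int64_t> values) : values_(std::move(values)) {}

  // Copies up to batch_size values into out and returns how many were copied.
  std::size_t ReadBatch(std::size_t batch_size, int64_t* out) {
    const std::size_t n = std::min(batch_size, values_.size() - pos_);
    std::copy(values_.begin() + pos_, values_.begin() + pos_ + n, out);
    pos_ += n;
    return n;
  }

 private:
  std::vector<int64_t> values_;
  std::size_t pos_ = 0;
};

// Hypothetical scanner: serves one value at a time out of a buffered batch.
class ToyScanner {
 public:
  explicit ToyScanner(ToyReader* reader)
      : reader_(reader), buffer_(kToyBatchSize) {}

  // Returns false only when the reader has no more values.
  bool Next(int64_t* value) {
    if (offset_ == buffered_) {
      // The current batch is exhausted: refill before handing out a value.
      buffered_ = reader_->ReadBatch(kToyBatchSize, buffer_.data());
      offset_ = 0;
      if (buffered_ == 0) return false;
    }
    *value = buffer_[offset_++];
    return true;
  }

 private:
  ToyReader* reader_;
  std::vector<int64_t> buffer_;
  std::size_t buffered_ = 0;
  std::size_t offset_ = 0;
};

// Scans `total` sequential values, checks their order, returns the count read.
std::size_t CountScanned(std::size_t total) {
  std::vector<int64_t> values(total);
  for (std::size_t i = 0; i < total; ++i) values[i] = static_cast<int64_t>(i);

  ToyReader reader(std::move(values));
  ToyScanner scanner(&reader);

  std::size_t count = 0;
  int64_t v = 0;
  while (scanner.Next(&v)) {
    if (v != static_cast<int64_t>(count)) {
      throw std::runtime_error("value out of order");
    }
    ++count;
  }
  return count;
}
```

A regression test for the real Scanner could follow the same shape: write a column with more values than one batch (and ideally more than one page), scan it to the end, and check that the number of values read matches the number written.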