[jira] [Commented] (PARQUET-531) Can't read past first page in a column

Wes McKinney (JIRA) Tue, 16 Feb 2016 16:14:51 -0800

    [ 
https://issues.apache.org/jira/browse/PARQUET-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149557#comment-15149557
 ]


Wes McKinney commented on PARQUET-531:
--------------------------------------

[~mdeepak] I just tried reading this file after PARQUET-515 and PARQUET-523 
were applied, and it appears the bug lies in the Scanner, so we can leave this 
issue open until we have a test case that reproduces it.

> Can't read past first page in a column
> --------------------------------------
>
>                 Key: PARQUET-531
>                 URL: https://issues.apache.org/jira/browse/PARQUET-531
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>         Environment: Ubuntu Linux 14.04 (no obvious platform dependence), 
> Parquet file created by Apache Spark 1.5.0 on the same platform. 
>            Reporter: Spiro Michaylov
>         Attachments: 
> part-r-00031-e5d9a4ef-d73e-406c-8c2f-9ad1f20ebf8e.gz.parquet
>
>
> Building the code as of 2/14/2016 and adding the obvious three lines of code 
> to serialized-page.cc to enable the newly added CompressionCodec::GZIP:
> {code}
>      case parquet::CompressionCodec::GZIP:
>        decompressor_.reset(new GZipCodec());
>        break;
> {code}
> I tried to run the parquet_reader example on the file I'm about to attach, 
> which was created by Apache Spark 1.5.0. It works surprisingly well until it 
> hits the end of the first page, where it dies with
> {quote}
> Parquet error: Value was non-null, but has not been buffered
> {quote}
> I realize you may be reluctant to look at this because (a) the GZip support 
> is new and (b) I had to modify the code to enable it, but actually things 
> seem to decompress just fine (congratulations: this is awesome!): looking at 
> the problem in the debugger and tracing through a bit it seems to me like the 
> buffering is a bit screwed up in general -- some kind of confusion between 
> the buffering at the Scanner and Reader levels. I can reproduce the problem 
> by reading through just a single column too. 
> It fails after 128 rows, which is suspicious given this line in 
> column/scanner.h:
> {code}
>     DEFAULT_SCANNER_BATCH_SIZE = 128;
> {code}
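
For reference while a real reproduction is pending, here is a minimal sketch of
the batch-refill pattern that the 128-row failure points at. It is not the
parquet-cpp code: ToyReader, ToyScanner, kBatchSize and CountScannedInOrder are
hypothetical names made up for illustration, and the "pages" are plain vectors
with no levels or compression. The only point it shows is that a scanner which
buffers a fixed batch has to refill from the reader once the batch is used up,
and that the reader has to carry its position across page boundaries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a column reader: serves values page by page.
class ToyReader {
 public:
  explicit ToyReader(std::vector<std::vector<int64_t>> pages)
      : pages_(std::move(pages)) {}

  // Copies up to batch_size values into out, moving on to the next page
  // whenever the current one is exhausted. Returns the number of values
  // copied, which is 0 only at the end of the column.
  size_t ReadBatch(size_t batch_size, int64_t* out) {
    assert(out != nullptr);
    size_t copied = 0;
    while (copied < batch_size && page_ < pages_.size()) {
      const std::vector<int64_t>& page = pages_[page_];
      if (pos_ == page.size()) {  // current page is done, advance
        ++page_;
        pos_ = 0;
        continue;
      }
      size_t n = std::min(batch_size - copied, page.size() - pos_);
      std::copy(page.begin() + pos_, page.begin() + pos_ + n, out + copied);
      copied += n;
      pos_ += n;
    }
    return copied;
  }

 private:
  std::vector<std::vector<int64_t>> pages_;
  size_t page_ = 0;  // index of the current page
  size_t pos_ = 0;   // offset of the next unread value in that page
};

// Hypothetical stand-in for a scanner: hands out one value at a time from
// a fixed-size buffer that it refills from the reader.
class ToyScanner {
 public:
  static constexpr size_t kBatchSize = 128;  // mirrors the value quoted above

  explicit ToyScanner(ToyReader* reader)
      : reader_(reader), buffer_(kBatchSize) {}

  // Returns false once the column is exhausted.
  bool Next(int64_t* value) {
    if (pos_ == buffered_) {  // buffer used up: refill before reading
      buffered_ = reader_->ReadBatch(kBatchSize, buffer_.data());
      pos_ = 0;
      if (buffered_ == 0) return false;
    }
    *value = buffer_[pos_++];
    return true;
  }

 private:
  ToyReader* reader_;
  std::vector<int64_t> buffer_;
  size_t buffered_ = 0;  // number of valid values currently in buffer_
  size_t pos_ = 0;       // next value to hand out
};

// Builds num_pages pages of page_size consecutive integers, scans them, and
// returns how many values came back in the expected order.
size_t CountScannedInOrder(size_t num_pages, size_t page_size) {
  std::vector<std::vector<int64_t>> pages(num_pages);
  int64_t next = 0;
  for (auto& page : pages) {
    for (size_t i = 0; i < page_size; ++i) page.push_back(next++);
  }
  ToyReader reader(std::move(pages));
  ToyScanner scanner(&reader);
  size_t count = 0;
  int64_t value = 0;
  while (scanner.Next(&value) && value == static_cast<int64_t>(count)) ++count;
  return count;
}
```

Calling CountScannedInOrder(3, 100) scans 300 values across three toy pages, so
both the 128-value batch boundary and the page boundaries get crossed. A test
against the real Scanner would presumably want the same shape: more rows than
one batch, spread over more than one page.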



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)