[jira] [Comment Edited] (DRILL-7291) parquet with compression gzip doesn't work well

benj (Jira) Mon, 07 Oct 2019 02:56:57 -0700


    [ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944585#comment-16944585 ]


benj edited comment on DRILL-7291 at 10/7/19 9:55 AM:
------------------------------------------------------

Indeed, surprisingly, I can't reproduce the problem with the attached file.
But I tried again with the original file (2.2 GB): everything is fine with no 
compression or with snappy compression, but with gzip compression it fails 
(though not in exactly the same way as before):
{code:sql}
ALTER SESSION SET `store.format`='parquet';
ALTER SESSION SET `store.parquet.compression` = 'gzip';
CREATE TABLE ....`file_gzip_pqt` AS (SELECT * FROM ....`file_pqt`);

/* 1.16 (and 1.15) */
SELECT count(*) FROM ....`file_gzip_pqt`; /* OK */
SELECT count(*) FROM ....`file_gzip_pqt` WHERE `Code`=20492; /* NOK */
Error: DATA_READ ERROR: Error reading page data

/* 1.17 */
SELECT count(*) FROM ....`file_gzip_pqt`; /* OK */
SELECT count(*) FROM ....`file_gzip_pqt` WHERE `Code`=20492; /* NOK */
Error: INTERNAL_ERROR ERROR: null
{code}

Note that 
{code:sql}
SELECT count( * ) FROM ....`file_gzip_pqt` WHERE SpecialCode =' '; /* OK */
SELECT count( * ) FROM ....`file_gzip_pqt` WHERE SpecialCode <> ''; /* NOK */
{code} 
But maybe that is because all the values of the 'SpecialCode' column are empty ("").
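
One way to check that guess would be to look at the value distribution of the column in the uncompressed source. This is only a sketch, not run against the original data; the alias `nb` is just an illustrative name and the elided path is left as in the queries above. It would also show whether the values are really empty strings ('') or contain a space (' '):
{code:sql}
/* Sketch: most frequent values of SpecialCode in the uncompressed source */
SELECT SpecialCode, count(*) AS nb
FROM ....`file_pqt`
GROUP BY SpecialCode
ORDER BY nb DESC
LIMIT 10;
{code}
If a single empty value accounts for all rows there, the differing results on `file_gzip_pqt` would not by themselves say anything about that column's content.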


was (Author: benj641):
Indeed, surprisingly, I can't reproduce the problem with the attached file.
But I tried again with the original file (2.2 GB): everything is fine with no 
compression or with snappy compression, but with gzip compression it fails 
(though not in exactly the same way as before):
{code:sql}
ALTER SESSION SET `store.format`='parquet';
ALTER SESSION SET `store.parquet.compression` = 'gzip';
CREATE TABLE ....`file_gzip_pqt` AS (SELECT * FROM ....`file_pqt`);

/* 1.16 (and 1.15) */
SELECT count(*) FROM ....`file_gzip_pqt`; /* OK */
SELECT count(*) FROM ....`file_gzip_pqt` WHERE `Code`=20492; /* NOK */
Error: DATA_READ ERROR: Error reading page data

/* 1.17 */
SELECT count(*) FROM ....`file_gzip_pqt`; /* OK */
SELECT count(*) FROM ....`file_gzip_pqt` WHERE `Code`=20492; /* NOK */
Error: INTERNAL_ERROR ERROR: null
{code}

Note that {code:sql}SELECT count( * ) FROM ....`file_gzip_pqt` WHERE 
COLUMN=20492;{code} gives no error for 2 of the 7 columns (?)

> parquet with compression gzip doesn't work well
> -----------------------------------------------
>
>                 Key: DRILL-7291
>                 URL: https://issues.apache.org/jira/browse/DRILL-7291
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.15.0, 1.16.0
>            Reporter: benj
>            Priority: Major
>         Attachments: 0_0_0.parquet
>
>
> Creating a Parquet file with compression=gzip produces bad results.
> Example:
>  * input: file_pqt (compression=none)
> {code:java}
> ALTER SESSION SET `store.format`='parquet';
> ALTER SESSION SET `store.parquet.compression` = 'snappy';
> CREATE TABLE ....`file_snappy_pqt` 
>  AS(SELECT * FROM ....`file_pqt`);
> ALTER SESSION SET `store.parquet.compression` = 'gzip';
> CREATE TABLE ....`file_gzip_pqt` 
>  AS(SELECT * FROM ....`file_pqt`);{code}
> Then compare the content of the different parquet files:
> {code:java}
> ALTER SESSION SET `store.parquet.use_new_reader` = true;
> SELECT COUNT(*) FROM ....`file_pqt`;        => 15728036
> SELECT COUNT(*) FROM ....`file_snappy_pqt`; => 15728036
> SELECT COUNT(*) FROM ....`file_gzip_pqt`;   => 15728036
> => OK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code` = '';        => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code` = ''; => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code` = '';   => 14744966
> => NOK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code2` = '';        => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code2` = ''; => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code2` = '';   => 14744921
> => NOK{code}
> _(There is no NULL value in these files.)_
>  _(With exec.storage.enable_v3_text_reader=true it gives same results)_
> So although the Parquet file contains the right number of rows, the values in 
> the different columns are not identical.
> Some "random" values of the _gzip parquet_ are reduced to empty strings.
> I think the problem comes from the reader and not the writer because:
> {code:java}
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `CRC32` = 'B33D600C';      => 2
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `CRC32` = 'B33D600C'; => 0
> {code}
> but
> {code:java}
> hadoop jar parquet-tools-1.10.0.jar cat file_gzip_pqt/1_0_0.parquet | grep -c "B33D600C"
> 2019-06-12 11:45:23,738 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3597092 records.
> 2019-06-12 11:45:23,739 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
> 2019-06-12 11:45:23,804 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2019-06-12 11:45:23,805 INFO compress.CodecPool: Got brand-new decompressor [.gz]
> 2019-06-12 11:45:23,815 INFO hadoop.InternalParquetRecordReader: block read in memory in 76 ms. row count = 3597092
> 2
> {code}
>  So the values are indeed present in the _Apache Parquet_ file but can't be 
> read via _Apache Drill_.
> Attached is an extract (the original file is 2.2 GB) which produces the same 
> behaviour.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
