[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944585#comment-16944585 ]

benj commented on DRILL-7291:
-----------------------------

Indeed, surprisingly, I can't reproduce the problem with the attached file.
But I tried to reproduce the problem with the original file (2.2 GB): all is 
fine with no compression or with snappy compression, but with gzip compression 
it fails (though not exactly in the same way as before):
{code:sql}
ALTER SESSION SET `store.format`='parquet';
ALTER SESSION SET `store.parquet.compression` = 'gzip';
CREATE TABLE ....`file_gzip_pqt` AS (SELECT * FROM ....`file_pqt`);

/* 1.16 (and 1.15) */
SELECT count(*) FROM ....`file_gzip_pqt`; /* OK */
SELECT count(*) FROM ....`file_gzip_pqt` WHERE `Code`=20492; /* NOK */
Error: DATA_READ ERROR: Error reading page data

/* 1.17 */
SELECT count(*) FROM ....`file_gzip_pqt`; /* OK */
SELECT count(*) FROM ....`file_gzip_pqt` WHERE `Code`=20492; /* NOK */
Error: INTERNAL_ERROR ERROR: null
{code}
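
For comparison, the same failing query could also be re-run with the alternate 
reader toggled (store.parquet.use_new_reader, the option already used in the 
description below); I have not verified whether it changes anything, so this is 
only a suggestion:
{code:sql}
ALTER SESSION SET `store.parquet.use_new_reader` = true;
SELECT count(*) FROM ....`file_gzip_pqt` WHERE `Code`=20492;
ALTER SESSION SET `store.parquet.use_new_reader` = false;
SELECT count(*) FROM ....`file_gzip_pqt` WHERE `Code`=20492;
{code}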

Note that {code:sql}SELECT count(*) FROM ....`file_gzip_pqt` WHERE 
COLUMN=20492;{code} gives no error for 2 of the 7 columns (?)

If each column is forced with {code:sql}CAST(COLUMN AS VARCHAR){code} in the 
{code:sql}CREATE TABLE ... AS{code}, there seems to be no problem afterwards; 
a sketch follows below. But I think more investigation is needed. I will try 
to provide more elements.
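
A minimal sketch of that workaround, assuming only the column names that appear 
in this issue (`Code`, `Code2`, `CRC32`; the other 4 columns of the original 
file would be cast the same way):
{code:sql}
ALTER SESSION SET `store.format` = 'parquet';
ALTER SESSION SET `store.parquet.compression` = 'gzip';
CREATE TABLE ....`file_gzip_cast_pqt` AS (
  SELECT CAST(`Code`  AS VARCHAR) AS `Code`,
         CAST(`Code2` AS VARCHAR) AS `Code2`,
         CAST(`CRC32` AS VARCHAR) AS `CRC32`
  FROM ....`file_pqt`
);
/* the filter that failed before now compares against a string */
SELECT count(*) FROM ....`file_gzip_cast_pqt` WHERE `Code` = '20492'; /* seems OK */
{code}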

> parquet with compression gzip doesn't work well
> -----------------------------------------------
>
>                 Key: DRILL-7291
>                 URL: https://issues.apache.org/jira/browse/DRILL-7291
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.15.0, 1.16.0
>            Reporter: benj
>            Priority: Major
>         Attachments: 0_0_0.parquet
>
>
> Creating a parquet file with compression=gzip produces bad results.
> Example:
>  * input: file_pqt (compression=none)
> {code:java}
> ALTER SESSION SET `store.format`='parquet';
> ALTER SESSION SET `store.parquet.compression` = 'snappy';
> CREATE TABLE ....`file_snappy_pqt` 
>  AS(SELECT * FROM ....`file_pqt`);
> ALTER SESSION SET `store.parquet.compression` = 'gzip';
> CREATE TABLE ....`file_gzip_pqt` 
>  AS(SELECT * FROM ....`file_pqt`);{code}
> Then compare the content of the different parquet files:
> {code:java}
> ALTER SESSION SET `store.parquet.use_new_reader` = true;
> SELECT COUNT(*) FROM ....`file_pqt`;        => 15728036
> SELECT COUNT(*) FROM ....`file_snappy_pqt`; => 15728036
> SELECT COUNT(*) FROM ....`file_gzip_pqt`;   => 15728036
> => OK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code` = '';        => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code` = ''; => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code` = '';   => 14744966
> => NOK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code2` = '';        => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code2` = ''; => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code2` = '';   => 14744921
> => NOK{code}
> _(There are no NULL values in these files.)_
>  _(With exec.storage.enable_v3_text_reader=true it gives the same results.)_
> So even though the parquet file contains the right number of rows, the values 
> in the different columns are not identical.
> Some "random" values of the _gzip parquet_ are reduced to empty strings.
> I think the problem is from the reader and not the writer because:
> {code:java}
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `CRC32` = 'B33D600C';      => 2
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `CRC32` = 'B33D600C'; => 0
> {code}
> but
> {code:java}
> hadoop jar parquet-tools-1.10.0.jar cat file_gzip_pqt/1_0_0.parquet | grep -c 
> "B33D600C"
> 2019-06-12 11:45:23,738 INFO hadoop.InternalParquetRecordReader: RecordReader 
> initialized will read a total of 3597092 records.
> 2019-06-12 11:45:23,739 INFO hadoop.InternalParquetRecordReader: at row 0. 
> reading next block
> 2019-06-12 11:45:23,804 INFO zlib.ZlibFactory: Successfully loaded & 
> initialized native-zlib library
> 2019-06-12 11:45:23,805 INFO compress.CodecPool: Got brand-new decompressor 
> [.gz]
> 2019-06-12 11:45:23,815 INFO hadoop.InternalParquetRecordReader: block read 
> in memory in 76 ms. row count = 3597092
> 2
> {code}
>  So the values are indeed present in the _Apache Parquet_ file but can't be 
> read back correctly via _Apache Drill_.
> Attached is an extract (the original file is 2.2 GB) which produces the same 
> behaviour.
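> To double-check that the writer really used gzip, the file metadata (including 
> the compression codec per column) can be dumped with the same parquet-tools 
> jar; the meta subcommand exists in parquet-tools 1.10.0 (output not reproduced 
> here):
> {code:java}
> hadoop jar parquet-tools-1.10.0.jar meta file_gzip_pqt/1_0_0.parquet
> {code}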


