[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
benj updated DRILL-7291:
------------------------
    Component/s: Documentation

> parquet with compression gzip doesn't work well
> -----------------------------------------------
>
>                 Key: DRILL-7291
>                 URL: https://issues.apache.org/jira/browse/DRILL-7291
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Documentation, Storage - Parquet
>    Affects Versions: 1.15.0, 1.16.0
>            Reporter: benj
>            Priority: Major
>         Attachments: 0_0_0.parquet, short_no_binary_quote.csvh, sqlline_error.log
>
>
> Creating a Parquet file with compression=gzip produces bad results.
> Example:
> * input: file_pqt (compression=none)
> {code:java}
> ALTER SESSION SET `store.format`='parquet';
> ALTER SESSION SET `store.parquet.compression` = 'snappy';
> CREATE TABLE ....`file_snappy_pqt`
>  AS(SELECT * FROM ....`file_pqt`);
> ALTER SESSION SET `store.parquet.compression` = 'gzip';
> CREATE TABLE ....`file_gzip_pqt`
>  AS(SELECT * FROM ....`file_pqt`);{code}
> Then compare the content of the different Parquet files:
> {code:java}
> ALTER SESSION SET `store.parquet.use_new_reader` = true;
> SELECT COUNT(*) FROM ....`file_pqt`; -- => 15728036
> SELECT COUNT(*) FROM ....`file_snappy_pqt`; -- => 15728036
> SELECT COUNT(*) FROM ....`file_gzip_pqt`; -- => 15728036
> -- => OK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code` = ''; -- => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code` = ''; -- => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code` = ''; -- => 14744966
> -- => NOK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code2` = ''; -- => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code2` = ''; -- => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code2` = ''; -- => 14744921
> -- => NOK{code}
> _(There is no NULL value in these files.)_
> _(With exec.storage.enable_v3_text_reader=true it gives the same results.)_
> So although the gzip Parquet file contains the right number of rows, the values in its columns are not identical to those in the source file.
> Some "random" values in the _gzip Parquet_ file are reduced to empty strings.
> I think the problem comes from the reader and not the writer, because:
> {code:java}
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `CRC32` = 'B33D600C'; -- => 2
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `CRC32` = 'B33D600C'; -- => 0
> {code}
> but
> {code:java}
> hadoop jar parquet-tools-1.10.0.jar cat file_gzip_pqt/1_0_0.parquet | grep -c "B33D600C"
> 2019-06-12 11:45:23,738 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3597092 records.
> 2019-06-12 11:45:23,739 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
> 2019-06-12 11:45:23,804 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2019-06-12 11:45:23,805 INFO compress.CodecPool: Got brand-new decompressor [.gz]
> 2019-06-12 11:45:23,815 INFO hadoop.InternalParquetRecordReader: block read in memory in 76 ms. row count = 3597092
> 2
> {code}
> So the values are indeed present in the _Apache Parquet_ file but cannot be read via _Apache Drill_.
> Attached is an extract (the original file is 2.2 GB) that reproduces the same behaviour.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)