[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949501#comment-16949501 ]
benj commented on DRILL-7291:
-----------------------------

I did exactly the same as you (see below), but obtain an "+Error: INTERNAL_ERROR ERROR: null+" at the end:
{code:sql}
./drill-embedded
Apache Drill 1.17.0-SNAPSHOT
"A Drill in the hand is better than two in the bush."
apache drill> ALTER SESSION SET `store.parquet.compression` = 'gzip';
+------+------------------------------------+
|  ok  |              summary               |
+------+------------------------------------+
| true | store.parquet.compression updated. |
+------+------------------------------------+
1 row selected (0.339 seconds)
apache drill> select * from dfs.tmp.`short_no_binary_quote.csvh`;
+------------------------------------------+----------------------------------+----------+--------------------------------+-------+-------+-----+----+
|                   sha1                   |               md5                |  crc32   |               fn               |  fs   |  pc   | osc | sc |
+------------------------------------------+----------------------------------+----------+--------------------------------+-------+-------+-----+----+
| 0000000F8527DCCAB6642252BBCFA1B8072D33EE | 68CE322D8A896B6E4E7E3F18339EC85C | E39149E4 | Blended_Coolers_Vanilla_NL.png | 30439 | 19042 | 362 |    |
| 00000091728653B7D55DF30BFAFE86C52F2F4A59 | 81AE5D302A0E6D33182CB69ED791181C | 5594C3B0 | ic_menu_notifications.png      | 366   | 21386 | 362 |    |
| 0000065F1900120613745CC5E25A57C84624DC2B | AEB7C147EF7B7CEE91807B500A378BA4 | 24400952 | points_program_fragment.xml    | 1684  | 21842 | 362 |    |
+------------------------------------------+----------------------------------+----------+--------------------------------+-------+-------+-----+----+
3 rows selected (1.008 seconds)
apache drill> use dfs.tmp;
+------+-------------------------------------+
|  ok  |               summary               |
+------+-------------------------------------+
| true | Default schema changed to [dfs.tmp] |
+------+-------------------------------------+
1 row selected (0.087 seconds)
apache drill (dfs.tmp)> create table t as select * from dfs.tmp.`short_no_binary_quote.csvh`;
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 0_0      | 3                         |
+----------+---------------------------+
1 row selected (0.306 seconds)
apache drill (dfs.tmp)> select * from t;
Error: INTERNAL_ERROR ERROR: null

Fragment 0:0
{code}
But the parquet file itself seems OK:
{code:bash}
hadoop jar parquet-tools-1.10.0.jar cat /tmp/t/0_0_0.parquet
2019-10-11 15:48:14,047 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3 records.
2019-10-11 15:48:14,048 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
2019-10-11 15:48:14,063 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2019-10-11 15:48:14,063 INFO compress.CodecPool: Got brand-new decompressor [.gz]
2019-10-11 15:48:14,068 INFO hadoop.InternalParquetRecordReader: block read in memory in 19 ms. row count = 3
sha1 = 0000000F8527DCCAB6642252BBCFA1B8072D33EE
md5 = 68CE322D8A896B6E4E7E3F18339EC85C
crc32 = E39149E4
fn = Blended_Coolers_Vanilla_NL.png
fs = 30439
pc = 19042
osc = 362
sc =

sha1 = 00000091728653B7D55DF30BFAFE86C52F2F4A59
md5 = 81AE5D302A0E6D33182CB69ED791181C
crc32 = 5594C3B0
fn = ic_menu_notifications.png
fs = 366
pc = 21386
osc = 362
sc =

sha1 = 0000065F1900120613745CC5E25A57C84624DC2B
md5 = AEB7C147EF7B7CEE91807B500A378BA4
crc32 = 24400952
fn = points_program_fragment.xml
fs = 1684
pc = 21842
osc = 362
sc =

hadoop jar parquet-tools-1.10.0.jar meta /tmp/t/0_0_0.parquet
2019-10-11 15:51:25,255 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
2019-10-11 15:51:25,256 INFO hadoop.ParquetFileReader: reading another 1 footers
2019-10-11 15:51:25,256 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
file:        file:/tmp/t/0_0_0.parquet
creator:     parquet-mr version 1.10.0 (build ${buildNumber})
extra:       drill-writer.version = 3
extra:       drill.version = 1.17.0-SNAPSHOT

file schema: root
--------------------------------------------------------------------------------
sha1:  REQUIRED BINARY O:UTF8 R:0 D:0
md5:   REQUIRED BINARY O:UTF8 R:0 D:0
crc32: REQUIRED BINARY O:UTF8 R:0 D:0
fn:    REQUIRED BINARY O:UTF8 R:0 D:0
fs:    REQUIRED BINARY O:UTF8 R:0 D:0
pc:    REQUIRED BINARY O:UTF8 R:0 D:0
osc:   REQUIRED BINARY O:UTF8 R:0 D:0
sc:    REQUIRED BINARY O:UTF8 R:0 D:0

row group 1: RC:3 TS:914 OFFSET:4
--------------------------------------------------------------------------------
sha1:  BINARY GZIP DO:0 FPO:4 SZ:210/239/1,14 VC:3 ENC:BIT_PACKED,PLAIN ST:[min: 0000000F8527DCCAB6642252BBCFA1B8072D33EE, max: 0000065F1900120613745CC5E25A57C84624DC2B, num_nulls: 0]
md5:   BINARY GZIP DO:0 FPO:214 SZ:189/199/1,05 VC:3 ENC:BIT_PACKED,PLAIN ST:[min: 68CE322D8A896B6E4E7E3F18339EC85C, max: AEB7C147EF7B7CEE91807B500A378BA4, num_nulls: 0]
crc32: BINARY GZIP DO:0 FPO:403 SZ:93/77/0,83 VC:3 ENC:BIT_PACKED,PLAIN ST:[min: 24400952, max: E39149E4, num_nulls: 0]
fn:    BINARY GZIP DO:0 FPO:496 SZ:186/178/0,96 VC:3 ENC:BIT_PACKED,PLAIN ST:[min: Blended_Coolers_Vanilla_NL.png, max: points_program_fragment.xml, num_nulls: 0]
fs:    BINARY GZIP DO:0 FPO:682 SZ:72/56/0,78 VC:3 ENC:BIT_PACKED,PLAIN ST:[min: 1684, max: 366, num_nulls: 0]
pc:    BINARY GZIP DO:0 FPO:754 SZ:76/62/0,82 VC:3 ENC:BIT_PACKED,PLAIN ST:[min: 19042, max: 21842, num_nulls: 0]
osc:   BINARY GZIP DO:0 FPO:830 SZ:70/62/0,89 VC:3 ENC:BIT_PACKED,PLAIN ST:[min: 362, max: 362, num_nulls: 0]
sc:    BINARY GZIP DO:0 FPO:900 SZ:52/41/0,79 VC:3 ENC:BIT_PACKED,PLAIN ST:[min: , max: , num_nulls: 0]

hadoop jar parquet-tools-1.10.0.jar schema /tmp/t/0_0_0.parquet
message root {
  required binary sha1 (UTF8);
  required binary md5 (UTF8);
  required binary crc32 (UTF8);
  required binary fn (UTF8);
  required binary fs (UTF8);
  required binary pc (UTF8);
  required binary osc (UTF8);
  required binary sc (UTF8);
}
{code}
ZLIB_VERSION : 1.2.11
gzip 1.6
java version "1.8.0_191", openjdk version "1.8.0_222"

> parquet with compression gzip doesn't work well
> -----------------------------------------------
>
>                 Key: DRILL-7291
>                 URL: https://issues.apache.org/jira/browse/DRILL-7291
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.15.0, 1.16.0
>            Reporter: benj
>            Priority: Major
>         Attachments: 0_0_0.parquet, short_no_binary_quote.csvh
>
>
> Creating a parquet file with compression=gzip produces a bad result.
> Example:
> * input: file_pqt (compression=none)
> {code:java}
> ALTER SESSION SET `store.format`='parquet';
> ALTER SESSION SET `store.parquet.compression` = 'snappy';
> CREATE TABLE ....`file_snappy_pqt`
> AS(SELECT * FROM ....`file_pqt`);
> ALTER SESSION SET `store.parquet.compression` = 'gzip';
> CREATE TABLE ....`file_gzip_pqt`
> AS(SELECT * FROM ....`file_pqt`);{code}
> Then compare the content of the different parquet files:
> {code:java}
> ALTER SESSION SET `store.parquet.use_new_reader` = true;
> SELECT COUNT(*) FROM ....`file_pqt`;        => 15728036
> SELECT COUNT(*) FROM ....`file_snappy_pqt`; => 15728036
> SELECT COUNT(*) FROM ....`file_gzip_pqt`;   => 15728036
> => OK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code` = '';        => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code` = ''; => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code` = '';   => 14744966
> => NOK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code2` = '';        => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code2` = ''; => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code2` = '';   => 14744921
> => NOK{code}
> _(There is no NULL value in these files.)_
> _(With exec.storage.enable_v3_text_reader=true it gives the same results.)_
> So although the parquet file contains the right number of rows, the values in the different columns are not identical.
> Some "random" values of the _gzip parquet_ are reduced to empty strings.
> I think the problem is in the reader and not the writer, because:
> {code:java}
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `CRC32` = 'B33D600C';      => 2
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `CRC32` = 'B33D600C'; => 0
> {code}
> but
> {code:java}
> hadoop jar parquet-tools-1.10.0.jar cat file_gzip_pqt/1_0_0.parquet | grep -c "B33D600C"
> 2019-06-12 11:45:23,738 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3597092 records.
> 2019-06-12 11:45:23,739 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
> 2019-06-12 11:45:23,804 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2019-06-12 11:45:23,805 INFO compress.CodecPool: Got brand-new decompressor [.gz]
> 2019-06-12 11:45:23,815 INFO hadoop.InternalParquetRecordReader: block read in memory in 76 ms. row count = 3597092
> 2
> {code}
> So the values are indeed present in the _Apache Parquet_ file but cannot be read via _Apache Drill_.
> Attached is an extract (the original file is 2.2 GB) which reproduces the same behaviour.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
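The reader-vs-writer argument above can be sanity-checked outside Drill: parquet-tools prints every value (including the empty `sc` column), and the GZIP codec itself is lossless, so the empty strings cannot have been lost at write time. A minimal sketch, using Python's standard `gzip` module rather than Drill's code path, with the sample values taken from the session above:

```python
import gzip

# Column values from the 3-row sample, including the empty "sc" value.
values = [
    "0000000F8527DCCAB6642252BBCFA1B8072D33EE",  # sha1
    "68CE322D8A896B6E4E7E3F18339EC85C",          # md5
    "E39149E4",                                  # crc32
    "Blended_Coolers_Vanilla_NL.png",            # fn
    "",                                          # sc (legitimately empty)
]

for v in values:
    compressed = gzip.compress(v.encode("utf-8"))
    restored = gzip.decompress(compressed).decode("utf-8")
    # gzip round-trips every value exactly, empty strings included,
    # so values visible to parquet-tools must be intact on disk.
    assert restored == v
```

This only demonstrates that the codec preserves data; the actual fault would have to be in how Drill's Parquet reader handles the decompressed page buffers, consistent with the report that parquet-tools (which uses its own reader) sees all values.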