[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16945825#comment-16945825 ]
benj commented on DRILL-7291:
-----------------------------

Sure, sorry for the delay, but I preferred to double-check some points against the original file:
- There is a column 'FileName' in the csvh => Renaming the columns doesn't fix the problem.
- The original csvh has Microsoft line endings => Changing to Unix line endings doesn't fix the problem.
- Some fields have no surrounding quotes => Forcing quotes everywhere doesn't fix the problem.
- There is some binary data in the csvh, but the same problem appears with a small extract that contains no binary data, so I prefer to push a small file rather than the big one.
- I have also checked again on the latest git 1.17, and the problem is still not fixed.

So attached is a minimal csvh file (1 header row and 3 data rows). I have also put it just below.

[^short_no_binary_quote.csvh]
{code:java}
"sha1","md5","crc32","fn","fs","pc","osc","sc"
"0000000F8527DCCAB6642252BBCFA1B8072D33EE","68CE322D8A896B6E4E7E3F18339EC85C","E39149E4","Blended_Coolers_Vanilla_NL.png","30439","19042","362",""
"00000091728653B7D55DF30BFAFE86C52F2F4A59","81AE5D302A0E6D33182CB69ED791181C","5594C3B0","ic_menu_notifications.png","366","21386","362",""
"0000065F1900120613745CC5E25A57C84624DC2B","AEB7C147EF7B7CEE91807B500A378BA4","24400952","points_program_fragment.xml","1684","21842","362",""
{code}
{code:sql}
/*
"csvh": {
  "type": "text",
  "extensions": [
    "csvh"
  ],
  "extractHeader": true,
  "delimiter": ","
},
*/
ALTER SESSION set exec.storage.enable_v3_text_reader=true;
ALTER SESSION SET `store.parquet.compression` = 'gzip';

SELECT * FROM ....`DRILL_7291/short_no_binary_quote.csvh`;
+------------------------------------------+----------------------------------+----------+--------------------------------+-------+-------+-----+----+
| sha1                                     | md5                              | crc32    | fn                             | fs    | pc    | osc | sc |
+------------------------------------------+----------------------------------+----------+--------------------------------+-------+-------+-----+----+
| 0000000F8527DCCAB6642252BBCFA1B8072D33EE | 68CE322D8A896B6E4E7E3F18339EC85C | E39149E4 | Blended_Coolers_Vanilla_NL.png | 30439 | 19042 | 362 |    |
| 00000091728653B7D55DF30BFAFE86C52F2F4A59 | 81AE5D302A0E6D33182CB69ED791181C | 5594C3B0 | ic_menu_notifications.png      | 366   | 21386 | 362 |    |
| 0000065F1900120613745CC5E25A57C84624DC2B | AEB7C147EF7B7CEE91807B500A378BA4 | 24400952 | points_program_fragment.xml    | 1684  | 21842 | 362 |    |
+------------------------------------------+----------------------------------+----------+--------------------------------+-------+-------+-----+----+

CREATE TABLE ....`DRILL_7291/problem_pqt` AS(
SELECT * FROM ....`DRILL_7291/short_no_binary_quote.csvh`);
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 0_0      | 3                         |
+----------+---------------------------+

SELECT * FROM ....`DRILL_7291/problem_pqt`;
Error: DATA_READ ERROR: Error reading page data
{code}
Everything works fine with 'snappy' or 'none'.

> parquet with compression gzip doesn't work well
> -----------------------------------------------
>
>                 Key: DRILL-7291
>                 URL: https://issues.apache.org/jira/browse/DRILL-7291
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.15.0, 1.16.0
>            Reporter: benj
>            Priority: Major
>         Attachments: 0_0_0.parquet, short_no_binary_quote.csvh
>
>
> Creating a parquet file with compression=gzip produces bad results.
> Example:
> * input: file_pqt (compression=none)
> {code:java}
> ALTER SESSION SET `store.format`='parquet';
> ALTER SESSION SET `store.parquet.compression` = 'snappy';
> CREATE TABLE ....`file_snappy_pqt`
>  AS(SELECT * FROM ....`file_pqt`);
> ALTER SESSION SET `store.parquet.compression` = 'gzip';
> CREATE TABLE ....`file_gzip_pqt`
>  AS(SELECT * FROM ....`file_pqt`);{code}
> Then compare the content of the different parquet files:
> {code:java}
> ALTER SESSION SET `store.parquet.use_new_reader` = true;
> SELECT COUNT(*) FROM ....`file_pqt`;        => 15728036
> SELECT COUNT(*) FROM ....`file_snappy_pqt`; => 15728036
> SELECT COUNT(*) FROM ....`file_gzip_pqt`;   => 15728036
> => OK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code` = '';        => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code` = ''; => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code` = '';   => 14744966
> => NOK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code2` = '';        => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code2` = ''; => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code2` = '';   => 14744921
> => NOK{code}
> _(There is no NULL value in these files.)_
> _(With exec.storage.enable_v3_text_reader=true it gives the same results.)_
> So although the gzip parquet file contains the right number of rows, the values in its columns are not identical to those of the other files.
> Some "random" values of the _gzip parquet_ are reduced to empty strings.
> I think the problem comes from the reader and not the writer, because:
> {code:java}
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `CRC32` = 'B33D600C';      => 2
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `CRC32` = 'B33D600C'; => 0
> {code}
> but
> {code:java}
> hadoop jar parquet-tools-1.10.0.jar cat file_gzip_pqt/1_0_0.parquet | grep -c "B33D600C"
> 2019-06-12 11:45:23,738 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3597092 records.
> 2019-06-12 11:45:23,739 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
> 2019-06-12 11:45:23,804 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2019-06-12 11:45:23,805 INFO compress.CodecPool: Got brand-new decompressor [.gz]
> 2019-06-12 11:45:23,815 INFO hadoop.InternalParquetRecordReader: block read in memory in 76 ms. row count = 3597092
> 2
> {code}
> So the values are indeed present in the _Apache Parquet_ file but can't be read via _Apache Drill_.
> Attached is an extract (the original file is 2.2 GB) which produces the same behaviour.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
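
For anyone re-running the checks listed in the comment (Unix line endings, quotes around every field), the preprocessing could be sketched as below. This is only an illustration using the Python standard library: the function name `normalize_csvh` is hypothetical, and it is not the tooling the reporter actually used.

```python
import csv
import io


def normalize_csvh(text: str) -> str:
    """Rewrite CSV text with LF line endings and quotes around every field.

    Hypothetical helper: it mirrors two of the checks described in the
    comment (Unix end of line, forced quoting), nothing more.
    """
    # newline="" leaves \r\n untouched so the csv module parses it itself.
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    # QUOTE_ALL quotes every field, including empty ones such as `sc`.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()
```

The result can then be saved with the `.csvh` extension and queried as in the comment. According to the reporter, neither of these two changes affects the error.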