[ 
https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16945825#comment-16945825
 ] 

benj commented on DRILL-7291:
-----------------------------

Sure, sorry for the delay, but I preferred to double-check some points against the 
original file first:
 - There is a column 'FileName' in the csvh => renaming the columns doesn't fix the 
problem.
 - The original csvh has Windows (CRLF) line endings => converting to Unix line 
endings doesn't fix the problem.
 - Some fields are not surrounded by quotes => forcing quotes everywhere 
doesn't fix the problem.
 - There is some binary data in the csvh, but the same problem appears with a 
small extract containing no binary data, so I prefer to push a small file rather 
than the big one.
 - And I have double-checked again against the latest 1.17 git build; the problem 
is still there.

So attached is a minimal csvh file (1 header row and 3 data rows). I have also 
pasted its content just below:
 [^short_no_binary_quote.csvh]
{code:java}
"sha1","md5","crc32","fn","fs","pc","osc","sc"
"0000000F8527DCCAB6642252BBCFA1B8072D33EE","68CE322D8A896B6E4E7E3F18339EC85C","E39149E4","Blended_Coolers_Vanilla_NL.png","30439","19042","362",""
"00000091728653B7D55DF30BFAFE86C52F2F4A59","81AE5D302A0E6D33182CB69ED791181C","5594C3B0","ic_menu_notifications.png","366","21386","362",""
"0000065F1900120613745CC5E25A57C84624DC2B","AEB7C147EF7B7CEE91807B500A378BA4","24400952","points_program_fragment.xml","1684","21842","362",""
{code}
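As an aside (my own check, not part of the original report): the sample above parses cleanly with Python's standard csv module, which backs up the bullets about quoting and line endings not being the cause. A minimal sketch:

```python
import csv
import io

# The three-row sample from short_no_binary_quote.csvh, pasted verbatim.
SAMPLE = '''"sha1","md5","crc32","fn","fs","pc","osc","sc"
"0000000F8527DCCAB6642252BBCFA1B8072D33EE","68CE322D8A896B6E4E7E3F18339EC85C","E39149E4","Blended_Coolers_Vanilla_NL.png","30439","19042","362",""
"00000091728653B7D55DF30BFAFE86C52F2F4A59","81AE5D302A0E6D33182CB69ED791181C","5594C3B0","ic_menu_notifications.png","366","21386","362",""
"0000065F1900120613745CC5E25A57C84624DC2B","AEB7C147EF7B7CEE91807B500A378BA4","24400952","points_program_fragment.xml","1684","21842","362",""'''

rows = list(csv.reader(io.StringIO(SAMPLE)))
header, data = rows[0], rows[1:]

# Every data row has the same field count as the header: the file is well-formed CSV.
assert all(len(r) == len(header) for r in data)
print(len(header), "columns,", len(data), "data rows")  # 8 columns, 3 data rows
```

So the input itself is unremarkable CSV; whatever goes wrong happens later in the parquet path.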


{code:sql}
/*
    "csvh": {
      "type": "text",
      "extensions": [
        "csvh"
      ],
      "extractHeader": true,
      "delimiter": ","
    },
*/
ALTER SESSION set exec.storage.enable_v3_text_reader=true;
ALTER SESSION SET `store.parquet.compression` = 'gzip';
SELECT * FROM ....`DRILL_7291/short_no_binary_quote.csvh`;
+------------------------------------------+----------------------------------+----------+--------------------------------+-------+-------+-----+----+
|                   sha1                   |               md5                |  crc32   |               fn               |  fs   |  pc   | osc | sc |
+------------------------------------------+----------------------------------+----------+--------------------------------+-------+-------+-----+----+
| 0000000F8527DCCAB6642252BBCFA1B8072D33EE | 68CE322D8A896B6E4E7E3F18339EC85C | E39149E4 | Blended_Coolers_Vanilla_NL.png | 30439 | 19042 | 362 |    |
| 00000091728653B7D55DF30BFAFE86C52F2F4A59 | 81AE5D302A0E6D33182CB69ED791181C | 5594C3B0 | ic_menu_notifications.png      | 366   | 21386 | 362 |    |
| 0000065F1900120613745CC5E25A57C84624DC2B | AEB7C147EF7B7CEE91807B500A378BA4 | 24400952 | points_program_fragment.xml    | 1684  | 21842 | 362 |    |
+------------------------------------------+----------------------------------+----------+--------------------------------+-------+-------+-----+----+

CREATE TABLE ....`DRILL_7291/problem_pqt` AS( SELECT * FROM 
....`DRILL_7291/short_no_binary_quote.csvh`);
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 0_0      | 3                         |
+----------+---------------------------+

SELECT * FROM ....`DRILL_7291/problem_pqt`;
Error: DATA_READ ERROR: Error reading page data
{code}

And everything works fine with 'snappy' or 'none'.
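One more sanity check of my own (not from the report above): plain gzip round-trips bytes losslessly, which is consistent with the corruption sitting in Drill's parquet page-read path rather than in the gzip codec itself. A minimal stdlib sketch:

```python
import gzip

# A payload built from values in the sample above; any bytes behave the same.
payload = b"E39149E4,Blended_Coolers_Vanilla_NL.png,30439"

compressed = gzip.compress(payload)
restored = gzip.decompress(compressed)

# gzip itself is lossless: compress-then-decompress returns the exact input.
assert restored == payload
```

This matches the parquet-tools observation in the original description: the gzip-compressed pages do hold the correct values, so the failure must occur when Drill decodes them.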

> parquet with compression gzip doesn't work well
> -----------------------------------------------
>
>                 Key: DRILL-7291
>                 URL: https://issues.apache.org/jira/browse/DRILL-7291
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.15.0, 1.16.0
>            Reporter: benj
>            Priority: Major
>         Attachments: 0_0_0.parquet, short_no_binary_quote.csvh
>
>
> Creating a parquet file with compression=gzip produces bad results.
> Example:
>  * input: file_pqt (compression=none)
> {code:java}
> ALTER SESSION SET `store.format`='parquet';
> ALTER SESSION SET `store.parquet.compression` = 'snappy';
> CREATE TABLE ....`file_snappy_pqt` 
>  AS(SELECT * FROM ....`file_pqt`);
> ALTER SESSION SET `store.parquet.compression` = 'gzip';
> CREATE TABLE ....`file_gzip_pqt` 
>  AS(SELECT * FROM ....`file_pqt`);{code}
> Then compare the content of the different parquet files:
> {code:java}
> ALTER SESSION SET `store.parquet.use_new_reader` = true;
> SELECT COUNT(*) FROM ....`file_pqt`;        => 15728036
> SELECT COUNT(*) FROM ....`file_snappy_pqt`; => 15728036
> SELECT COUNT(*) FROM ....`file_gzip_pqt`;   => 15728036
> => OK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code` = '';        => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code` = ''; => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code` = '';   => 14744966
> => NOK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code2` = '';        => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code2` = ''; => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code2` = '';   => 14744921
> => NOK{code}
> _(There is no NULL value in these files.)_
>  _(With exec.storage.enable_v3_text_reader=true it gives same results)_
> So while the parquet file contains the right number of rows, the values in the 
> different columns are not identical.
> Some "random" values of the _gzip parquet_ are reduced to empty strings.
> I think the problem comes from the reader and not the writer, because:
> {code:java}
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `CRC32` = 'B33D600C';      => 2
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `CRC32` = 'B33D600C'; => 0
> {code}
> but
> {code:java}
> hadoop jar parquet-tools-1.10.0.jar cat file_gzip_pqt/1_0_0.parquet | grep -c "B33D600C"
> 2019-06-12 11:45:23,738 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3597092 records.
> 2019-06-12 11:45:23,739 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
> 2019-06-12 11:45:23,804 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2019-06-12 11:45:23,805 INFO compress.CodecPool: Got brand-new decompressor [.gz]
> 2019-06-12 11:45:23,815 INFO hadoop.InternalParquetRecordReader: block read in memory in 76 ms. row count = 3597092
> 2
> {code}
>  So the values are indeed present in the _Apache Parquet_ file but cannot be 
> read via _Apache Drill_.
> Attached is an extract (the original file is 2.2 GB) which reproduces the same 
> behaviour.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
