[ https://issues.apache.org/jira/browse/DRILL-7291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949501#comment-16949501 ]
benj commented on DRILL-7291:
-----------------------------

I did exactly the same as you (see below), but obtain an "+Error: INTERNAL_ERROR ERROR: null+" at the end:
{code:sql}
./drill-embedded
Apache Drill 1.17.0-SNAPSHOT
"A Drill in the hand is better than two in the bush."
apache drill> ALTER SESSION SET `store.parquet.compression` = 'gzip';
+------+------------------------------------+
|  ok  |              summary               |
+------+------------------------------------+
| true | store.parquet.compression updated. |
+------+------------------------------------+
1 row selected (0.339 seconds)
apache drill> select * from dfs.tmp.`short_no_binary_quote.csvh`;
+------------------------------------------+----------------------------------+----------+--------------------------------+-------+-------+-----+----+
|                   sha1                   |               md5                |  crc32   |               fn               |  fs   |  pc   | osc | sc |
+------------------------------------------+----------------------------------+----------+--------------------------------+-------+-------+-----+----+
| 0000000F8527DCCAB6642252BBCFA1B8072D33EE | 68CE322D8A896B6E4E7E3F18339EC85C | E39149E4 | Blended_Coolers_Vanilla_NL.png | 30439 | 19042 | 362 |    |
| 00000091728653B7D55DF30BFAFE86C52F2F4A59 | 81AE5D302A0E6D33182CB69ED791181C | 5594C3B0 | ic_menu_notifications.png      | 366   | 21386 | 362 |    |
| 0000065F1900120613745CC5E25A57C84624DC2B | AEB7C147EF7B7CEE91807B500A378BA4 | 24400952 | points_program_fragment.xml    | 1684  | 21842 | 362 |    |
+------------------------------------------+----------------------------------+----------+--------------------------------+-------+-------+-----+----+
3 rows selected (1.008 seconds)
apache drill> use dfs.tmp;
+------+-------------------------------------+
|  ok  |               summary               |
+------+-------------------------------------+
| true | Default schema changed to [dfs.tmp] |
+------+-------------------------------------+
1 row selected (0.087 seconds)
apache drill (dfs.tmp)> create table t as select * from dfs.tmp.`short_no_binary_quote.csvh`;
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 0_0      | 3                         |
+----------+---------------------------+
1 row selected (0.306 seconds)
apache drill (dfs.tmp)> select * from t;
Error: INTERNAL_ERROR ERROR: null

Fragment 0:0
{code}
But the parquet file itself seems OK:
{code:bash}
hadoop jar parquet-tools-1.10.0.jar cat /tmp/t/0_0_0.parquet
2019-10-11 15:48:14,047 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3 records.
2019-10-11 15:48:14,048 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
2019-10-11 15:48:14,063 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2019-10-11 15:48:14,063 INFO compress.CodecPool: Got brand-new decompressor [.gz]
2019-10-11 15:48:14,068 INFO hadoop.InternalParquetRecordReader: block read in memory in 19 ms. row count = 3
sha1 = 0000000F8527DCCAB6642252BBCFA1B8072D33EE
md5 = 68CE322D8A896B6E4E7E3F18339EC85C
crc32 = E39149E4
fn = Blended_Coolers_Vanilla_NL.png
fs = 30439
pc = 19042
osc = 362
sc =

sha1 = 00000091728653B7D55DF30BFAFE86C52F2F4A59
md5 = 81AE5D302A0E6D33182CB69ED791181C
crc32 = 5594C3B0
fn = ic_menu_notifications.png
fs = 366
pc = 21386
osc = 362
sc =

sha1 = 0000065F1900120613745CC5E25A57C84624DC2B
md5 = AEB7C147EF7B7CEE91807B500A378BA4
crc32 = 24400952
fn = points_program_fragment.xml
fs = 1684
pc = 21842
osc = 362
sc =

hadoop jar parquet-tools-1.10.0.jar meta /tmp/t/0_0_0.parquet
2019-10-11 15:51:25,255 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
2019-10-11 15:51:25,256 INFO hadoop.ParquetFileReader: reading another 1 footers
2019-10-11 15:51:25,256 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
file:        file:/tmp/t/0_0_0.parquet
creator:     parquet-mr version 1.10.0 (build ${buildNumber})
extra:       drill-writer.version = 3
extra:       drill.version = 1.17.0-SNAPSHOT

file schema: root
--------------------------------------------------------------------------------
sha1:  REQUIRED BINARY O:UTF8 R:0 D:0
md5:   REQUIRED BINARY O:UTF8 R:0 D:0
crc32: REQUIRED BINARY O:UTF8 R:0 D:0
fn:    REQUIRED BINARY O:UTF8 R:0 D:0
fs:    REQUIRED BINARY O:UTF8 R:0 D:0
pc:    REQUIRED BINARY O:UTF8 R:0 D:0
osc:   REQUIRED BINARY O:UTF8 R:0 D:0
sc:    REQUIRED BINARY O:UTF8 R:0 D:0

row group 1: RC:3 TS:914 OFFSET:4
--------------------------------------------------------------------------------
sha1:  BINARY GZIP DO:0 FPO:4 SZ:210/239/1,14 VC:3 ENC:BIT_PACKED,PLAIN ST:[min: 0000000F8527DCCAB6642252BBCFA1B8072D33EE, max: 0000065F1900120613745CC5E25A57C84624DC2B, num_nulls: 0]
md5:   BINARY GZIP DO:0 FPO:214 SZ:189/199/1,05 VC:3 ENC:BIT_PACKED,PLAIN ST:[min: 68CE322D8A896B6E4E7E3F18339EC85C, max: AEB7C147EF7B7CEE91807B500A378BA4, num_nulls: 0]
crc32: BINARY GZIP DO:0 FPO:403 SZ:93/77/0,83 VC:3 ENC:BIT_PACKED,PLAIN ST:[min: 24400952, max: E39149E4, num_nulls: 0]
fn:    BINARY GZIP DO:0 FPO:496 SZ:186/178/0,96 VC:3 ENC:BIT_PACKED,PLAIN ST:[min: Blended_Coolers_Vanilla_NL.png, max: points_program_fragment.xml, num_nulls: 0]
fs:    BINARY GZIP DO:0 FPO:682 SZ:72/56/0,78 VC:3 ENC:BIT_PACKED,PLAIN ST:[min: 1684, max: 366, num_nulls: 0]
pc:    BINARY GZIP DO:0 FPO:754 SZ:76/62/0,82 VC:3 ENC:BIT_PACKED,PLAIN ST:[min: 19042, max: 21842, num_nulls: 0]
osc:   BINARY GZIP DO:0 FPO:830 SZ:70/62/0,89 VC:3 ENC:BIT_PACKED,PLAIN ST:[min: 362, max: 362, num_nulls: 0]
sc:    BINARY GZIP DO:0 FPO:900 SZ:52/41/0,79 VC:3 ENC:BIT_PACKED,PLAIN ST:[min: , max: , num_nulls: 0]

hadoop jar parquet-tools-1.10.0.jar schema /tmp/t/0_0_0.parquet
message root {
  required binary sha1 (UTF8);
  required binary md5 (UTF8);
  required binary crc32 (UTF8);
  required binary fn (UTF8);
  required binary fs (UTF8);
  required binary pc (UTF8);
  required binary osc (UTF8);
  required binary sc (UTF8);
}
{code}
ZLIB_VERSION : 1.2.11
gzip 1.6
java version "1.8.0_191", openjdk version "1.8.0_222"

> parquet with compression gzip doesn't work well
> -----------------------------------------------
>
>                 Key: DRILL-7291
>                 URL: https://issues.apache.org/jira/browse/DRILL-7291
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.15.0, 1.16.0
>            Reporter: benj
>            Priority: Major
>         Attachments: 0_0_0.parquet, short_no_binary_quote.csvh
>
>
> Creating a parquet file with compression=gzip produces a bad result.
> Example:
> * input: file_pqt (compression=none)
> {code:java}
> ALTER SESSION SET `store.format`='parquet';
> ALTER SESSION SET `store.parquet.compression` = 'snappy';
> CREATE TABLE ....`file_snappy_pqt`
> AS(SELECT * FROM ....`file_pqt`);
> ALTER SESSION SET `store.parquet.compression` = 'gzip';
> CREATE TABLE ....`file_gzip_pqt`
> AS(SELECT * FROM ....`file_pqt`);{code}
> Then compare the content of the different parquet files:
> {code:java}
> ALTER SESSION SET `store.parquet.use_new_reader` = true;
> SELECT COUNT(*) FROM ....`file_pqt`;        => 15728036
> SELECT COUNT(*) FROM ....`file_snappy_pqt`; => 15728036
> SELECT COUNT(*) FROM ....`file_gzip_pqt`;   => 15728036
> => OK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code` = '';        => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code` = ''; => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code` = '';   => 14744966
> => NOK
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `Code2` = '';        => 0
> SELECT COUNT(*) FROM ....`file_snappy_pqt` WHERE `Code2` = ''; => 0
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `Code2` = '';   => 14744921
> => NOK{code}
> _(There is no NULL value in these files.)_
> _(With exec.storage.enable_v3_text_reader=true it gives the same results.)_
> So although the parquet file contains the right number of rows, the values in the different columns are not identical.
> Some "random" values of the _gzip parquet_ are reduced to empty strings.
> I think the problem is in the reader and not the writer, because:
> {code:java}
> SELECT COUNT(*) FROM ....`file_pqt` WHERE `CRC32` = 'B33D600C';      => 2
> SELECT COUNT(*) FROM ....`file_gzip_pqt` WHERE `CRC32` = 'B33D600C'; => 0
> {code}
> but
> {code:java}
> hadoop jar parquet-tools-1.10.0.jar cat file_gzip_pqt/1_0_0.parquet | grep -c "B33D600C"
> 2019-06-12 11:45:23,738 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 3597092 records.
> 2019-06-12 11:45:23,739 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
> 2019-06-12 11:45:23,804 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2019-06-12 11:45:23,805 INFO compress.CodecPool: Got brand-new decompressor [.gz]
> 2019-06-12 11:45:23,815 INFO hadoop.InternalParquetRecordReader: block read in memory in 76 ms. row count = 3597092
> 2
> {code}
> So the values are indeed present in the _Apache Parquet_ file but cannot be read via _Apache Drill_.
> Attached is an extract (the original file is 2.2 GB) which reproduces the same behaviour.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
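The reader-vs-writer argument above can be sanity-checked outside Drill: parquet-tools prints every value (including the empty `sc` column), and the GZIP codec itself is lossless, so the empty strings cannot have been lost at write time. A minimal sketch, using Python's standard `gzip` module rather than Drill's code path, with the sample values taken from the session above:

```python
import gzip

# Column values from the 3-row sample, including the empty "sc" value.
values = [
    "0000000F8527DCCAB6642252BBCFA1B8072D33EE",  # sha1
    "68CE322D8A896B6E4E7E3F18339EC85C",          # md5
    "E39149E4",                                  # crc32
    "Blended_Coolers_Vanilla_NL.png",            # fn
    "",                                          # sc (legitimately empty)
]

for v in values:
    compressed = gzip.compress(v.encode("utf-8"))
    restored = gzip.decompress(compressed).decode("utf-8")
    # gzip round-trips every value exactly, empty strings included,
    # so values visible to parquet-tools must be intact on disk.
    assert restored == v
```

This only demonstrates that the codec preserves data; the actual fault would have to be in how Drill's Parquet reader handles the decompressed page buffers, consistent with the report that parquet-tools (which uses its own reader) sees all values.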