[jira] [Commented] (ARROW-9177) [C++][Parquet] Tracking issue for cross-implementation LZ4 Parquet compression compatibility

Steve M. Kim (Jira) Fri, 08 Jan 2021 19:19:05 -0800


    [ https://issues.apache.org/jira/browse/ARROW-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261737#comment-17261737 ]


Steve M. Kim commented on ARROW-9177:
-------------------------------------

I think that this issue is still not resolved. Using pyarrow 2.0.0, I cannot 
read a Parquet file that was written with LZ4 compression by Java 
parquet-mr 1.10.1.

Metadata for the Parquet file:
{code:java}
$ parquet-tools meta /tmp/b3caf545-3a3e-4c3e-9542-9ecbca341306/e86fa357-6e1f-4acd-945c-59d558a9434a
file:        file:/tmp/b3caf545-3a3e-4c3e-9542-9ecbca341306/e86fa357-6e1f-4acd-945c-59d558a9434a
creator:     parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)

file schema:
--------------------------------------------------------------------------------
c1:          REQUIRED BINARY R:0 D:0
c0:          REQUIRED INT64 R:0 D:0
c2:          REQUIRED INT32 R:0 D:0
c3:          REQUIRED BINARY R:0 D:0
v1:          OPTIONAL INT64 R:0 D:1

row group 1: RC:3330100 TS:33837867 OFFSET:4
--------------------------------------------------------------------------------
c1:           BINARY LZ4 DO:0 FPO:4 SZ:27496/26246/0.95 VC:3330100 ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0x4151556B4E6B454A7156675642466570637A556C5067704746686D5354424276766866726F73616370, max: 0x666F58594E7856684F4765684265714C58785A5464446B76526C4D5358635968576A586E494B727059, num_nulls: 0]
c0:           INT64 LZ4 DO:0 FPO:27500 SZ:4434451/5905528/1.33 VC:3330100 ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 8639, num_nulls: 0]
c2:           INT32 LZ4 DO:0 FPO:4461951 SZ:4401/839852/190.83 VC:3330100 ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 2, num_nulls: 0]
c3:           BINARY LZ4 DO:0 FPO:4466352 SZ:2645/423441/160.09 VC:3330100 ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0x41, max: 0x42, num_nulls: 0]
v1:           INT64 LZ4 DO:0 FPO:4468997 SZ:26748724/26642800/1.00 VC:3330100 ENC:RLE,PLAIN,BIT_PACKED ST:[min: -9223371059640220560, max: 9223371048678785692, num_nulls: 0]

row group 2: RC:1698380 TS:17291569 OFFSET:31217721
--------------------------------------------------------------------------------
c1:           BINARY LZ4 DO:0 FPO:31217721 SZ:14103/13453/0.95 VC:1698380 ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0x666F58594E7856684F4765684265714C58785A5464446B76526C4D5358635968576A586E494B727059, max: 0x7847426D495659786A7277477A756C4A4172586C66724551435547474145716A716A675342426C6761, num_nulls: 0]
c0:           INT64 LZ4 DO:0 FPO:31231824 SZ:2272242/3045717/1.34 VC:1698380 ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 8639, num_nulls: 0]
c2:           INT32 LZ4 DO:0 FPO:33504066 SZ:2286/428364/187.39 VC:1698380 ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 2, num_nulls: 0]
c3:           BINARY LZ4 DO:0 FPO:33506352 SZ:1413/215996/152.86 VC:1698380 ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0x41, max: 0x42, num_nulls: 0]
v1:           INT64 LZ4 DO:0 FPO:33507765 SZ:13642061/13588039/1.00 VC:1698380 ENC:RLE,PLAIN,BIT_PACKED ST:[min: -9223351962781870274, max: 9223365078578074233, num_nulls: 0] {code}
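
To make the size columns in the listing above easier to compare, here is a minimal sketch that pulls the SZ field out of each column-chunk line. It assumes SZ is printed as compressed/uncompressed/ratio, which is consistent with the arithmetic in the listing (for example 26246 / 27496 is about 0.95). The names parse_column_sizes, SZ_PATTERN and SAMPLE are hypothetical helpers for illustration, and the sample lines are shortened copies of three lines from row group 1.

```python
import re

# Assumed layout of a parquet-tools meta column line:
#   <name>: <type> <codec> ... SZ:<compressed>/<uncompressed>/<ratio> ...
SZ_PATTERN = re.compile(r"^(\w+):\s+\w+ (\w+) .*?SZ:(\d+)/(\d+)/([\d.]+)")


def parse_column_sizes(meta_text):
    """Return (column, codec, compressed, uncompressed, ratio) for each column-chunk line."""
    rows = []
    for line in meta_text.splitlines():
        match = SZ_PATTERN.match(line.strip())
        if match is None:
            continue  # schema lines, row-group headers, separators, wrapped tails
        name, codec, compressed, uncompressed, _printed_ratio = match.groups()
        compressed = int(compressed)
        uncompressed = int(uncompressed)
        # Recompute the ratio instead of trusting the rounded printed value.
        rows.append((name, codec, compressed, uncompressed, uncompressed / compressed))
    return rows


# Shortened copies of three lines from row group 1 (ENC and ST fields dropped).
SAMPLE = """\
c1:           BINARY LZ4 DO:0 FPO:4 SZ:27496/26246/0.95 VC:3330100
c0:           INT64 LZ4 DO:0 FPO:27500 SZ:4434451/5905528/1.33 VC:3330100
v1:           INT64 LZ4 DO:0 FPO:4468997 SZ:26748724/26642800/1.00 VC:3330100
"""

for name, codec, compressed, uncompressed, ratio in parse_column_sizes(SAMPLE):
    note = " (larger after compression)" if compressed > uncompressed else ""
    print(f"{name}: {codec} {uncompressed} -> {compressed} bytes, ratio {ratio:.2f}{note}")
```

Read this way, the c1 and v1 chunks are slightly larger compressed than uncompressed in both row groups. Whether that has anything to do with the read failure is not established here.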

What happens when I try to read this file with pyarrow.parquet:
{code:java}
Python 3.6.11 | packaged by conda-forge | (default, Jul 28 2020, 23:15:00)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> pa.__version__
'2.0.0'
>>> import pyarrow.parquet as pq
>>> pq.read_table('/tmp/b3caf545-3a3e-4c3e-9542-9ecbca341306/e86fa357-6e1f-4acd-945c-59d558a9434a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/skim/.local/Miniconda3-py38_4.8.3-Linux-x86_64/envs/pyarrow/lib/python3.6/site-packages/pyarrow/parquet.py", line 1639, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/home/skim/.local/Miniconda3-py38_4.8.3-Linux-x86_64/envs/pyarrow/lib/python3.6/site-packages/pyarrow/parquet.py", line 1517, in read
    use_threads=use_threads
  File "pyarrow/_dataset.pyx", line 405, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2262, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: IOError: Corrupt Lz4 compressed data. {code}

> [C++][Parquet] Tracking issue for cross-implementation LZ4 Parquet 
> compression compatibility
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9177
>                 URL: https://issues.apache.org/jira/browse/ARROW-9177
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Antoine Pitrou
>            Priority: Critical
>             Fix For: 2.0.0
>
>
> Per PARQUET-1878, it seems that there are still problems with our use of LZ4 
> compression in the Parquet format. While we should fix this (the Parquet 
> specification and our implementation of it), we may need to disable use of 
> LZ4 compression until the appropriate compatibility testing can be done. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
