[ https://issues.apache.org/jira/browse/ARROW-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261737#comment-17261737 ]
Steve M. Kim commented on ARROW-9177:
-------------------------------------

I think that this issue is still not resolved. Using pyarrow 2.0.0, I cannot read a Parquet file that was written with LZ4 compression using Java parquet-mr 1.10.1.

Metadata for the Parquet file:

{code:java}
$ parquet-tools meta /tmp/b3caf545-3a3e-4c3e-9542-9ecbca341306/e86fa357-6e1f-4acd-945c-59d558a9434a
file: file:/tmp/b3caf545-3a3e-4c3e-9542-9ecbca341306/e86fa357-6e1f-4acd-945c-59d558a9434a
creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)

file schema:
--------------------------------------------------------------------------------
c1: REQUIRED BINARY R:0 D:0
c0: REQUIRED INT64 R:0 D:0
c2: REQUIRED INT32 R:0 D:0
c3: REQUIRED BINARY R:0 D:0
v1: OPTIONAL INT64 R:0 D:1

row group 1: RC:3330100 TS:33837867 OFFSET:4
--------------------------------------------------------------------------------
c1: BINARY LZ4 DO:0 FPO:4 SZ:27496/26246/0.95 VC:3330100 ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0x4151556B4E6B454A7156675642466570637A556C5067704746686D5354424276766866726F73616370, max: 0x666F58594E7856684F4765684265714C58785A5464446B76526C4D5358635968576A586E494B727059, num_nulls: 0]
c0: INT64 LZ4 DO:0 FPO:27500 SZ:4434451/5905528/1.33 VC:3330100 ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 8639, num_nulls: 0]
c2: INT32 LZ4 DO:0 FPO:4461951 SZ:4401/839852/190.83 VC:3330100 ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 2, num_nulls: 0]
c3: BINARY LZ4 DO:0 FPO:4466352 SZ:2645/423441/160.09 VC:3330100 ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0x41, max: 0x42, num_nulls: 0]
v1: INT64 LZ4 DO:0 FPO:4468997 SZ:26748724/26642800/1.00 VC:3330100 ENC:RLE,PLAIN,BIT_PACKED ST:[min: -9223371059640220560, max: 9223371048678785692, num_nulls: 0]

row group 2: RC:1698380 TS:17291569 OFFSET:31217721
--------------------------------------------------------------------------------
c1: BINARY LZ4 DO:0 FPO:31217721 SZ:14103/13453/0.95 VC:1698380 ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0x666F58594E7856684F4765684265714C58785A5464446B76526C4D5358635968576A586E494B727059, max: 0x7847426D495659786A7277477A756C4A4172586C66724551435547474145716A716A675342426C6761, num_nulls: 0]
c0: INT64 LZ4 DO:0 FPO:31231824 SZ:2272242/3045717/1.34 VC:1698380 ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 8639, num_nulls: 0]
c2: INT32 LZ4 DO:0 FPO:33504066 SZ:2286/428364/187.39 VC:1698380 ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 2, num_nulls: 0]
c3: BINARY LZ4 DO:0 FPO:33506352 SZ:1413/215996/152.86 VC:1698380 ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0x41, max: 0x42, num_nulls: 0]
v1: INT64 LZ4 DO:0 FPO:33507765 SZ:13642061/13588039/1.00 VC:1698380 ENC:RLE,PLAIN,BIT_PACKED ST:[min: -9223351962781870274, max: 9223365078578074233, num_nulls: 0]
{code}

What happens when I try to read this file with pyarrow.parquet:

{code:java}
Python 3.6.11 | packaged by conda-forge | (default, Jul 28 2020, 23:15:00)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> pa.__version__
'2.0.0'
>>> import pyarrow.parquet as pq
>>> pq.read_table('/tmp/b3caf545-3a3e-4c3e-9542-9ecbca341306/e86fa357-6e1f-4acd-945c-59d558a9434a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/skim/.local/Miniconda3-py38_4.8.3-Linux-x86_64/envs/pyarrow/lib/python3.6/site-packages/pyarrow/parquet.py", line 1639, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/home/skim/.local/Miniconda3-py38_4.8.3-Linux-x86_64/envs/pyarrow/lib/python3.6/site-packages/pyarrow/parquet.py", line 1517, in read
    use_threads=use_threads
  File "pyarrow/_dataset.pyx", line 405, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2262, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: IOError: Corrupt Lz4 compressed data.
{code}

> [C++][Parquet] Tracking issue for cross-implementation LZ4 Parquet
> compression compatibility
> --------------------------------------------------------------------------------------------
>
> Key: ARROW-9177
> URL: https://issues.apache.org/jira/browse/ARROW-9177
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Wes McKinney
> Assignee: Antoine Pitrou
> Priority: Critical
> Fix For: 2.0.0
>
>
> Per PARQUET-1878, it seems that there are still problems with our use of LZ4
> compression in the Parquet format. While we should fix this (the Parquet
> specification and our implementation of it), we may need to disable use of
> LZ4 compression until the appropriate compatibility testing can be done.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)