[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458523#comment-17458523 ]

Jorge Leitão commented on ARROW-15073:
--------------------------------------

Regarding ZSTD: I tested it myself because I wanted to understand whether the 
problem was limited to LZ4 or affected other codecs too. But I was sloppy and 
did not read the error properly. The ZSTD failure is just a pyspark-specific 
thing:

> Caused by: java.lang.RuntimeException: native zStandard library not 
> available: this version of libhadoop was built without zstd support.

and we can ignore it in this issue (sorry for the noise :/)

Regarding LZ4, I think you are both right: it is likely that the compiled 
thrift no longer contains LZ4 and that the field is thus being ignored when 
reading. For example, Rust's parquet2 crate (which also fails to read the 
file) uses 
https://github.com/jorgecarleitao/parquet-format-rs/blob/master/parquet.thrift#L479
 which contains LZ4, but that definition says it was added in 2.4, so it 
probably corresponds to the newer LZ4_RAW format(?)

> [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable 
> by (py)spark
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-15073
>                 URL: https://issues.apache.org/jira/browse/ARROW-15073
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>            Reporter: Jorge Leitão
>            Priority: Major
>
> The following snippet shows the issue:
> {code:python}
> import pyarrow as pa  # pyarrow==6.0.1
> import pyarrow.parquet
> import pyspark.sql  # pyspark==3.1.2
> path = "bla.parquet"
> t = pa.table(
>     [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
>     schema=pa.schema([pa.field("int64", pa.int64(), nullable=True)]),
> )
> pyarrow.parquet.write_table(
>     t,
>     path,
>     use_dictionary=False,
>     compression="LZ4",
> )
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> result = spark.read.parquet(path).select("int64").collect()
> {code}
> This fails with:
> {code:java}
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: 
> Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, 
> encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, 
> total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, 
> statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 
> 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 
> 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, 
> encoding:PLAIN, count:1)])
> {code}
> Found while debugging the root cause of 
> https://github.com/pola-rs/polars/issues/2018
> pyarrow reads the file correctly.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
