[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458523#comment-17458523 ]
Jorge Leitão commented on ARROW-15073:
--------------------------------------

Regarding ZSTD: I tested it myself because I wanted to understand whether the problem was limited to LZ4 or affected other codecs too, but I was sloppy and did not read the error properly. The ZSTD failure is a pyspark-specific issue:

> Caused by: java.lang.RuntimeException: native zStandard library not
> available: this version of libhadoop was built without zstd support.

so we can ignore it in this issue (sorry for the noise :/).

Regarding LZ4, I think you are both right: it is probable that the compiled thrift no longer contains LZ4, and it is thus being ignored when reading. For example, Rust's parquet2 crate (which also fails to read the file) uses https://github.com/jorgecarleitao/parquet-format-rs/blob/master/parquet.thrift#L479, which does contain LZ4; however, that definition says it was added in 2.4, so this probably corresponds to the newer LZ4_RAW format(?)


> [C++][Parquet][Python] LZ4- and zstd-compressed parquet files are unreadable by (py)spark
> -----------------------------------------------------------------------------------------
>
>                 Key: ARROW-15073
>                 URL: https://issues.apache.org/jira/browse/ARROW-15073
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>            Reporter: Jorge Leitão
>            Priority: Major
>
> The following snippet shows the issue:
> {code:python}
> import pyarrow as pa  # pyarrow==6.0.1
> import pyarrow.parquet
> import pyspark.sql  # pyspark==3.1.2
>
> path = "bla.parquet"
>
> t = pa.table(
>     [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
>     schema=pa.schema([pa.field("int64", pa.int64(), nullable=True)]),
> )
>
> pyarrow.parquet.write_table(
>     t,
>     path,
>     use_dictionary=False,
>     compression="LZ4",
> )
>
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> result = spark.read.parquet(path).select("int64").collect()
> {code}
> This fails with:
> {code:java}
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException:
> Required field 'codec' was not present!
> Struct: ColumnMetaData(type:INT64, encodings:[PLAIN, RLE],
> path_in_schema:[int64], codec:null, num_values:10,
> total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4,
> statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 00,
> null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 00 00 00),
> encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)])
> {code}
> Found while debugging the root cause of
> https://github.com/pola-rs/polars/issues/2018
> pyarrow reads the file correctly.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)