Jorge Leitão created ARROW-15073:
------------------------------------

             Summary: [C++][Python] LZ4-compressed parquet files are unreadable by (py)spark
                 Key: ARROW-15073
                 URL: https://issues.apache.org/jira/browse/ARROW-15073
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Jorge Leitão
The following snippet shows the issue:

{code:java}
import pyarrow as pa  # pyarrow==6.0.1
import pyarrow.parquet
import pyspark.sql  # pyspark==3.1.2

path = "bla.parquet"

t = pa.table(
    [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
    schema=pa.schema([pa.field("int64", pa.int64(), nullable=False)]),
)

pyarrow.parquet.write_table(
    t,
    path,
    use_dictionary=False,
    compression="LZ4",
)

spark = pyspark.sql.SparkSession.builder.getOrCreate()
result = spark.read.parquet(path).select("int64").collect()
{code}

Reading the file back fails on the Spark side with a Thrift protocol error:

{code:java}
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)])
{code}

Found while debugging the root cause of https://github.com/pola-rs/polars/issues/2018

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
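Since Spark complains that the `codec` field in the footer is null, a useful diagnostic (a sketch, not part of the original report; it only needs pyarrow and reuses the `bla.parquet` path from the reproduction) is to write the same kind of LZ4-compressed file and ask pyarrow itself which codec name landed in the column-chunk metadata:

{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small LZ4-compressed parquet file, then inspect which
# compression codec pyarrow recorded in the footer metadata.
path = "bla.parquet"  # same path as in the reproduction above
t = pa.table([pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])], names=["int64"])
pq.write_table(t, path, use_dictionary=False, compression="LZ4")

meta = pq.ParquetFile(path).metadata
codec = meta.row_group(0).column(0).compression
print(codec)  # the codec string pyarrow reports for this column chunk
{code}

If the codec pyarrow reports here is one that Spark's bundled parquet Thrift definitions do not know, Spark's reader would deserialize it as null, which matches the "Required field 'codec' was not present" error above.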