[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458534#comment-17458534 ]
Jorge Leitão commented on ARROW-15073:
--------------------------------------

No, closed. Thank you for your input!

> [C++][Parquet][Python] LZ4-compressed parquet files are unreadable by (py)spark
> --------------------------------------------------------------------------------
>
>                 Key: ARROW-15073
>                 URL: https://issues.apache.org/jira/browse/ARROW-15073
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>            Reporter: Jorge Leitão
>            Priority: Major
>
> The following snippet shows the issue:
> {code:java}
> import pyarrow as pa  # pyarrow==6.0.1
> import pyarrow.parquet
> import pyspark.sql  # pyspark==3.1.2
>
> path = "bla.parquet"
>
> t = pa.table(
>     [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
>     schema=pa.schema([pa.field("int64", pa.int64(), nullable=True)]),
> )
>
> pyarrow.parquet.write_table(
>     t,
>     path,
>     use_dictionary=False,
>     compression="LZ4",
> )
>
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
>
> result = spark.read.parquet(path).select("int64").collect()
> {code}
> This fails with:
> {code:java}
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException:
> Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64,
> encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10,
> total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4,
> statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 00,
> null_count:0, max_value:09 00 00 00 00 00 00 00,
> min_value:00 00 00 00 00 00 00 00),
> encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)])
> {code}
> Found while debugging the root cause of
> https://github.com/pola-rs/polars/issues/2018
> pyarrow reads the file correctly.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)