[jira] [Commented] (ARROW-15073) [C++][Parquet][Python] LZ4-compressed parquet files are unreadable by (py)spark

Jira Mon, 13 Dec 2021 08:53:04 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458534#comment-17458534
 ]


Jorge Leitão commented on ARROW-15073:
--------------------------------------

No, closed. Thank you for your input!


> [C++][Parquet][Python] LZ4-compressed parquet files are unreadable by 
> (py)spark
> --------------------------------------------------------------------------------
>
>                 Key: ARROW-15073
>                 URL: https://issues.apache.org/jira/browse/ARROW-15073
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>            Reporter: Jorge Leitão
>            Priority: Major
>
> The following snippet shows the issue:
> {code:python}
> import pyarrow as pa  # pyarrow==6.0.1
> import pyarrow.parquet
> import pyspark.sql  # pyspark==3.1.2
> path = "bla.parquet"
> t = pa.table(
>     [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
>     schema=pa.schema([pa.field("int64", pa.int64(), nullable=True)]),
> )
> pyarrow.parquet.write_table(
>     t,
>     path,
>     use_dictionary=False,
>     compression="LZ4",
> )
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> result = spark.read.parquet(path).select("int64").collect()
> {code}
> This fails with:
> {code:java}
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: 
> Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, 
> encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, 
> total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, 
> statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 
> 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 
> 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, 
> encoding:PLAIN, count:1)])
> {code}
> Found while debugging the root cause of 
> https://github.com/pola-rs/polars/issues/2018.
> pyarrow itself reads the file correctly.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
