[ https://issues.apache.org/jira/browse/PARQUET-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760750#comment-17760750 ]
ASF GitHub Bot commented on PARQUET-2342:
-----------------------------------------

majdyz commented on PR #1135:
URL: https://github.com/apache/parquet-mr/pull/1135#issuecomment-1700478849

   Thanks @wgtmac, let me know if anything is needed from my end to proceed and get this merged.

> Parquet writer produced a corrupted file due to page value count overflow
> -------------------------------------------------------------------------
>
>                 Key: PARQUET-2342
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2342
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>            Reporter: Zamil Majdy
>            Priority: Major
>
> The Parquet writer checks only the number of rows and the page size when
> deciding whether the content being written still fits in a single page.
> For a composite column (e.g. array/map) with a lot of nulls, it is
> possible to create more than 2 billion values while staying under the
> default page-size and row-count thresholds (1 MB, 20000 rows).
>
> Repro using Spark:
> {{ val dir = "/tmp/anyrandomDirectory"}}
> {{ spark.range(0, 20000, 1, 1)}}
> {{ .selectExpr("array_repeat(cast(null as binary), 110000) as n")}}
> {{ .write}}
> {{ .mode("overwrite")}}
> {{ .save(dir)}}
> {{ val result = spark}}
> {{ .sql(s"select * from parquet.`$dir` limit 1000")}}
> {{ .collect() // This will break}}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
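
The arithmetic behind the repro can be checked with a small standalone sketch. It uses only the numbers from the issue (20000 rows, 110000 null elements per row) and assumes, as the issue title suggests, that the per-page value count is kept in a signed 32-bit integer. The object and field names are hypothetical and this is not parquet-mr code.

```scala
// Illustrative arithmetic only; names here are hypothetical, not parquet-mr APIs.
object PageValueCountOverflow {
  // Numbers from the repro: 20000 rows, each an array of 110000 nulls.
  val rows: Long = 20000L
  val valuesPerRow: Long = 110000L

  // Total values if every row ends up in the same page.
  val totalValues: Long = rows * valuesPerRow

  // Assumption: the page value count is stored as a signed 32-bit int.
  val fitsInInt: Boolean = totalValues <= Int.MaxValue

  // What a 32-bit counter would hold after wrapping around.
  val wrapped: Int = totalValues.toInt

  def main(args: Array[String]): Unit = {
    println(s"total values = $totalValues")
    println(s"Int.MaxValue = ${Int.MaxValue}")
    println(s"fits in Int  = $fitsInInt")
    println(s"wrapped      = $wrapped")
  }
}
```

Under that assumption, 2200000000 values exceed Int.MaxValue (2147483647), so the counter wraps to a negative number even though the row-count limit is never crossed and the nulls contribute very little to the page size.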