Ence Wang created PARQUET-2424:
----------------------------------
Summary: Encrypted parquet files can't have more than 32767 pages per chunk: 32768
Key: PARQUET-2424
URL: https://issues.apache.org/jira/browse/PARQUET-2424
Project: Parquet
Issue Type: Bug
Affects Versions: 1.13.1
Reporter: Ence Wang
Attachments: reproduce.zip
When we were writing an encrypted file, we encountered the following error:
{code:java}
Encrypted parquet files can't have more than 32767 pages per chunk: 32768
{code}
*Error Stack:*
{code:java}
org.apache.parquet.crypto.ParquetCryptoRuntimeException: Encrypted parquet files can't have more than 32767 pages per chunk: 32768
    at org.apache.parquet.crypto.AesCipher.quickUpdatePageAAD(AesCipher.java:131)
    at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:178)
    at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:67)
    at org.apache.parquet.column.impl.ColumnWriterBase.writePage(ColumnWriterBase.java:392)
    at org.apache.parquet.column.impl.ColumnWriteStoreBase.sizeCheck(ColumnWriteStoreBase.java:231)
    at org.apache.parquet.column.impl.ColumnWriteStoreBase.endRecord(ColumnWriteStoreBase.java:216)
    at org.apache.parquet.column.impl.ColumnWriteStoreV1.endRecord(ColumnWriteStoreV1.java:29)
    at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endMessage(MessageColumnIO.java:295)
{code}
*Reasons:*
The `getBufferedSize` method of
[FallbackValuesWriter|https://github.com/apache/parquet-mr/blob/19f284355847696fa254c789ab93c42db9af5982/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L73]
returns the raw data size, and that value is used to decide whether to flush
the page. With dictionary encoding, the page actually written can therefore be
much smaller than the raw size suggests. This prevents pages from becoming too
big when fallback happens, but it can also produce too many pages in a single
column chunk. The encryption module supports at most 32767 pages per chunk,
because it uses a `Short` to store the page ordinal as part of the
[AAD|https://github.com/apache/parquet-format/blob/master/Encryption.md#442-aad-suffix].
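To make the limit concrete, here is a minimal sketch of why a `Short` page ordinal caps out at 32767. It is not the parquet-mr implementation: the class `PageOrdinalSketch` and the helpers `encodePageOrdinal` and `pagesFlushed` are hypothetical names for illustration only. The sketch assumes the ordinal is written as a 2-byte little-endian signed short, and that a page is flushed each time a fixed amount of raw data has been buffered.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch only; names are hypothetical and not part of parquet-mr.
final class PageOrdinalSketch {

    // Encodes a page ordinal as a 2-byte little-endian signed short
    // (assumed layout). Values above Short.MAX_VALUE (32767) cannot be
    // represented, so they are rejected instead of silently wrapping.
    static byte[] encodePageOrdinal(int pageOrdinal) {
        if (pageOrdinal < 0 || pageOrdinal > Short.MAX_VALUE) {
            throw new IllegalArgumentException(
                "page ordinal does not fit in a signed 16-bit short: " + pageOrdinal);
        }
        return ByteBuffer.allocate(2)
            .order(ByteOrder.LITTLE_ENDIAN)
            .putShort((short) pageOrdinal)
            .array();
    }

    // Number of pages produced if a page is flushed every time
    // flushThresholdBytes of raw data has been buffered (ceiling division).
    // The threshold is an assumed, simplified flush rule.
    static long pagesFlushed(long rawChunkBytes, long flushThresholdBytes) {
        return (rawChunkBytes + flushThresholdBytes - 1) / flushThresholdBytes;
    }

    public static void main(String[] args) {
        byte[] last = encodePageOrdinal(Short.MAX_VALUE);
        System.out.printf("ordinal %d -> %02x %02x%n",
            Short.MAX_VALUE, last[0] & 0xff, last[1] & 0xff);
        try {
            encodePageOrdinal(Short.MAX_VALUE + 1);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Because the flush decision counts raw bytes while the written pages are dictionary-encoded and small, the page count is driven by the raw size of the chunk, which is how a chunk can run past the range of a short.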
*Reproduce:*
*[^reproduce.zip]*