Ence Wang created PARQUET-2424:
----------------------------------

             Summary: Encrypted parquet files can't have more than 32767 pages per chunk: 32768
                 Key: PARQUET-2424
                 URL: https://issues.apache.org/jira/browse/PARQUET-2424
             Project: Parquet
          Issue Type: Bug
    Affects Versions: 1.13.1
            Reporter: Ence Wang
         Attachments: reproduce.zip

When we were writing an encrypted file, we encountered the following error:
{code:java}
Encrypted parquet files can't have more than 32767 pages per chunk: 32768
{code}
 

*Error Stack:*
{code:java}
org.apache.parquet.crypto.ParquetCryptoRuntimeException: Encrypted parquet files can't have more than 32767 pages per chunk: 32768
        at org.apache.parquet.crypto.AesCipher.quickUpdatePageAAD(AesCipher.java:131)
        at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:178)
        at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:67)
        at org.apache.parquet.column.impl.ColumnWriterBase.writePage(ColumnWriterBase.java:392)
        at org.apache.parquet.column.impl.ColumnWriteStoreBase.sizeCheck(ColumnWriteStoreBase.java:231)
        at org.apache.parquet.column.impl.ColumnWriteStoreBase.endRecord(ColumnWriteStoreBase.java:216)
        at org.apache.parquet.column.impl.ColumnWriteStoreV1.endRecord(ColumnWriteStoreV1.java:29)
        at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endMessage(MessageColumnIO.java:295)
{code}
 

*Reasons:*
The `getBufferedSize` method of 
[FallbackValuesWriter|https://github.com/apache/parquet-mr/blob/19f284355847696fa254c789ab93c42db9af5982/parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java#L73]
returns the raw (unencoded) data size, and this value is used to decide when 
to flush a page. With dictionary encoding, the page actually written can 
therefore be much smaller than the size that triggered the flush. This 
prevents pages from being too big when fallback happens, but it can also 
produce too many pages in a single column chunk. The encryption module 
supports at most 32767 pages per chunk, because the page ordinal is stored 
as a `Short` as part of the 
[AAD|https://github.com/apache/parquet-format/blob/master/Encryption.md#442-aad-suffix].
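
To illustrate where the 32767 limit comes from, here is a minimal sketch of encoding a page ordinal into two bytes. It is not the parquet-mr code: the class `PageOrdinalLimit` and the method `encodePageOrdinal` are hypothetical names, and the 2-byte little-endian layout is an assumption based on the AAD suffix description linked above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class PageOrdinalLimit {

    // Hypothetical helper: encodes a page ordinal as a 2-byte little-endian
    // short (assumed layout). Ordinals above Short.MAX_VALUE (32767) cannot
    // be represented and are rejected.
    public static byte[] encodePageOrdinal(int pageOrdinal) {
        if (pageOrdinal < 0 || pageOrdinal > Short.MAX_VALUE) {
            throw new IllegalArgumentException(
                "Encrypted parquet files can't have more than "
                    + Short.MAX_VALUE + " pages per chunk: " + pageOrdinal);
        }
        return ByteBuffer.allocate(2)
            .order(ByteOrder.LITTLE_ENDIAN)
            .putShort((short) pageOrdinal)
            .array();
    }

    public static void main(String[] args) {
        // The largest representable ordinal still fits in two bytes.
        System.out.println(encodePageOrdinal(32767).length);
        try {
            // One more page in the same chunk exceeds the range of a short.
            encodePageOrdinal(32768);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch, the call with 32768 is rejected with the same message text as the error reported above, because the value no longer fits in a signed 16-bit integer.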
 
 
*Reproduce:*
*[^reproduce.zip]*



