[
https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691820#comment-17691820
]
ASF GitHub Bot commented on PARQUET-2164:
-----------------------------------------
parthchandra commented on code in PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#discussion_r1113658117
##########
parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java:
##########
@@ -164,6 +164,12 @@ public CapacityByteArrayOutputStream(int initialSlabSize, int maxCapacityHint, B
   private void addSlab(int minimumSize) {
     int nextSlabSize;
+    if (bytesUsed + minimumSize < 0) {
Review Comment:
Updated
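
For context, a minimal, self-contained sketch of the kind of guard the diff adds (hypothetical class name and message; the actual PR code may differ): two non-negative ints whose true sum exceeds Integer.MAX_VALUE wrap around to a negative int, so a sign check on the sum detects the overflow before the buffer grows.

class SlabGuardSketch {
  private int bytesUsed; // running byte count kept in an int, so it can overflow

  void addSlab(int minimumSize) {
    // bytesUsed and minimumSize are both non-negative; if their true sum
    // exceeds Integer.MAX_VALUE, the int addition wraps to a negative value,
    // so a sign check catches the overflow before sizing the next slab.
    if (bytesUsed + minimumSize < 0) {
      throw new OutOfMemoryError(
          "Buffered data would exceed Integer.MAX_VALUE bytes: "
              + bytesUsed + " + " + minimumSize);
    }
    bytesUsed += minimumSize;
  }
}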
> CapacityByteArrayOutputStream overflow while writing causes negative row
> group sizes to be written
> --------------------------------------------------------------------------------------------------
>
> Key: PARQUET-2164
> URL: https://issues.apache.org/jira/browse/PARQUET-2164
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.12.2
> Reporter: Parth Chandra
> Priority: Major
> Fix For: 1.12.3
>
> Attachments: TestLargeDictionaryWriteParquet.java
>
>
> It is possible, while writing a parquet file, to cause
> {{CapacityByteArrayOutputStream}} to overflow.
> This is an extreme case, but it has been observed in a real-world data set.
> The attached Spark program manages to reproduce the issue.
> Short summary of how this happens -
> 1. After many small records, possibly including nulls, the dictionary page
> fills up and subsequent pages are written using plain encoding.
> 2. The estimate of when to perform the page size check is based on the number
> of values observed per page so far. Let's say this is about 100K.
> 3. A sequence of very large records shows up. Let's say each of these records
> is 200K.
> 4. After 11K of these records, the size of the page has grown beyond 2GB (see
> the worked example below).
> 5. {{CapacityByteArrayOutputStream}} is capable of holding more than 2GB of
> data, but it holds the size of the data in an int, which overflows.
> There are a couple of things to fix here -
> 1. The check for page size should check both the number of values added and
> the buffered size of the data (see the sketch below).
> 2. {{CapacityByteArrayOutputStream}} should throw an exception if the data
> size increases beyond 2GB ({{java.io.ByteArrayOutputStream}} does exactly
> that).
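
To make steps 4 and 5 concrete, a small worked example using the numbers from the description (11K records of roughly 200K bytes each; names are illustrative):

public class SizeOverflowDemo {
  public static void main(String[] args) {
    long buffered = 11_000L * 200_000L;               // 2,200,000,000 bytes buffered
    System.out.println(buffered > Integer.MAX_VALUE); // true; the max int is 2,147,483,647
    int sizeField = (int) buffered;                   // wraps modulo 2^32
    System.out.println(sizeField);                    // -2094967296: a negative "size"
  }
}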
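And a hedged sketch of fix (1), with hypothetical names and threshold: the point is that the flush decision consults the buffered byte count as well as the value-count estimate, so a burst of very large values cannot run far past the page size limit between checkpoints.

class PageSizeCheckSketch {
  private static final long PAGE_SIZE_THRESHOLD = 1L << 20; // e.g. a 1MB target page size
  private int valueCount;
  private int valueCountForNextCheck = 100_000; // estimate derived from earlier pages
  private long bufferedBytes;

  void accountForValue(int encodedSize) {
    valueCount++;
    bufferedBytes += encodedSize;
    // Checking bytes as well as values bounds the page size even when a few
    // very large values arrive between value-count checkpoints.
    if (valueCount >= valueCountForNextCheck || bufferedBytes >= PAGE_SIZE_THRESHOLD) {
      flushPage();
    }
  }

  private void flushPage() {
    // write the page out, then reset the per-page counters
    valueCount = 0;
    bufferedBytes = 0;
  }
}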