[ https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692437#comment-17692437 ]
ASF GitHub Bot commented on PARQUET-2164:
-----------------------------------------

wgtmac commented on code in PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#discussion_r1115168960


##########
parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java:
##########
@@ -164,6 +164,15 @@ public CapacityByteArrayOutputStream(int initialSlabSize, int maxCapacityHint, B
   private void addSlab(int minimumSize) {
     int nextSlabSize;
+    // check for overflow
+    try {
+      Math.addExact(bytesUsed, minimumSize);
+    } catch (ArithmeticException e) {
+      // This is interpreted as a request for a value greater than Integer.MAX_VALUE
+      // We throw OOM because that is what java.io.ByteArrayOutputStream also does
+      throw new OutOfMemoryError("Size of data exceeded 2GB (" + e.getMessage() + ")");

Review Comment:
   If we simply do an overflow check here, then the error message should say `Integer.MAX_VALUE` instead of `2GB`. Otherwise, we should explicitly check if the addition result exceeds 2GB. WDYT?


> CapacityByteArrayOutputStream overflow while writing causes negative row
> group sizes to be written
> --------------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-2164
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2164
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.12.2
>            Reporter: Parth Chandra
>            Priority: Major
>             Fix For: 1.12.3
>
>         Attachments: TestLargeDictionaryWriteParquet.java
>
>
> It is possible, while writing a parquet file, to cause
> {{CapacityByteArrayOutputStream}} to overflow.
> This is an extreme case, but it has been observed in a real-world data set.
> The attached Spark program manages to reproduce the issue.
> Short summary of how this happens -
> 1. After many small records, possibly including nulls, the dictionary page
> fills up and subsequent pages are written using plain encoding.
> 2. 
> The estimate of when to perform the page size check is based on the number
> of values observed per page so far. Let's say this is about 100K.
> 3. A sequence of very large records shows up. Let's say each of these records
> is 200K.
> 4. After 11K of these records, the size of the page has grown beyond 2GB.
> 5. {{CapacityByteArrayOutputStream}} is capable of holding more than 2GB of
> data, but it holds the size of the data in an int, which overflows.
>
> There are a couple of things to fix here -
> 1. The check for page size should check both the number of values added and
> the buffered size of the data.
> 2. {{CapacityByteArrayOutputStream}} should throw an exception if the data
> size increases beyond 2GB ({{java.io.ByteArrayOutputStream}} does exactly
> that).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
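The overflow check discussed in the review can be illustrated with a minimal standalone sketch (not the actual parquet-mr code; the class and field names here mirror but only approximate {{CapacityByteArrayOutputStream}}). {{Math.addExact}} throws {{ArithmeticException}} when the int sum overflows, which is then translated into the {{OutOfMemoryError}} that {{java.io.ByteArrayOutputStream}} also throws when a requested capacity exceeds the int range. Note that {{Integer.MAX_VALUE}} is 2^31 - 1, i.e. one byte short of 2GiB, which is why the review suggests the message say {{Integer.MAX_VALUE}} rather than {{2GB}}.

```java
// Minimal sketch of the proposed overflow guard, assuming a running
// int total of buffered bytes as in CapacityByteArrayOutputStream.
class OverflowCheckSketch {
  private int bytesUsed; // hypothetical running total of buffered bytes

  OverflowCheckSketch(int bytesUsed) {
    this.bytesUsed = bytesUsed;
  }

  void addSlab(int minimumSize) {
    try {
      // Detect int overflow up front; the sum itself is discarded.
      Math.addExact(bytesUsed, minimumSize);
    } catch (ArithmeticException e) {
      // A request beyond Integer.MAX_VALUE cannot be represented in an int;
      // mirror java.io.ByteArrayOutputStream and fail with an OOM error.
      throw new OutOfMemoryError(
          "Size of data exceeded Integer.MAX_VALUE (" + e.getMessage() + ")");
    }
    bytesUsed += minimumSize; // safe: overflow was ruled out above
  }

  int bytesUsed() {
    return bytesUsed;
  }
}
```

Without such a guard, the int total silently wraps negative, which is how the negative row group sizes described in this issue end up being written.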