[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written
[ https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699221#comment-17699221 ]

ASF GitHub Bot commented on PARQUET-2164:
-----------------------------------------

wgtmac merged PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032

> CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-2164
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2164
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.12.2
>            Reporter: Parth Chandra
>            Priority: Major
>             Fix For: 1.12.3
>
>         Attachments: TestLargeDictionaryWriteParquet.java
>
> It is possible, while writing a Parquet file, to cause {{CapacityByteArrayOutputStream}} to overflow. This is an extreme case, but it has been observed in a real-world data set. The attached Spark program reproduces the issue.
>
> Short summary of how this happens:
> 1. After many small records, possibly including nulls, the dictionary page fills up and subsequent pages are written using plain encoding.
> 2. The estimate of when to perform the page size check is based on the number of values observed per page so far. Let's say this is about 100K.
> 3. A sequence of very large records shows up. Let's say each of these records is 200K.
> 4. After 11K of these records, the size of the page has gone beyond 2GB.
> 5. {{CapacityByteArrayOutputStream}} is capable of holding more than 2GB of data, but it holds the size of that data in an int, which overflows.
>
> There are a couple of things to fix here:
> 1. The page size check should consider both the number of values added and the buffered size of the data.
> 2. {{CapacityByteArrayOutputStream}} should throw an exception if the data size increases beyond 2GB ({{java.io.ByteArrayOutputStream}} does exactly that).
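For readers unfamiliar with the failure mode, here is a minimal standalone sketch (not parquet-mr code; the names `bytesUsed` and `minimumSize` are borrowed from the patch discussed in the comments below) showing how a plain `int` byte counter silently wraps negative past `Integer.MAX_VALUE`, and how `Math.addExact` turns the wrap into an exception:

```java
public class OverflowDemo {
  public static void main(String[] args) {
    // An int byte counter close to the ~2GB limit, as in the reported scenario.
    int bytesUsed = Integer.MAX_VALUE - 100_000; // nearly 2GB already buffered
    int minimumSize = 200_000;                   // one more large record arrives

    // Plain int addition silently wraps around; this is how negative page
    // and row group sizes can end up being written into file metadata.
    System.out.println(bytesUsed + minimumSize); // prints a negative number

    // Math.addExact performs the same addition but throws on overflow,
    // letting the writer fail fast instead of recording a corrupt size.
    try {
      Math.addExact(bytesUsed, minimumSize);
    } catch (ArithmeticException e) {
      System.out.println("overflow detected: " + e.getMessage()); // "integer overflow"
    }
  }
}
```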
[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written
[ https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693283#comment-17693283 ]

ASF GitHub Bot commented on PARQUET-2164:
-----------------------------------------

parthchandra commented on code in PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#discussion_r1117357547

parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java:

```diff
@@ -164,6 +164,15 @@ public CapacityByteArrayOutputStream(int initialSlabSize, int maxCapacityHint, B
   private void addSlab(int minimumSize) {
     int nextSlabSize;
+    // check for overflow
+    try {
+      Math.addExact(bytesUsed, minimumSize);
+    } catch (ArithmeticException e) {
+      // This is interpreted as a request for a value greater than Integer.MAX_VALUE
+      // We throw OOM because that is what java.io.ByteArrayOutputStream also does
+      throw new OutOfMemoryError("Size of data exceeded 2GB (" + e.getMessage() + ")");
```

Review Comment: Let me just change the message.
[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written
[ https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692437#comment-17692437 ]

ASF GitHub Bot commented on PARQUET-2164:
-----------------------------------------

wgtmac commented on code in PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#discussion_r1115168960

parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java:

```diff
@@ -164,6 +164,15 @@ public CapacityByteArrayOutputStream(int initialSlabSize, int maxCapacityHint, B
   private void addSlab(int minimumSize) {
     int nextSlabSize;
+    // check for overflow
+    try {
+      Math.addExact(bytesUsed, minimumSize);
+    } catch (ArithmeticException e) {
+      // This is interpreted as a request for a value greater than Integer.MAX_VALUE
+      // We throw OOM because that is what java.io.ByteArrayOutputStream also does
+      throw new OutOfMemoryError("Size of data exceeded 2GB (" + e.getMessage() + ")");
```

Review Comment: If we simply do an overflow check here, then the error message should say `Integer.MAX_VALUE` instead of `2GB`. Otherwise, we should explicitly check if the addition result exceeds 2GB. WDYT?
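For comparison, here is a minimal sketch of the explicit-limit alternative being suggested (illustrative only; `CapacityCheck`, `MAX_BUFFER_SIZE`, and `ensureCapacity` are hypothetical names, not parquet-mr API). Widening both `int` operands to `long` makes the sum exact, so the code can compare against a named limit and report that limit in the message:

```java
final class CapacityCheck {
  // Hypothetical hard limit; Integer.MAX_VALUE is 2^31 - 1, i.e. just under 2GiB.
  private static final long MAX_BUFFER_SIZE = Integer.MAX_VALUE;

  static void ensureCapacity(int bytesUsed, int minimumSize) {
    // The widened sum cannot overflow for two int operands,
    // so the comparison against the limit is exact.
    long newSize = (long) bytesUsed + minimumSize;
    if (newSize > MAX_BUFFER_SIZE) {
      throw new OutOfMemoryError(
          "Size of data exceeded Integer.MAX_VALUE (requested " + newSize + " bytes)");
    }
  }
}
```

Either form detects the same condition: for two non-negative `int` operands, overflow occurs exactly when the true sum exceeds `Integer.MAX_VALUE`. The difference is only how explicitly the limit appears in the code and in the error message, which is what this review thread is debating.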
[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written
[ https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692284#comment-17692284 ]

ASF GitHub Bot commented on PARQUET-2164:
-----------------------------------------

parthchandra commented on code in PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#discussion_r1114670273

parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java:

```diff
@@ -164,6 +164,15 @@ public CapacityByteArrayOutputStream(int initialSlabSize, int maxCapacityHint, B
   private void addSlab(int minimumSize) {
     int nextSlabSize;
+    // check for overflow
+    try {
+      Math.addExact(bytesUsed, minimumSize);
+    } catch (ArithmeticException e) {
+      // This is interpreted as a request for a value greater than Integer.MAX_VALUE
+      // We throw OOM because that is what java.io.ByteArrayOutputStream also does
+      throw new OutOfMemoryError("Size of data exceeded 2GB (" + e.getMessage() + ")");
```

Review Comment: The exception thrown by `Math.addExact` will happen only if integer overflow occurs, which happens when the data exceeds 2GB (i.e., greater than `Integer.MAX_VALUE`). What were you thinking of changing the message to?
[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written
[ https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691933#comment-17691933 ]

ASF GitHub Bot commented on PARQUET-2164:
-----------------------------------------

wgtmac commented on code in PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#discussion_r1113858179

parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java:

```diff
@@ -164,6 +164,15 @@ public CapacityByteArrayOutputStream(int initialSlabSize, int maxCapacityHint, B
   private void addSlab(int minimumSize) {
     int nextSlabSize;
+    // check for overflow
+    try {
+      Math.addExact(bytesUsed, minimumSize);
+    } catch (ArithmeticException e) {
+      // This is interpreted as a request for a value greater than Integer.MAX_VALUE
+      // We throw OOM because that is what java.io.ByteArrayOutputStream also does
+      throw new OutOfMemoryError("Size of data exceeded 2GB (" + e.getMessage() + ")");
```

Review Comment: The error message does not match the check above. Should we compare with 2GB explicitly, or change the message?
[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written
[ https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691820#comment-17691820 ]

ASF GitHub Bot commented on PARQUET-2164:
-----------------------------------------

parthchandra commented on code in PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#discussion_r1113658117

parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java:

```diff
@@ -164,6 +164,12 @@ public CapacityByteArrayOutputStream(int initialSlabSize, int maxCapacityHint, B
   private void addSlab(int minimumSize) {
     int nextSlabSize;
+    if (bytesUsed + minimumSize < 0) {
```

Review Comment: Updated
[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written
[ https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691348#comment-17691348 ]

ASF GitHub Bot commented on PARQUET-2164:
-----------------------------------------

wgtmac commented on code in PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#discussion_r1112489810

parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java:

```diff
@@ -164,6 +164,12 @@ public CapacityByteArrayOutputStream(int initialSlabSize, int maxCapacityHint, B
   private void addSlab(int minimumSize) {
     int nextSlabSize;
+    if (bytesUsed + minimumSize < 0) {
```

Review Comment: nit: replace it with `Math.addExact` to hand over the overflow check. That way we get a chance to catch the 2GB hard limit before the counter becomes negative.
[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written
[ https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691347#comment-17691347 ]

ASF GitHub Bot commented on PARQUET-2164:
-----------------------------------------

cxzl25 commented on PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#issuecomment-1437794593

Is it also similar to #1031?
[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written
[ https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691281#comment-17691281 ]

ASF GitHub Bot commented on PARQUET-2164:
-----------------------------------------

parthchandra opened a new pull request, #1032:
URL: https://github.com/apache/parquet-mr/pull/1032

This PR addresses [PARQUET-2164](https://issues.apache.org/jira/browse/PARQUET-2164).

The configuration parameters
```
parquet.page.size.check.estimate=false
parquet.page.size.row.check.min=
parquet.page.size.row.check.max=
```
address the reported problem. However, the issue can still be hit because the default value of `parquet.page.size.check.estimate` is `true`. This PR simply adds a check so that `CapacityByteArrayOutputStream` cannot overflow silently but will instead throw an exception.
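As a usage illustration, here is a sketch of setting those parameters on a Hadoop `Configuration` (the property keys are taken verbatim from the PR description; the numeric values are arbitrary placeholders, and whether your writer picks up this `Configuration` depends on how it is constructed):

```java
import org.apache.hadoop.conf.Configuration;

public class PageSizeCheckConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Disable the size-based estimation of when to run the next page size
    // check, so the check frequency is driven by record counts instead.
    conf.setBoolean("parquet.page.size.check.estimate", false);

    // Lower and upper bounds on the number of records written between
    // page size checks. These values are arbitrary examples, not
    // recommendations.
    conf.setInt("parquet.page.size.row.check.min", 100);
    conf.setInt("parquet.page.size.row.check.max", 10_000);
  }
}
```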
[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written
[ https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563441#comment-17563441 ]

Parth Chandra commented on PARQUET-2164:
-----------------------------------------

This is related to, but not the same as, PARQUET-2052.