[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written

2023-03-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699221#comment-17699221
 ] 

ASF GitHub Bot commented on PARQUET-2164:
-

wgtmac merged PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032




> CapacityByteArrayOutputStream overflow while writing causes negative row 
> group sizes to be written
> --
>
> Key: PARQUET-2164
> URL: https://issues.apache.org/jira/browse/PARQUET-2164
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Parth Chandra
>Priority: Major
> Fix For: 1.12.3
>
> Attachments: TestLargeDictionaryWriteParquet.java
>
>
> It is possible, while writing a parquet file, to cause {{CapacityByteArrayOutputStream}} to overflow.
> This is an extreme case, but it has been observed in a real-world data set.
> The attached Spark program reproduces the issue.
> Short summary of how this happens:
> 1. After many small records, possibly including nulls, the dictionary page fills up and subsequent pages are written using plain encoding.
> 2. The estimate of when to perform the page size check is based on the number of values observed per page so far. Let's say this is about 100K.
> 3. A sequence of very large records shows up. Let's say each of these records is 200K.
> 4. After 11K of these records, the size of the page has grown beyond 2GB.
> 5. {{CapacityByteArrayOutputStream}} is capable of holding more than 2GB of data, but it holds the size of the data in an int, which overflows.
> There are a couple of things to fix here:
> 1. The check for page size should consider both the number of values added and the buffered size of the data (see the sketch below).
> 2. {{CapacityByteArrayOutputStream}} should throw an exception if the data size grows beyond 2GB ({{java.io.ByteArrayOutputStream}} does exactly that).
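
To make the first suggested fix concrete, here is a minimal, self-contained sketch of a page-size check that triggers on either the buffered value count or the buffered byte size, so a burst of very large values cannot push a page far past the threshold between checks. This is an illustration only, not parquet-mr's actual writer code; the class, field, and method names (PageSizeCheckSketch, addValue, flushPage) and the threshold values are assumptions.

```java
// Illustrative sketch of a page-size check driven by both criteria.
public class PageSizeCheckSketch {
  private final int valueCountCheckThreshold;   // estimate-based check interval
  private final long pageSizeThresholdBytes;    // hard byte threshold per page
  private int valueCount;
  private long bufferedBytes;

  PageSizeCheckSketch(int valueCountCheckThreshold, long pageSizeThresholdBytes) {
    this.valueCountCheckThreshold = valueCountCheckThreshold;
    this.pageSizeThresholdBytes = pageSizeThresholdBytes;
  }

  void addValue(int encodedSizeBytes) {
    valueCount++;
    bufferedBytes += encodedSizeBytes;
    // Check on either criterion, not just the value-count estimate.
    if (valueCount >= valueCountCheckThreshold || bufferedBytes >= pageSizeThresholdBytes) {
      flushPage();
    }
  }

  private void flushPage() {
    System.out.printf("flushing page: %d values, %d bytes%n", valueCount, bufferedBytes);
    valueCount = 0;
    bufferedBytes = 0;
  }

  public static void main(String[] args) {
    // 100K-value estimate, 1MB page threshold: a few 200KB values still flush promptly.
    PageSizeCheckSketch writer = new PageSizeCheckSketch(100_000, 1 << 20);
    for (int i = 0; i < 10; i++) {
      writer.addValue(200_000);
    }
  }
}
```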



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written

2023-02-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693283#comment-17693283
 ] 

ASF GitHub Bot commented on PARQUET-2164:
-

parthchandra commented on code in PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#discussion_r1117357547


##
parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java:
##
@@ -164,6 +164,15 @@ public CapacityByteArrayOutputStream(int initialSlabSize, int maxCapacityHint, B
   private void addSlab(int minimumSize) {
     int nextSlabSize;
 
+    // check for overflow
+    try {
+      Math.addExact(bytesUsed, minimumSize);
+    } catch (ArithmeticException e) {
+      // This is interpreted as a request for a value greater than Integer.MAX_VALUE
+      // We throw OOM because that is what java.io.ByteArrayOutputStream also does
+      throw new OutOfMemoryError("Size of data exceeded 2GB (" + e.getMessage() + ")");

Review Comment:
   Let me just change the message.







[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written

2023-02-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692437#comment-17692437
 ] 

ASF GitHub Bot commented on PARQUET-2164:
-

wgtmac commented on code in PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#discussion_r1115168960


##
parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java:
##
@@ -164,6 +164,15 @@ public CapacityByteArrayOutputStream(int initialSlabSize, int maxCapacityHint, B
   private void addSlab(int minimumSize) {
     int nextSlabSize;
 
+    // check for overflow
+    try {
+      Math.addExact(bytesUsed, minimumSize);
+    } catch (ArithmeticException e) {
+      // This is interpreted as a request for a value greater than Integer.MAX_VALUE
+      // We throw OOM because that is what java.io.ByteArrayOutputStream also does
+      throw new OutOfMemoryError("Size of data exceeded 2GB (" + e.getMessage() + ")");

Review Comment:
   If we simply do an overflow check here, then the error message should say 
`Integer.MAX_VALUE` instead of `2GB`. Otherwise, we should explicitly check if 
the addition result exceeds 2GB. WDYT?
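
For illustration, a sketch of the alternative being discussed: widen the running total to a long and compare it against an explicit limit, so that the check and the error message agree. The names here (ExplicitLimitCheck, checkCapacity, MAX_BUFFER_SIZE) are hypothetical and not part of parquet-mr.

```java
// Hypothetical explicit-limit variant of the capacity check.
public class ExplicitLimitCheck {
  private static final long MAX_BUFFER_SIZE = Integer.MAX_VALUE;

  static void checkCapacity(int bytesUsed, int minimumSize) {
    long projected = (long) bytesUsed + minimumSize; // long math cannot overflow here
    if (projected > MAX_BUFFER_SIZE) {
      throw new OutOfMemoryError(
          "Size of data exceeded Integer.MAX_VALUE (requested " + projected + " bytes)");
    }
  }

  public static void main(String[] args) {
    try {
      checkCapacity(Integer.MAX_VALUE - 10, 200_000);
    } catch (OutOfMemoryError e) {
      System.out.println(e.getMessage()); // the limit is exceeded, so this prints
    }
  }
}
```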







[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written

2023-02-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692284#comment-17692284
 ] 

ASF GitHub Bot commented on PARQUET-2164:
-

parthchandra commented on code in PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#discussion_r1114670273


##
parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java:
##
@@ -164,6 +164,15 @@ public CapacityByteArrayOutputStream(int initialSlabSize, int maxCapacityHint, B
   private void addSlab(int minimumSize) {
     int nextSlabSize;
 
+    // check for overflow
+    try {
+      Math.addExact(bytesUsed, minimumSize);
+    } catch (ArithmeticException e) {
+      // This is interpreted as a request for a value greater than Integer.MAX_VALUE
+      // We throw OOM because that is what java.io.ByteArrayOutputStream also does
+      throw new OutOfMemoryError("Size of data exceeded 2GB (" + e.getMessage() + ")");

Review Comment:
   The exception thrown by `Math.addExact` will happen only if integer overflow occurs, which happens when the data exceeds 2GB (i.e. greater than `Integer.MAX_VALUE`).
   What were you thinking of changing the message to?







[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written

2023-02-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692282#comment-17692282
 ] 

ASF GitHub Bot commented on PARQUET-2164:
-

parthchandra commented on code in PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#discussion_r1114670273


##
parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java:
##
@@ -164,6 +164,15 @@ public CapacityByteArrayOutputStream(int initialSlabSize, int maxCapacityHint, B
   private void addSlab(int minimumSize) {
     int nextSlabSize;
 
+    // check for overflow
+    try {
+      Math.addExact(bytesUsed, minimumSize);
+    } catch (ArithmeticException e) {
+      // This is interpreted as a request for a value greater than Integer.MAX_VALUE
+      // We throw OOM because that is what java.io.ByteArrayOutputStream also does
+      throw new OutOfMemoryError("Size of data exceeded 2GB (" + e.getMessage() + ")");

Review Comment:
   The exception thrown by `Math.addExact` will happen only if integer overflow occurs, which happens when the data exceeds 2GB (i.e. greater than `Integer.MAX_VALUE`).
   What message were you thinking of?







[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written

2023-02-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691933#comment-17691933
 ] 

ASF GitHub Bot commented on PARQUET-2164:
-

wgtmac commented on code in PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#discussion_r1113858179


##
parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java:
##
@@ -164,6 +164,15 @@ public CapacityByteArrayOutputStream(int initialSlabSize, int maxCapacityHint, B
   private void addSlab(int minimumSize) {
     int nextSlabSize;
 
+    // check for overflow
+    try {
+      Math.addExact(bytesUsed, minimumSize);
+    } catch (ArithmeticException e) {
+      // This is interpreted as a request for a value greater than Integer.MAX_VALUE
+      // We throw OOM because that is what java.io.ByteArrayOutputStream also does
+      throw new OutOfMemoryError("Size of data exceeded 2GB (" + e.getMessage() + ")");

Review Comment:
   The error message does not match the check above. Should we compare against 2GB explicitly, or change the message?







[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written

2023-02-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691820#comment-17691820
 ] 

ASF GitHub Bot commented on PARQUET-2164:
-

parthchandra commented on code in PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#discussion_r1113658117


##
parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java:
##
@@ -164,6 +164,12 @@ public CapacityByteArrayOutputStream(int initialSlabSize, int maxCapacityHint, B
   private void addSlab(int minimumSize) {
     int nextSlabSize;
 
+    if (bytesUsed + minimumSize < 0) {

Review Comment:
   Updated







[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written

2023-02-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691348#comment-17691348
 ] 

ASF GitHub Bot commented on PARQUET-2164:
-

wgtmac commented on code in PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#discussion_r1112489810


##
parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java:
##
@@ -164,6 +164,12 @@ public CapacityByteArrayOutputStream(int initialSlabSize, int maxCapacityHint, B
   private void addSlab(int minimumSize) {
     int nextSlabSize;
 
+    if (bytesUsed + minimumSize < 0) {

Review Comment:
   nit: replace it with `Math.addExact` to hand over the overflow check. That also gives us the chance to check the 2GB hard limit before the size becomes negative.
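
For non-negative int operands the two guard styles detect the same condition; the suggestion is mainly about expressing intent and leaving room for an explicit hard-limit check. A small standalone comparison follows, with illustrative names only (not parquet-mr code).

```java
// Contrast of the two overflow-guard styles discussed in this thread.
public class OverflowGuardStyles {

  // Existing style: for non-negative operands, an overflowing int sum wraps
  // around to a negative value, so "< 0" detects the overflow after the fact.
  static boolean overflowsByWrap(int bytesUsed, int minimumSize) {
    return bytesUsed + minimumSize < 0;
  }

  // Suggested style: Math.addExact reports the overflow explicitly via an
  // ArithmeticException, which reads as intent and leaves room to compare a
  // widened sum against a hard limit before anything wraps.
  static boolean overflowsByAddExact(int bytesUsed, int minimumSize) {
    try {
      Math.addExact(bytesUsed, minimumSize);
      return false;
    } catch (ArithmeticException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    int bytesUsed = Integer.MAX_VALUE - 10; // buffer already near 2GB
    int minimumSize = 200_000;              // next slab request
    System.out.println(overflowsByWrap(bytesUsed, minimumSize));     // true
    System.out.println(overflowsByAddExact(bytesUsed, minimumSize)); // true
  }
}
```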







[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written

2023-02-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691347#comment-17691347
 ] 

ASF GitHub Bot commented on PARQUET-2164:
-

cxzl25 commented on PR #1032:
URL: https://github.com/apache/parquet-mr/pull/1032#issuecomment-1437794593

   Is it also similar to #1031 ?






[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written

2023-02-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691281#comment-17691281
 ] 

ASF GitHub Bot commented on PARQUET-2164:
-

parthchandra opened a new pull request, #1032:
URL: https://github.com/apache/parquet-mr/pull/1032

   This PR addresses [PARQUET-2164](https://issues.apache.org/jira/browse/PARQUET-2164).
   The configuration parameters
   ```
   parquet.page.size.check.estimate=false
   parquet.page.size.row.check.min=
   parquet.page.size.row.check.max=
   ```
   address the reported problem. However, the issue can still be hit because the default value of `parquet.page.size.check.estimate` is `true`.
   This PR simply adds a check to make sure that `CapacityByteArrayOutputStream` cannot silently overflow and will instead throw an exception.
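
For context, here is a hedged example of setting these properties on a Hadoop Configuration before writing with parquet-mr. The property keys are the ones quoted above; the boolean and row-count values chosen here are purely illustrative, not recommended defaults.

```java
import org.apache.hadoop.conf.Configuration;

public class PageSizeCheckConfigExample {
  // Returns a Configuration with the page-size-check knobs mentioned in the PR
  // description. The numeric values are illustrative only.
  public static Configuration pageSizeCheckConfig() {
    Configuration conf = new Configuration();
    // Disable the count-based estimate so the size check is not deferred.
    conf.setBoolean("parquet.page.size.check.estimate", false);
    // Lower/upper bounds on how many rows may be written between size checks.
    conf.setInt("parquet.page.size.row.check.min", 100);
    conf.setInt("parquet.page.size.row.check.max", 10000);
    return conf;
  }
}
```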






[jira] [Commented] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written

2022-07-06 Thread Parth Chandra (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563441#comment-17563441
 ] 

Parth Chandra commented on PARQUET-2164:


This is related to, but not the same as, PARQUET-2052.
