Piyush Narang created PARQUET-624:
-------------------------------------

             Summary: Value count used for memSize calculation in 
ColumnWriterV1 can be skewed based on first 100 values
                 Key: PARQUET-624
                 URL: https://issues.apache.org/jira/browse/PARQUET-624
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
            Reporter: Piyush Narang
            Assignee: Piyush Narang


While digging into some OOMs that we were seeing for some of our Parquet writer 
jobs, I noticed that we were writing out around 250MB+ of data for a single 
column as one page. Our page size threshold is set to 1MB so this should 
actually result in a few hundred pages instead of just 1. 

This seems to be caused by the code in 
[ColumnWriterV1.accountForValueWritten()|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV1.java#L93].
We only check whether we've crossed the memory threshold once valueCount exceeds 
valueCountForNextSizeCheck. However, valueCountForNextSizeCheck can end up 
badly skewed if the memSize of the first 100 values of the column is very small. 
For example, I see this in one of our jobs:
{code}
[foo_column] valueCount: 101, memSize: 16, pageSizeThreshold: 1048576

valueCountForNextSizeCheck = (int)(valueCount + ((float)valueCount * 
props.getPageSizeThreshold() / memSize)) / 2 + 1;

[foo_column] valueCountForNextSizeCheck = 3309619
{code}
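
To make the arithmetic above easy to check, here is a minimal standalone sketch that evaluates the same expression with the logged values. The class and method names (SizeCheckEstimate, nextSizeCheck) are made up for illustration and are not part of parquet-mr; memSize is assumed to be a long and the page size threshold an int.

```java
// Standalone reproduction of the estimate quoted in this report.
// SizeCheckEstimate and nextSizeCheck are hypothetical names, not parquet-mr APIs.
final class SizeCheckEstimate {

    // Same expression as in the report: extrapolate from the average value
    // size seen so far, then pick the midpoint between the current count
    // and the projected count at which the threshold would be reached.
    static int nextSizeCheck(int valueCount, long memSize, int pageSizeThreshold) {
        return (int) (valueCount + ((float) valueCount * pageSizeThreshold / memSize)) / 2 + 1;
    }

    public static void main(String[] args) {
        // Values from the log line above: 101 values buffered in 16 bytes, 1MB threshold.
        System.out.println(nextSizeCheck(101, 16, 1048576)); // prints 3309619
    }
}
```

With only 16 bytes buffered for 101 values, the projection assumes every later value is just as small, which pushes the next size check out by roughly 3.3M values.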

This very large valueCountForNextSizeCheck results in our job OOMing, because 
much larger values start arriving long before the ~3.3M valueCount point and 
the buffered size is not checked again until then. 

At this point, I'm thinking of doing something simple, similar to 
[InternalParquetRecordWriter.checkBlockSizeReached()|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L143]:
 cap the maximum value of valueCountForNextSizeCheck:
{code}
valueCountForNextSizeCheck =
          Math.min(
            (int)(valueCount + ((float)valueCount * pageSizeThreshold / memSize)) / 2 + 1,
            valueCount + MAX_COUNT_FOR_SIZE_CHECK // will not look more than max records ahead
          );
{code}
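
As a rough illustration of the effect of such a cap, the sketch below applies it to the values from the failing job. The class name CappedSizeCheck is hypothetical, and the value 10000 for MAX_COUNT_FOR_SIZE_CHECK is an assumption chosen for the example; this report does not fix a value.

```java
// Sketch of the proposed cap, outside of parquet-mr.
// CappedSizeCheck is a hypothetical name; the constant's value is an assumption.
final class CappedSizeCheck {

    // Assumed cap for illustration only; the report does not specify a value.
    static final int MAX_COUNT_FOR_SIZE_CHECK = 10_000;

    static int nextSizeCheck(int valueCount, long memSize, int pageSizeThreshold) {
        // Uncapped estimate, identical to the expression in the report.
        int estimate =
            (int) (valueCount + ((float) valueCount * pageSizeThreshold / memSize)) / 2 + 1;
        // Never look more than MAX_COUNT_FOR_SIZE_CHECK values ahead.
        return Math.min(estimate, valueCount + MAX_COUNT_FOR_SIZE_CHECK);
    }

    public static void main(String[] args) {
        // Skewed case from the report: the cap wins.
        System.out.println(nextSizeCheck(101, 16, 1048576)); // prints 10101
        // Well-behaved case: the estimate is below the cap and is used as-is.
        System.out.println(nextSizeCheck(1000, 524288, 1048576)); // prints 1501
    }
}
```

Under these assumptions the skewed column would be re-checked after at most 10000 more values instead of ~3.3M, while columns with a realistic early estimate are unaffected.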

Open to something more sophisticated if people prefer. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
