[ 
https://issues.apache.org/jira/browse/PARQUET-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piyush Narang resolved PARQUET-624.
-----------------------------------
    Resolution: Not A Problem

Closing this, as there is a workaround that I missed.

> Value count used for memSize calculation in ColumnWriterV1 can be skewed 
> based on first 100 values
> --------------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-624
>                 URL: https://issues.apache.org/jira/browse/PARQUET-624
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>            Reporter: Piyush Narang
>            Assignee: Piyush Narang
>
> While digging into some OOMs in our Parquet writer jobs, I noticed that we
> were writing 250MB+ of data for a single column as one page. Our page size
> threshold is set to 1MB, so this should have produced a few hundred pages
> instead of just one.
> This seems to be due to the code in
> [ColumnWriterV1.accountForValueWritten()|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV1.java#L93].
> We only check whether we've crossed the memory threshold once valueCount
> exceeds valueCountForNextSizeCheck. However, valueCountForNextSizeCheck
> can be skewed substantially if the memSize of the first 100 values of the
> column is very small.
> For example, I see this in one of our jobs:
> {code}
> [foo_column] valueCount: 101, memSize: 16, pageSizeThreshold: 1048576
> valueCountForNextSizeCheck = (int)(valueCount + ((float)valueCount * props.getPageSizeThreshold() / memSize)) / 2 + 1;
> [foo_column] valueCountForNextSizeCheck = 3309619
> {code}
> This very large valueCountForNextSizeCheck causes our job to OOM, because
> we start seeing much larger values long before the ~3M valueCount point.
> At this point, I'm thinking of doing something simple, similar to
> [InternalParquetRecordWriter.checkBlockSizeReached()|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L143],
> which is to cap the maximum value of valueCountForNextSizeCheck:
> {code}
> valueCountForNextSizeCheck =
>           Math.min(
>             (int)(valueCount + ((float)valueCount * pageSizeThreshold / memSize)) / 2 + 1,
>             valueCount + MAX_COUNT_FOR_SIZE_CHECK // will not look more than max records ahead
>           );
> {code}
> Open to something more sophisticated if people prefer.
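
For reference, the arithmetic in the description can be reproduced with a small standalone sketch. It is not the actual ColumnWriterV1 code: the class name SizeCheckSketch, the method names, and the cap value of 10,000 for MAX_COUNT_FOR_SIZE_CHECK are assumptions made for illustration (the issue does not state the constant's value). The inputs are the ones from the log line above.

```java
// Standalone sketch of the size-check estimate quoted in the issue.
// Class name, method names, and the cap value are hypothetical.
final class SizeCheckSketch {

    // Assumed value; the issue names the constant but not its value.
    static final int MAX_COUNT_FOR_SIZE_CHECK = 10_000;

    // Uncapped estimate, using the same expression as in the issue.
    static int uncappedNextCheck(int valueCount, long memSize, int pageSizeThreshold) {
        return (int) (valueCount + ((float) valueCount * pageSizeThreshold / memSize)) / 2 + 1;
    }

    // Proposed variant: never look more than MAX_COUNT_FOR_SIZE_CHECK values ahead.
    static int cappedNextCheck(int valueCount, long memSize, int pageSizeThreshold) {
        return Math.min(
            uncappedNextCheck(valueCount, memSize, pageSizeThreshold),
            valueCount + MAX_COUNT_FOR_SIZE_CHECK);
    }

    public static void main(String[] args) {
        // Values from the log: valueCount 101, memSize 16, threshold 1048576.
        System.out.println(uncappedNextCheck(101, 16L, 1_048_576)); // 3309619
        System.out.println(cappedNextCheck(101, 16L, 1_048_576));   // 10101
    }
}
```

With the assumed cap, the next size check for foo_column would happen at value 10101 instead of 3309619, so a run of large values following a run of tiny ones would be detected after at most 10,000 more values.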



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
