[
https://issues.apache.org/jira/browse/PARQUET-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Piyush Narang resolved PARQUET-624.
-----------------------------------
Resolution: Not A Problem
Closing this as there is an existing workaround that I had missed.
> Value count used for memSize calculation in ColumnWriterV1 can be skewed
> based on first 100 values
> --------------------------------------------------------------------------------------------------
>
> Key: PARQUET-624
> URL: https://issues.apache.org/jira/browse/PARQUET-624
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Reporter: Piyush Narang
> Assignee: Piyush Narang
>
> While digging into some OOMs that we were seeing for some of our Parquet
> writer jobs, I noticed that we were writing out 250MB+ of data for a
> single column as one page. Our page size threshold is set to 1MB, so this
> should actually result in a few hundred pages instead of just one.
> This seems to be due to the code in:
> [ColumnWriterV1.accountForValueWritten()|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV1.java#L93].
> We only check whether we've crossed the memory threshold once the valueCount
> exceeds valueCountForNextSizeCheck. However, valueCountForNextSizeCheck can
> end up substantially skewed if the memSize of the first 100 values of the
> column is very small:
> For example, I see this in one of our jobs:
> {code}
> [foo_column] valueCount: 101, memSize: 16, pageSizeThreshold: 1048576
> valueCountForNextSizeCheck =
>     (int)(valueCount + ((float)valueCount * props.getPageSizeThreshold() / memSize)) / 2 + 1;
> [foo_column] valueCountForNextSizeCheck = 3309619
> {code}
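> Plugging the logged values into that formula shows how the ~3.3M figure falls out. This is just a standalone sketch of the arithmetic (variable types are assumed here), not the actual ColumnWriterV1 code:
> {code}
> // Values taken from the log line above.
> int valueCount = 101;
> long memSize = 16;
> int pageSizeThreshold = 1048576;  // props.getPageSizeThreshold()
>
> // (float) 101 * 1048576 / 16   = 6,619,136.0
> // (int)(101 + 6,619,136.0)     = 6,619,237
> // 6,619,237 / 2 + 1            = 3,309,619
> int valueCountForNextSizeCheck =
>     (int)(valueCount + ((float)valueCount * pageSizeThreshold / memSize)) / 2 + 1;
> System.out.println(valueCountForNextSizeCheck);  // prints 3309619
> {code}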
> This really large new valueCountForNextSizeCheck results in our job OOMing,
> as we start seeing much more space-consuming values well before the ~3M
> valueCount point.
> At this point, I'm thinking of doing something simple, similar to
> [InternalParquetRecordWriter.checkBlockSizeReached()|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L143]:
> cap the maximum value of valueCountForNextSizeCheck:
> {code}
> valueCountForNextSizeCheck =
>     Math.min(
>         (int)(valueCount + ((float)valueCount * pageSizeThreshold / memSize)) / 2 + 1,
>         valueCount + MAX_COUNT_FOR_SIZE_CHECK  // will not look more than max records ahead
>     );
> {code}
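> With the same logged values and a hypothetical cap (the 10,000 below is only an illustrative constant, picked to mirror the records-ahead cap idea in checkBlockSizeReached(), not a proposed value), the next size check would move from ~3.3M values ahead to 10,101:
> {code}
> // Hypothetical cap, for illustration only.
> static final int MAX_COUNT_FOR_SIZE_CHECK = 10000;
>
> // uncapped: 3,309,619
> // capped:   min(3,309,619, 101 + 10,000) = 10,101
> int valueCountForNextSizeCheck =
>     Math.min(
>         (int)(valueCount + ((float)valueCount * pageSizeThreshold / memSize)) / 2 + 1,
>         valueCount + MAX_COUNT_FOR_SIZE_CHECK);
> {code}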
> Open to something more sophisticated if people prefer.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)