[ https://issues.apache.org/jira/browse/PARQUET-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630228#comment-14630228 ]

Daniel Weeks commented on PARQUET-99:
-------------------------------------

This has been affecting us pretty regularly. I have a fix we're using 
internally; I'll clean it up and supply a patch.

There are two ways this can be triggered:

1) Rows are so large that the OOM occurs before the initial page size check 
(ColumnWriter).  This happens when you have many very large rows (multiple 
megabytes per row).  Of course, this might indicate that you have issues with 
your data, but parquet should be able to work around it.
2) The sizes of the rows vary significantly, so the estimate for the next page 
size check is determined by small rows and set very high, and is then followed 
by many large rows.  This is the case we've been running into recently (see the 
sketch after this list).

The fix I have makes the initial page size check configurable and adds an 
option to force constant checking (bypassing the estimation), because the 
estimate can be problematic with certain datasets (sketched below).
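As a minimal sketch of what those knobs could look like (the property names, 
defaults, and class here are hypothetical placeholders, not the actual patch):

import org.apache.hadoop.conf.Configuration;

// Hypothetical configuration for the page size check; property names and
// defaults are placeholders for illustration only.
public class PageSizeCheckConfig {

    // Number of values to buffer before the first size check.
    static final String INITIAL_CHECK_COUNT = "parquet.page.size.check.initial";
    // When false, bypass the extrapolation and measure the buffered size
    // after every value (constant checking).
    static final String ESTIMATE_NEXT_CHECK = "parquet.page.size.check.estimate";

    final long initialCheckCount;
    final boolean estimateNextCheck;

    PageSizeCheckConfig(Configuration conf) {
        this.initialCheckCount = conf.getLong(INITIAL_CHECK_COUNT, 100L);
        this.estimateNextCheck = conf.getBoolean(ESTIMATE_NEXT_CHECK, true);
    }

    // Decide whether the writer should measure the buffered page now;
    // nextEstimatedCheck starts at initialCheckCount and is re-estimated
    // after each check.
    boolean shouldCheckNow(long valueCount, long nextEstimatedCheck) {
        if (!estimateNextCheck) {
            return true;            // constant checking: measure every value
        }
        return valueCount >= nextEstimatedCheck;
    }
}

Constant checking trades a bit of CPU per value for a hard bound on how far the 
buffered page can overshoot the threshold, which is what matters when row sizes 
swing between bytes and megabytes.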

-Dan


> Large rows cause unnecessary OOM exceptions
> -------------------------------------------
>
>                 Key: PARQUET-99
>                 URL: https://issues.apache.org/jira/browse/PARQUET-99
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.6.0
>            Reporter: Tongjie Chen
>            Assignee: Daniel Weeks
>
> If columns contain lots of lengthy string values, the writer runs into an OOM 
> error during writing:
> 2014-09-22 19:16:11,626 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
>       at java.util.Arrays.copyOf(Arrays.java:2271)
>       at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>       at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>       at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>       at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:83)
>       at org.apache.hadoop.io.compress.CompressorStream.write(CompressorStream.java:76)
>       at parquet.bytes.CapacityByteArrayOutputStream.writeTo(CapacityByteArrayOutputStream.java:144)
>       at parquet.bytes.BytesInput$CapacityBAOSBytesInput.writeAllTo(BytesInput.java:308)
>       at parquet.bytes.BytesInput$SequenceBytesIn.writeAllTo(BytesInput.java:233)
>       at parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:108)
>       at parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:110)
>       at parquet.column.impl.ColumnWriterImpl.writePage(ColumnWriterImpl.java:147)
>       at parquet.column.impl.ColumnWriterImpl.flush(ColumnWriterImpl.java:236)
>       at parquet.column.impl.ColumnWriteStoreImpl.flush(ColumnWriteStoreImpl.java:113)
>       at parquet.hadoop.InternalParquetRecordWriter.flushStore(InternalParquetRecordWriter.java:151)
>       at parquet.hadoop.InternalParquetRecordWriter.checkBlockSizeReached(InternalParquetRecordWriter.java:130)
>       at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:122)
>       at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
>       at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
>       at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:77)
>       at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:90)
>       at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:688)
>       at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:502)
>       at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:832)
>       at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
>       at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:502)
>       at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:832)
>       at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:132)
>       at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:502)
>       at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:832)
>       at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:90)
>       at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:502)



