[ https://issues.apache.org/jira/browse/PARQUET-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16201943#comment-16201943 ]

Justin Uang commented on PARQUET-118:
-------------------------------------

We are running into the same issue with Spark. Some of our rows are fairly 
large, and because of the amount of off-heap storage being used, YARN is 
killing the executors for exceeding the memoryOverhead configured in Spark. 
The amount of off-heap memory used seems to scale with the size of a row, 
which appears wrong.
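
For illustration only (not part of the original report): the memoryOverhead 
mentioned above is, for Spark on YARN of that era, the 
spark.yarn.executor.memoryOverhead setting. A minimal sketch of raising it as 
a stopgap; the app name and the 2048 MB value are invented for the example:

{code:java}
import org.apache.spark.SparkConf;

public class OverheadConfigSketch {
    public static void main(String[] args) {
        // Hypothetical example: give each executor more off-heap headroom so
        // YARN does not kill the container. 2048 MB is illustrative only.
        SparkConf conf = new SparkConf()
            .setAppName("parquet-read-job")
            .set("spark.yarn.executor.memoryOverhead", "2048");
        // conf would then be passed to the SparkContext.
    }
}
{code}

This only papers over the problem, since the off-heap usage itself still 
grows with row size.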

> Provide option to use on-heap buffers for Snappy compression/decompression
> --------------------------------------------------------------------------
>
>                 Key: PARQUET-118
>                 URL: https://issues.apache.org/jira/browse/PARQUET-118
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.6.0
>            Reporter: Patrick Wendell
>
> The current code uses direct off-heap buffers for decompression. If many 
> decompressors are instantiated across multiple threads, and/or the objects 
> being decompressed are large, this can lead to a huge amount of off-heap 
> allocation by the JVM. This can be exacerbated if, overall, there is no heap 
> contention, since no GC will be performed to reclaim the space used by these 
> buffers.
> It would be nice if there were a flag we could use to simply allocate on-heap 
> buffers here:
> https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/codec/SnappyDecompressor.java#L28
> We ran into an issue today where these buffers totaled a very large amount of 
> storage and caused our Java processes (running within containers) to be 
> terminated by the kernel OOM-killer.


