[ 
https://issues.apache.org/jira/browse/SPARK-18403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15685389#comment-15685389
 ] 

Cheng Lian commented on SPARK-18403:
------------------------------------

Figured it out. It's caused by a false sharing issue inside 
{{ObjectAggregationIterator}}. In short, after an {{UnsafeArrayData}} is stored 
in an aggregation buffer, which is a safe row, the underlying buffer of the 
{{UnsafeArrayData}} gets overwritten when the iterator steps forward.
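
To make the aliasing concrete, here is a minimal sketch in Python rather than Spark code. {{ArrayView}}, {{rows}} and {{collect}} are hypothetical names invented for illustration: {{ArrayView}} stands in for an unsafe array that only points at shared bytes, and the list {{buffer_slots}} plays the role of the safe aggregation buffer. The defensive copy shown is one common remedy for this class of bug and is an assumption here, not a description of what the PR does.

```python
import struct


class ArrayView:
    """Hypothetical stand-in for an unsafe array: a view over bytes it does not own."""

    def __init__(self, buf, count):
        self.buf = buf      # reference to the backing buffer, not a copy
        self.count = count  # number of 4-byte ints stored at the start of buf

    def to_list(self):
        # Decode `count` little-endian ints from whatever the buffer holds right now.
        return list(struct.unpack_from(f"<{self.count}i", self.buf, 0))

    def copy(self):
        # Slicing a bytearray allocates new storage, so the copy owns its bytes.
        return ArrayView(bytearray(self.buf[:4 * self.count]), self.count)


def rows(reused, batches):
    """Yield one view per batch, rewriting the same backing buffer on every step."""
    for values in batches:
        struct.pack_into(f"<{len(values)}i", reused, 0, *values)
        yield ArrayView(reused, len(values))


def collect(batches, defensive_copy):
    reused = bytearray(16)  # shared buffer that the iterator keeps overwriting
    buffer_slots = []       # plays the role of the "safe" aggregation buffer
    for view in rows(reused, batches):
        buffer_slots.append(view.copy() if defensive_copy else view)
    return [v.to_list() for v in buffer_slots]


batches = [[1, 2, 3], [7, 8, 9]]
aliased = collect(batches, defensive_copy=False)  # every slot sees the last batch
copied = collect(batches, defensive_copy=True)    # every slot keeps its own values
print(aliased)
print(copied)
```

Without the copy, the value stored for the first batch silently changes once the second batch is written into the shared buffer, which is the same shape of problem as described above.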

I have to say that this issue was pretty hard to debug. The large array 
allocation blows up the JVM right away, and because the allocation itself 
fails, the array never shows up in the heap dump. As a result, all the heap 
dumps are tiny (~70MB) compared to the heap size (3GB for default SBT tests) 
and contain nothing useful.

I'm opening a PR to fix this issue.

> ObjectHashAggregateSuite is being flaky (occasional OOM errors)
> ---------------------------------------------------------------
>
>                 Key: SPARK-18403
>                 URL: https://issues.apache.org/jira/browse/SPARK-18403
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>             Fix For: 2.2.0
>
>
> This test suite fails occasionally on Jenkins due to OOM errors. I've already 
> reproduced it locally but haven't figured out the root cause.
> We should probably disable it temporarily until it is fixed so that it 
> doesn't break the PR build too often.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
