[ https://issues.apache.org/jira/browse/HIVE-17684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16254068#comment-16254068 ]

Misha Dmitriev commented on HIVE-17684:
---------------------------------------

The problem with {{MapJoinMemoryExhaustionHandler}} is a large percentage of 
false alarms about memory exhaustion. Without it, however, Hive may go into a 
"GC death spiral", where the JVM runs back-to-back full GCs but doesn't 
actually fail for a long time. Because user threads are unable to run most of 
the time, the executor stops responding, and the Spark driver eventually drops 
it. This results in hard-to-debug failures, because the logs don't make it 
clear why the executor stopped responding.

I recently added the new 
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/GcTimeMonitor.java
 class to Hadoop, which allows the user to accurately monitor the percentage of 
time that the JVM spends in GC. When this percentage stays above ~50% over a 
1-minute window, it's almost always a signal that the JVM is in the above "GC 
death spiral". Even if it is not, extremely long GC pauses are very bad for 
performance, so it makes sense to treat them the same way as an OOM, i.e. fail 
the task and ask the user to increase their executors' heap size.
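
For illustration, wiring such a monitor into an executor could look roughly 
like the following. This is only a minimal sketch, not the actual patch: I'm 
assuming the trunk constructor signature (observation window, sampling 
interval, alert threshold percentage, alert handler) and that 
{{GcTimeMonitor}} extends {{Thread}}; the {{GcWatchdog}} class and its names 
are made up for the example.

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.hadoop.util.GcTimeMonitor;

public class GcWatchdog {
  // Flipped by the alert handler once the JVM has spent more than the
  // threshold percentage of the observation window in GC.
  private static final AtomicBoolean GC_DEATH_SPIRAL = new AtomicBoolean(false);

  public static void startMonitoring() {
    // 60s observation window, 1s sampling interval, alert at 50% GC time.
    GcTimeMonitor monitor = new GcTimeMonitor(60_000, 1_000, 50,
        gcData -> GC_DEATH_SPIRAL.set(true));
    monitor.setDaemon(true);
    monitor.start();
  }

  /** Tasks can poll this instead of the MemoryMXBean-based heap check. */
  public static boolean inGcDeathSpiral() {
    return GC_DEATH_SPIRAL.get();
  }
}
{code}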

I ran some experiments where I replaced MapJoinMemoryExhaustionHandler with a 
check of the GC time percentage reported by GcTimeMonitor, and it worked well 
(a sketch of what that check could look like is below). GcTimeMonitor will 
become available to other projects when Hadoop 3.0.0-GA is released (which, 
according to the Hadoop developers, should happen in a few weeks). Currently 
Hive depends on Hadoop 3.0.0-beta1, so to use GcTimeMonitor in Hive, we will 
need to change this dependency to the Hadoop GA release. Are there any 
objections against:
(a) changing the dependency from Hadoop 3.0.0-beta1 to 3.0.0-GA
(b) replacing MapJoinMemoryExhaustionHandler with GcTimeMonitor ?
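
To make (b) concrete, the periodic check in the handler could consult the 
monitor instead of the {{MemoryMXBean}} ratio. Again just a sketch under 
assumptions: {{getLatestGcData()}} and {{getGcTimePercentage()}} are the trunk 
API as I remember it, the 50% threshold would presumably become a config, and 
the method signature here is illustrative rather than the real 
{{MapJoinMemoryExhaustionHandler}} one.

{code:java}
import org.apache.hadoop.hive.ql.exec.mapjoin.MapJoinMemoryExhaustionError;
import org.apache.hadoop.util.GcTimeMonitor;

public class GcBasedMemoryCheck {
  // Called periodically while the small-table hash map is being loaded,
  // in place of the used/max heap ratio check.
  public static void checkMemoryStatus(GcTimeMonitor gcTimeMonitor, long numRows) {
    int gcTimePercentage = gcTimeMonitor.getLatestGcData().getGcTimePercentage();
    if (gcTimePercentage > 50) {
      throw new MapJoinMemoryExhaustionError("Hash table loaded " + numRows
          + " rows while the JVM spent " + gcTimePercentage
          + "% of the last minute in GC; consider increasing executor memory");
    }
  }
}
{code}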

> HoS memory issues with MapJoinMemoryExhaustionHandler
> -----------------------------------------------------
>
>                 Key: HIVE-17684
>                 URL: https://issues.apache.org/jira/browse/HIVE-17684
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>
> We have seen a number of memory issues due to the {{HashSinkOperator}}'s use 
> of the {{MapJoinMemoryExhaustionHandler}}. This handler is meant to detect 
> scenarios where the small table is taking up too much space in memory, in 
> which case a {{MapJoinMemoryExhaustionError}} is thrown.
> The configs to control this logic are:
> {{hive.mapjoin.localtask.max.memory.usage}} (default 0.90)
> {{hive.mapjoin.followby.gby.localtask.max.memory.usage}} (default 0.55)
> The handler uses the {{MemoryMXBean}} and the following logic to estimate how 
> much memory the {{HashMap}} is consuming: 
> {{MemoryMXBean#getHeapMemoryUsage().getUsed() / 
> MemoryMXBean#getHeapMemoryUsage().getMax()}}
> The issue is that {{MemoryMXBean#getHeapMemoryUsage().getUsed()}} can be 
> inaccurate. The value it returns includes all reachable and unreachable 
> objects on the heap, so it may count a lot of garbage that the JVM simply 
> hasn't taken the time to reclaim yet. This can lead to intermittent failures 
> of this check even though a simple GC would have reclaimed enough space for 
> the process to continue working.
> We should re-think the usage of {{MapJoinMemoryExhaustionHandler}} for HoS. 
> In Hive-on-MR this probably made sense because every Hive task ran in a 
> dedicated container, so a task could assume it had created most of the data 
> on the heap. However, in Hive-on-Spark multiple Hive tasks can run in a 
> single executor, each doing different things.
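
For reference, the check described in the quoted issue essentially boils down 
to the ratio below (a simplified sketch of the logic, not the exact Hive code; 
the class, method and parameter names here are made up):

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

import org.apache.hadoop.hive.ql.exec.mapjoin.MapJoinMemoryExhaustionError;

public class HeapUsageCheck {
  // Simplified version of the periodic check the handler performs while the
  // small-table hash map is being loaded.
  public static void check(double maxMemoryUsage, long numRows) {
    MemoryMXBean memoryMXBean = ManagementFactory.getMemoryMXBean();
    // "used" counts unreachable (not yet collected) objects too, which is
    // exactly why this check can fire even when a full GC would free space.
    double usage = (double) memoryMXBean.getHeapMemoryUsage().getUsed()
        / memoryMXBean.getHeapMemoryUsage().getMax();
    if (usage > maxMemoryUsage) { // e.g. hive.mapjoin.localtask.max.memory.usage = 0.90
      throw new MapJoinMemoryExhaustionError("Hash table loaded " + numRows
          + " rows while using " + (int) (usage * 100) + "% of the maximum heap");
    }
  }
}
{code}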



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
