[jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC

Colin Patrick McCabe (JIRA) Mon, 21 Sep 2015 11:44:26 -0700

    [ https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901165#comment-14901165 ]


Colin Patrick McCabe commented on HDFS-9107:
--------------------------------------------

bq. I don't trust monotonicNow if the thread can suspend between calls; cores 
on different sockets may give different answers, though it's not something I've 
seen in the field.

Oracle's blog here [ 
https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks ] says:

bq. If you are interested in measuring/calculating elapsed time, then always 
use System.nanoTime(). On most systems it will give a resolution on the order 
of microseconds. Be aware though, this call can also take microseconds to 
execute on some platforms.

Of course, {{System#nanoTime}} is just a very thin wrapper around the operating 
system's monotonic clock.  In x86-land, the monotonic clock generally comes 
from one of two sources: the TSC (timestamp counter) or the HPET (high 
precision event timer).
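
As a rough illustration of the elapsed-time advice quoted above, here is a 
minimal Java sketch.  The class and method names ({{ElapsedTimeSketch}}, 
{{elapsedMillis}}) are made up for this example and are not from Hadoop or 
the patch.  The point it tries to show is that only the difference between 
two {{System#nanoTime}} readings is meaningful.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical example class; not part of Hadoop or the attached patch.
public class ElapsedTimeSketch {

    /** Converts the difference of two System.nanoTime() readings to milliseconds. */
    static long elapsedMillis(long startNanos, long endNanos) {
        // nanoTime values have an arbitrary origin and may wrap around, so
        // subtract first and convert afterwards; never compare raw readings.
        return TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(50);
        long end = System.nanoTime();
        System.out.println("elapsed ms: " + elapsedMillis(start, end));
    }
}
```

Subtracting before converting keeps the result correct even if the counter 
wraps between the two readings.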

In the 2000s, the TSC became less useful as multi-core systems became common, 
because at that time the TSC wasn't synchronized across cores.  This has since 
changed (at least for Intel systems), and the TSC is now synchronized across 
cores.  So the alarm you are raising is about 5 years too late.  Anyway, if 
you have a "bad" TSC, you can still get {{System#nanoTime}} to behave 
correctly by switching your operating system's clock source to the HPET.  
It's slower, but more reliable.
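
To check which clock source a machine is using, something like the sketch 
below may help.  It assumes Linux (the comment above does not name an OS), 
where the active source is exposed through sysfs.  The class and method names 
are hypothetical, and the helper returns "unknown" if the file cannot be read.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical example class; assumes a Linux host with sysfs mounted.
public class ClockSourceSketch {

    // Linux sysfs file naming the clock source currently in use (e.g. tsc, hpet).
    static final Path LINUX_CURRENT = Paths.get(
        "/sys/devices/system/clocksource/clocksource0/current_clocksource");

    /** Returns the trimmed contents of the file, or "unknown" if it cannot be read. */
    static String readClockSource(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
        } catch (IOException e) {
            // Not Linux, no sysfs, or no permission: report rather than fail.
            return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println("clock source: " + readClockSource(LINUX_CURRENT));
    }
}
```

On Linux, the switch itself is typically made by writing the new source name 
to that same file as root, or with the {{clocksource=}} kernel boot parameter.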

If you want to read more about this, check out 
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/332570

tl;dr
1. Operating systems implement various tricks to work around TSC misbehavior
2. TSC misbehavior is becoming less common in modern CPUs
3. You don't have to use the TSC if you don't want to!

Let's let the hardware and OS people do their job and just do ours.

I agree with [~hitliuyi]... +1 for the patch.  It would be even better if we 
could also close the small window in which a GC happens at some time other 
than during the {{Thread#sleep}}.
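
To make the window concrete, here is a minimal sketch of detecting a pause 
by timing a sleep with {{System#nanoTime}}.  This is not the code in the 
attached patch; the class name, method names, interval and tolerance are all 
assumptions made up for illustration.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the code from HDFS-9107.patch.
public class PauseDetectorSketch {

    /** Milliseconds by which the measured sleep exceeded the requested one; never negative. */
    static long overshootMillis(long startNanos, long endNanos, long sleepMillis) {
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, elapsedMillis - sleepMillis);
    }

    /** True if the sleep overshot by more than the tolerance, suggesting a GC or other stall. */
    static boolean looksLikePause(long startNanos, long endNanos,
                                  long sleepMillis, long toleranceMillis) {
        return overshootMillis(startNanos, endNanos, sleepMillis) > toleranceMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        final long sleepMillis = 100;     // hypothetical check interval
        final long toleranceMillis = 50;  // hypothetical threshold
        for (int i = 0; i < 5; i++) {
            long start = System.nanoTime();
            Thread.sleep(sleepMillis);
            long end = System.nanoTime();
            if (looksLikePause(start, end, sleepMillis, toleranceMillis)) {
                System.out.println("possible pause, skipping this round");
                continue;
            }
            // A pause that starts here, after the check, is not seen by this
            // measurement: this is the window mentioned above.
            System.out.println("running periodic check");
        }
    }
}
```

Because the timestamps bracket only the sleep, a pause that begins during the 
periodic work itself goes unnoticed.  One possible way to narrow that, 
offered only as a suggestion, would be to measure from the end of the 
previous round's work rather than from just before the sleep.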

> Prevent NN's unrecoverable death spiral after full GC
> -----------------------------------------------------
>
>                 Key: HDFS-9107
>                 URL: https://issues.apache.org/jira/browse/HDFS-9107
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.0.0-alpha
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-9107.patch
>
>
> A full GC pause in the NN that exceeds the dead node interval can lead to an 
> infinite cycle of full GCs.  The most common situation that precipitates an 
> unrecoverable state is a network issue that temporarily cuts off multiple 
> racks.
> The NN wakes up and falsely starts marking nodes dead. This bloats the 
> replication queues which increases memory pressure. The replications create a 
> flurry of incremental block reports and a glut of over-replicated blocks.
> The "dead" nodes heartbeat within seconds. The NN forces a re-registration 
> which requires a full block report - more memory pressure. The NN now has to 
> invalidate all the over-replicated blocks. The extra blocks are added to 
> invalidation queues, tracked in an excess blocks map, etc - much more memory 
> pressure.
> All the memory pressure can push the NN into another full GC which repeats 
> the entire cycle.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
