Ian Friedman created HBASE-10673:
------------------------------------

             Summary: [replication] Recovering a large number of queues can cause OOME
                 Key: HBASE-10673
                 URL: https://issues.apache.org/jira/browse/HBASE-10673
             Project: HBase
          Issue Type: Bug
          Components: Replication
    Affects Versions: 0.94.5
            Reporter: Ian Friedman


Recently we had an accidental shutdown of approximately 400 RegionServers in our 
cluster. Our cluster runs master-master replication to our secondary datacenter. 
Once we got the nodes sorted out and added back to the cluster, we noticed RSes 
continuing to die from OutOfMemoryErrors. Investigating the heap dumps from the 
dead RSes, I found them full of WALEdits, along with a large number of 
ReplicationSource threads created by the NodeFailoverWorker to replicate the 
recovered queues of the dead RSes.
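
For context on why this fills the heap, here is a rough back-of-envelope 
(illustrative numbers, assuming the 0.94 defaults, not measured from the dump):

{code:java}
// Each recovered queue gets its own ReplicationSource, and each source buffers
// WALEdits up to replication.source.size.capacity before shipping a batch.
long recoveredQueues = 350;                   // queues one "hero" RS ends up claiming
long perSourceBuffer = 64L * 1024 * 1024;     // replication.source.size.capacity default (64 MB)
long worstCaseBuffered = recoveredQueues * perSourceBuffer;  // ~22 GB of buffered WALEdits
// far beyond a typical region server heap, hence the OOMEs in the heap dumps
{code}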

After doing some digging into the code, and with a lot of help from [~jdcryans], 
we found that when a lot of RSes die simultaneously, the node failover process 
can leave a very small number of regionservers holding the majority of the 
recovered queues. This causes a lot of HLogs to be opened and read simultaneously.
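
To make the failover race concrete, here is a heavily simplified sketch of the 
pattern (hypothetical helper names, not the actual ReplicationZookeeper code): 
every surviving RS that notices dead RSes tries to lock and copy each dead RS's 
queue znodes, and whoever wins a lock takes that RS's entire set of queues.

{code:java}
// Simplified sketch with hypothetical helpers; every surviving RS runs this
// when it notices dead region servers, so a single RS that sees all 400 deaths
// can win most of these per-RS races.
for (String deadRs : deadRegionServers) {
  // one NodeFailoverWorker per dead RS
  if (tryLockReplicationNode(deadRs)) {        // race: first RS to create the lock znode wins
    // the winner claims *all* of the dead RS's queues
    for (String queue : copyQueuesFromDeadRs(deadRs)) {
      // each recovered queue becomes its own ReplicationSource thread
      startRecoveredReplicationSource(queue);
    }
  }
}
{code}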

As per [~jdcryans]:
{quote}
Well, there's a race to grab the queues to recover. The surviving region 
servers just scan the znode that lists all the RSes to determine which ones are 
dead (and when one finds a dead RS it forks off a NodeFailoverWorker), so if 
one RS got the notification that one RS died but in reality a bunch of them 
died, it could win a lot of those races.

My guess is after that one died, its queues were redistributed, and maybe you 
had another hero RS, but as more and more died the queues eventually got 
manageable since a single RS cannot win all the races. Something like this:

- 400 die, one RS manages to grab 350
- That RS dies, one other RS manages to grab 300 of the 350 (the other queues 
going to other region servers)
- That second RS dies, hero #3 grabs 250
- etc.

...

We don't have the concept of a limit on recovered queues, let alone the concept 
of respecting it. Aggregating the queues into one is one solution, but it would 
perform poorly, as shown by your use case, where one region server would be 
responsible for replicating the data of 300 others. It would scale better if 
the queues were more evenly distributed (and that was the original intent, 
except that in practice the dynamics are different).
{quote}
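
One direction that comes to mind (just a sketch of the idea, not a proposed 
patch, and the setting name below is made up): have a region server stop 
claiming further dead-RS queues once it is already recovering some configurable 
number, leaving the rest for other survivors.

{code:java}
// Sketch only: stop claiming additional dead RSes once this server is already
// recovering "enough" queues. "replication.recovered.queues.max" is a
// hypothetical setting; recoveredQueueCount is assumed to track the queues
// this RS is currently recovering.
int maxRecovered = conf.getInt("replication.recovered.queues.max", 10);

for (String deadRs : deadRegionServers) {
  if (recoveredQueueCount.get() >= maxRecovered) {
    break;  // leave the remaining dead RSes for other survivors to claim
  }
  if (tryLockReplicationNode(deadRs)) {
    for (String queue : copyQueuesFromDeadRs(deadRs)) {
      recoveredQueueCount.incrementAndGet();
      startRecoveredReplicationSource(queue);
    }
  }
}
{code}

This is only a crude version of the more even distribution [~jdcryans] 
describes above, but it would at least bound the memory a single RS has to 
devote to recovered queues.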



--
This message was sent by Atlassian JIRA
(v6.2#6252)
