[ 
https://issues.apache.org/jira/browse/HBASE-16138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph updated HBASE-16138:
---------------------------
    Description: 
If we shut down an entire HBase cluster and attempt to start it back up, we have 
to run the WAL pre-log roll that occurs before opening a region. This pre-log 
roll must record the new WAL in ReplicationQueues, and that call blocks in 
TableBasedReplicationQueues.getOrBlockOnReplicationTable() because the 
Replication Table is not up yet. The Replication Table, in turn, cannot be 
assigned because no regions can be opened. The result is a deadlock of the 
entire cluster whenever we lose Replication Table availability. 
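
The circular wait described above can be sketched in a few lines. This is only an illustration, not HBase code: the class, method names, and region name are hypothetical, a CountDownLatch stands in for "the Replication Table is online", and a timeout replaces the indefinite block so the example terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the startup deadlock; none of these names are HBase APIs.
public class ReplicationDeadlockSketch {
    static final String REPLICATION_TABLE_REGION = "replication-table-region";

    // Counted down only after the region hosting the Replication Table is open.
    private final CountDownLatch replicationTableOnline = new CountDownLatch(1);

    // Stand-in for the blocking lookup: waits until the table is online.
    // The real call blocks indefinitely; the timeout exists only for the demo.
    public boolean recordWalInReplicationQueue(String walName, long timeoutMs) {
        try {
            return replicationTableOnline.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Opening any region first rolls the WAL, and the roll must be recorded.
    public boolean openRegion(String regionName, long timeoutMs) {
        boolean recorded = recordWalInReplicationQueue("wal-for-" + regionName, timeoutMs);
        if (!recorded) {
            return false; // the region never opens
        }
        if (REPLICATION_TABLE_REGION.equals(regionName)) {
            replicationTableOnline.countDown(); // unreachable on a cold start
        }
        return true;
    }

    public static void main(String[] args) {
        ReplicationDeadlockSketch cluster = new ReplicationDeadlockSketch();
        // The only region that could break the wait is itself stuck behind it.
        boolean opened = cluster.openRegion(REPLICATION_TABLE_REGION, 200);
        System.out.println("replication table region opened: " + opened);
    }
}
```

The countDown() call is the only thing that could release the wait, and it sits behind the wait itself, which is the deadlock in miniature.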

There are a few options, but none of them seems very good:

1. Depend on ZooKeeper-based Replication until the Replication Table becomes 
available
2. Have a separate WAL for System Tables that does not perform any replication
3. Record the WAL in the ReplicationQueue asynchronously (don't block 
opening a region on this event), which could lead to inconsistent Replication 
state
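
To make the trade-off in option 3 concrete, here is a rough sketch of asynchronous recording. It is not a proposed patch: the class and method names are hypothetical, and an in-memory queue stands in for whatever buffering a real implementation would use. Region opens no longer wait, but any WAL still sitting in the pending queue is unknown to replication until the Replication Table comes online.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical model of option 3; none of these names are HBase APIs.
public class AsyncWalRecordingSketch {
    static final String REPLICATION_TABLE_REGION = "replication-table-region";

    // WALs rolled but not yet written to the Replication Table.
    private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
    // Stand-in for the contents of the Replication Table.
    private final List<String> recorded = new ArrayList<>();
    private volatile boolean replicationTableOnline = false;

    // Opening a region queues the WAL record instead of blocking on it.
    public boolean openRegion(String regionName) {
        pending.add("wal-for-" + regionName);
        if (REPLICATION_TABLE_REGION.equals(regionName)) {
            replicationTableOnline = true;
        }
        flushPending();
        return true;
    }

    // Drains queued WAL names into the table once it is available.
    public synchronized int flushPending() {
        if (!replicationTableOnline) {
            return 0;
        }
        int flushed = 0;
        String wal;
        while ((wal = pending.poll()) != null) {
            recorded.add(wal);
            flushed++;
        }
        return flushed;
    }

    public int pendingCount() {
        return pending.size();
    }

    public synchronized int recordedCount() {
        return recorded.size();
    }

    public static void main(String[] args) {
        AsyncWalRecordingSketch server = new AsyncWalRecordingSketch();
        server.openRegion("user-region");
        // Window of inconsistency: the WAL exists but replication has no record of it.
        System.out.println("pending=" + server.pendingCount()
                + " recorded=" + server.recordedCount());
        server.openRegion(REPLICATION_TABLE_REGION);
        System.out.println("pending=" + server.pendingCount()
                + " recorded=" + server.recordedCount());
    }
}
```

If the process dies while entries are still pending, those WALs are never recorded, which is the inconsistent Replication state the option warns about.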

Do you guys have any suggestions/ideas?


  was:If we shutdown an entire HBase cluster and attempt to start it back up, 
we have to run the WAL pre-log roll that occurs before opening up a region. Yet 
this pre-log roll must record the new WAL inside of ReplicationQueues. This 
method call ends up blocking on 
TableBasedReplicationQueues.getOrBlockOnReplicationTable(), because the 
Replication Table is not up yet. And we cannot assign the Replication Table 
because we cannot open any regions. This ends up deadlocking the entire cluster 
whenever we lose the replication table. 


> Cannot open regions after non-graceful shutdown due to deadlock with 
> Replication Table
> --------------------------------------------------------------------------------------
>
>                 Key: HBASE-16138
>                 URL: https://issues.apache.org/jira/browse/HBASE-16138
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Replication
>            Reporter: Joseph
>            Assignee: Joseph
>            Priority: Critical



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
