[ 
https://issues.apache.org/jira/browse/HBASE-9591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772838#comment-13772838
 ] 

Gabriel Reid commented on HBASE-9591:
-------------------------------------

I don't think I totally understand the situation. 
{quote}I tried killing a region server when the slave cluster was down{quote}

So just to be totally sure, a region server on the master cluster is being 
killed, while the whole slave cluster is down, is that correct? If that's the 
case, I'm assuming that the list of region servers in the peer cluster would 
always remain empty, and wouldn't have any change events coming through ZK, and 
so the timestamp returned by ReplicationPeers#getTimestampOfLastChangeToPeer 
would stay the same. Of course, if all of that were the case then it wouldn't 
lead to this situation, so I'm definitely not understanding something.

In any case, I think it does make sense to consider things a noop (and not 
update timestamps) if the list of sinks fetched in 
ReplicationSinkManager#chooseSinks is the same as the last time.
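
To make the noop idea concrete, here is a minimal sketch. It is not the actual ReplicationSinkManager code: the SinkListTracker class, its method names, and the use of plain strings for servers are all hypothetical, and it compares the full fetched list rather than a chosen subset. The only point is that the timestamp is left alone when the fetched list has the same members as the current one, so another source sharing that timestamp would have nothing new to react to.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

/** Hypothetical stand-in for the sink bookkeeping; not the HBase class. */
class SinkListTracker {
    private List<String> sinks = new ArrayList<>();
    private long lastUpdateToPeers = -1L; // -1 means "never updated"

    /**
     * Applies a freshly fetched sink list.
     * Returns true if the list was applied (and the timestamp moved),
     * false if it had the same members as before and was treated as a noop.
     */
    synchronized boolean updateSinks(List<String> fetched, long now) {
        // Order-insensitive comparison: a reshuffled list is still "the same".
        boolean same = lastUpdateToPeers >= 0
            && new HashSet<>(fetched).equals(new HashSet<>(sinks));
        if (same) {
            return false; // noop: keep the old sinks and the old timestamp
        }
        sinks = new ArrayList<>(fetched);
        lastUpdateToPeers = now;
        return true;
    }

    synchronized long getLastUpdateToPeers() {
        return lastUpdateToPeers;
    }

    synchronized List<String> getSinks() {
        return new ArrayList<>(sinks);
    }
}
```

With the slave cluster down, the fetched list would presumably be empty every time, so in this sketch only the first call would move the timestamp and every later call would be a noop.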

[~lhofhansl] This shouldn't apply to 0.94, since the ReplicationSinkManager 
only exists in 0.95+.
                
> [replication] getting "Current list of sinks is out of date" all the time 
> when a source is recovered
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-9591
>                 URL: https://issues.apache.org/jira/browse/HBASE-9591
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.96.0
>            Reporter: Jean-Daniel Cryans
>            Priority: Minor
>             Fix For: 0.96.1
>
>
> I tried killing a region server when the slave cluster was down, from that 
> point on my log was filled with:
> {noformat}
> 2013-09-20 00:31:03,942 INFO  [regionserver60020.replicationSource,1] 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSinkManager: 
> Current list of sinks is out of date, updating
> 2013-09-20 00:31:04,226 INFO  
> [ReplicationExecutor-0.replicationSource,1-jdec2hbase0403-4,60020,1379636329634]
>  org.apache.hadoop.hbase.replication.regionserver.ReplicationSinkManager: 
> Current list of sinks is out of date, updating
> {noformat}
> The first log line is from the normal source, the second is the recovered 
> one. When we try to replicate, we call 
> replicationSinkMgr.getReplicationSink() and if the list of machines was 
> refreshed since the last time then we call chooseSinks() which in turn 
> refreshes the list of sinks and resets our lastUpdateToPeers. The next source 
> will notice the change, and will call chooseSinks() too. The first source 
> then comes around for another round, sees the list was refreshed, and calls 
> chooseSinks() again. This repeats until the recovered queue is gone.
> We could have all the sources going to the same cluster share a thread-safe 
> ReplicationSinkManager. We could also manage the same cluster separately for 
> each source. Or even easier, if the list we get in chooseSinks() is the same 
> as the one we had before, consider it a noop.
> What do you think [~gabriel.reid]?
