[ https://issues.apache.org/jira/browse/HBASE-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jerry He resolved HBASE-16581. ------------------------------ Resolution: Duplicate > Optimize Replication queue transfers after server fail over > ----------------------------------------------------------- > > Key: HBASE-16581 > URL: https://issues.apache.org/jira/browse/HBASE-16581 > Project: HBase > Issue Type: Improvement > Reporter: Jerry He > Assignee: Jerry He > > Currently if a region server fails, the replication queue of this server will > be picked up by another region server. The problem is this queue can > possibly be huge and contains queues from other multiple or cascading server > failures. > We had such a case in production. From zk_dump: > {code} > ... > /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1467603735059: > 18748267 > /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1471723778060: > > /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1468258960080: > > /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1468204958990: > > /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1469701010649: > > /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1470409989238: > > /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1471838985073: > > /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1467142915090: > 57804890 > /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1472181000614: > > /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1471464567365: > > /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1469486466965: > > /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1467787339841: > 47812951 > ... > {code} > There were hundreds of wals hanging under this queue, coming from diferent > region servers, which took a long time to replicate. > We should have a better strategy which lets live region servers each grep > part of this nested queue, and replicate in parallel. -- This message was sent by Atlassian JIRA (v6.3.4#6332)