[ https://issues.apache.org/jira/browse/HBASE-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack resolved HBASE-17653.
---------------------------
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s: 2.0.0

Pushed to master. Thanks for review [~toffer]

> HBASE-17624 rsgroup synchronizations will (distributed) deadlock
> ----------------------------------------------------------------
>
>                 Key: HBASE-17653
>                 URL: https://issues.apache.org/jira/browse/HBASE-17653
>             Project: HBase
>          Issue Type: Bug
>          Components: rsgroup
>            Reporter: stack
>            Assignee: stack
>             Fix For: 2.0.0
>
>         Attachments: HBASE-17653.master.001.patch,
> HBASE-17653.master.002.patch, HBASE-17653.master.003.patch
>
>
> Follow-on from HBASE-17624. HBASE-17624 made it so only one thread at a time
> has access to the rsgroup administrator. At the tail of HBASE-17624, [~toffer]
> describes a scenario under which we may end up in a (distributed) deadlock.
> Let me repeat [~toffer]'s comment...
> {code}
> Both read/write access can't be single threaded. Consider the situation:
> 1. move_rsgroup_servers is called
> 2. while #1 is happening rsgroup region is in transition (rpc thread in #1
> holds monitor lock)
> 3. while #2 is happening meta is in transition.
> Balancer tries to figure out a plan for the meta region and tries to get the
> monitor lock but can't. The rpc thread won't release the monitor lock since
> the rsgroup region never gets assigned. The rsgroup region never gets
> assigned because it can't update meta with its new state.
> There's a good chance this can be reproduced just by moving both the rsgroup
> and meta regions onto the same RS and calling move_rsgroup_servers on that RS.
> A bunch of different actors will query for group affiliation so we can't have
> writes block reads.
> ....
> In the code prior to this patch the getter methods that retrieve group
> information (getRSGroup, ofTable, OfServer, etc) don't require the monitor
> lock so the deadlock cycle is broken.
> ....
> The methods that do mutations and updates to zk and hbase:rsgroup are
> synchronized appropriately. Point me to where the incoherence is?
> {code}
> This issue is about testing/fixing/restoring rsgroup access. Will be back.
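
The point in the quoted comment, that writes must not block reads, can be illustrated with a small sketch. This is not the HBASE-17653 patch or any HBase class: GroupInfoSketch, getGroupOfServer and moveServers are hypothetical names, and the copy-on-write snapshot is just one assumed way to keep getters off the monitor lock. The demo holds the monitor in one thread (standing in for an rpc thread stuck in a synchronized mutation) and shows that a lookup from another thread still returns.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CountDownLatch;

/** Hypothetical stand-in for an rsgroup info holder; not HBase code. */
class GroupInfoSketch {
  // Immutable snapshot (server -> group), replaced wholesale on every write.
  private volatile Map<String, String> serverToGroup = Collections.emptyMap();

  /** Read path: a volatile read of the current snapshot, no monitor lock. */
  String getGroupOfServer(String server) {
    return serverToGroup.get(server);
  }

  /** Write path: serialized on the monitor; publishes a new snapshot at the end. */
  synchronized void moveServers(Set<String> servers, String targetGroup) {
    Map<String, String> next = new HashMap<>(serverToGroup);
    for (String server : servers) {
      next.put(server, targetGroup);
    }
    // Assumption: a real implementation would persist the change here
    // (e.g. to hbase:rsgroup and zk) before publishing the snapshot.
    serverToGroup = Collections.unmodifiableMap(next);
  }
}

class GroupInfoSketchDemo {
  public static void main(String[] args) throws InterruptedException {
    GroupInfoSketch info = new GroupInfoSketch();
    info.moveServers(Collections.singleton("rs1:16020"), "groupA");

    CountDownLatch lockHeld = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);

    // Simulates an rpc thread that is stuck while holding the monitor lock.
    Thread stuckWriter = new Thread(() -> {
      synchronized (info) {
        lockHeld.countDown();
        try {
          release.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });
    stuckWriter.start();
    lockHeld.await();

    // The getter does not need the monitor, so this returns while the lock is held.
    System.out.println(info.getGroupOfServer("rs1:16020")); // prints "groupA"

    release.countDown();
    stuckWriter.join();
  }
}
```

Had getGroupOfServer been declared synchronized, the lookup in main would block until the writer thread released the monitor, which is the shape of the cycle described in the quoted comment.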