[ https://issues.apache.org/jira/browse/HBASE-7034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13550900#comment-13550900 ]
Anoop Sam John commented on HBASE-7034:
---------------------------------------

Did this code come in by mistake?
{code}
RecoverableZooKeeper#setData(String path, byte[] data, int version){
  ....
  byte[] revData = zk.getData(path, false, stat);
  int idLength = Bytes.toInt(revData, ID_LENGTH_SIZE);
  int dataLength = revData.length-ID_LENGTH_SIZE-idLength;
  int dataOffset = ID_LENGTH_SIZE+idLength;
  if(Bytes.compareTo(revData, ID_LENGTH_SIZE, id.length,
      revData, dataOffset, dataLength) == 0) {
    // the bad version is caused by previous successful setData
    return stat;
  }
}
{code}
When we write data to ZooKeeper, we also write an identifier for the process. To check whether the BADVERSION exception from ZooKeeper is due to a previous setData from the same process, we need to compare the id read from ZooKeeper with the id of this process (this.id). Or am I missing something? The code above instead compares two ranges of revData with each other, so the offset and length arithmetic and the comparison look problematic to me. If so, I guess this is the cause of this bug.

From the log it is clear that there is no problem with the node and its version at first. [As part of the state transition from OPENING to OPENED, the present data is read first, and the check shows that the data and its version are fine.] Immediately afterwards a connection loss happened. This triggers a retry of the setData. Maybe the previous operation did change the data in ZooKeeper and the master got the data-changed event. (?)

I think correcting the above code may solve the problem.
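
The check described in the comment can be sketched as follows. This is a minimal, self-contained illustration and not the HBase code: the class `ZnodeIdCheck`, its methods `wrap` and `writtenBy`, and the byte layout (a 4-byte big-endian id length, then the id, then the payload) are assumptions made for the example. The only point is that the id parsed from the znode data is compared with the caller's own id, not with another slice of the same array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch only. Assumed layout: [4-byte id length][id bytes][payload bytes]. */
final class ZnodeIdCheck {
    static final int ID_LENGTH_SIZE = Integer.BYTES;

    /** Hypothetical write path: prefix the payload with the writer's id. */
    static byte[] wrap(byte[] id, byte[] payload) {
        return ByteBuffer.allocate(ID_LENGTH_SIZE + id.length + payload.length)
                .putInt(id.length)
                .put(id)
                .put(payload)
                .array();
    }

    /** Returns true only when the stored bytes carry exactly this writer's id. */
    static boolean writtenBy(byte[] stored, byte[] id) {
        if (stored == null || stored.length < ID_LENGTH_SIZE) {
            return false; // too short to contain the length field
        }
        int idLength = ByteBuffer.wrap(stored).getInt(); // length field at offset 0
        if (idLength < 0 || idLength > stored.length - ID_LENGTH_SIZE) {
            return false; // corrupt or foreign data
        }
        // Compare the stored id against OUR id, not against the payload.
        ByteBuffer storedId = ByteBuffer.wrap(stored, ID_LENGTH_SIZE, idLength);
        return storedId.equals(ByteBuffer.wrap(id));
    }

    public static void main(String[] args) {
        byte[] me = "rs-1".getBytes(StandardCharsets.UTF_8);
        byte[] other = "rs-2".getBytes(StandardCharsets.UTF_8);
        byte[] payload = "OPENED".getBytes(StandardCharsets.UTF_8);
        System.out.println(writtenBy(wrap(me, payload), me));
        System.out.println(writtenBy(wrap(other, payload), me));
    }
}
```

In a retry path, a BadVersion error would be treated as the result of an earlier successful write only when `writtenBy` returns true for the data currently in the znode; otherwise the exception would be rethrown.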
> Bad version, failed OPENING to OPENED but master thinks it is open anyways
> --------------------------------------------------------------------------
>
>                 Key: HBASE-7034
>                 URL: https://issues.apache.org/jira/browse/HBASE-7034
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 0.94.2
>            Reporter: stack
>
> I have this in RS log:
> {code}
> 2012-10-22 02:21:50,698 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transitioning node b9,\xEE\xAE\x9BiQO\x89]+a\xE0\x7F\xB7'X?,1349052737638.9af7cfc9b15910a0b3d714bf40a3248f. from OPENING to OPENED -- closing region
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/unassigned/9af7cfc9b15910a0b3d714bf40a3248f
> {code}
> Master says this (it is bulk assigning):
> {code}
> ....
> 2012-10-22 02:21:40,673 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:10302-0xb3a862e57a503ba Set watcher on existing znode /hbase/unassigned/9af7cfc9b15910a0b3d714bf40a3248f
> ...
> then this
> ....
> 2012-10-22 02:23:47,089 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:10302-0xb3a862e57a503ba Set watcher on existing znode /hbase/unassigned/9af7cfc9b15910a0b3d714bf40a3248f
> ....
> 2012-10-22 02:24:34,176 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:10302-0xb3a862e57a503ba Retrieved 112 byte(s) of data from znode /hbase/unassigned/9af7cfc9b15910a0b3d714bf40a3248f and set watcher; region=b9,\xEE\xAE\x9BiQO\x89]+a\xE0\x7F\xB7'X?,1349052737638.9af7cfc9b15910a0b3d714bf40a3248f., origin=sv4r17s44,10304,1350872216778, state=RS_ZK_REGION_OPENED
> etc.
> {code}
> Disagreement as to what is going on here.