[jira] [Commented] (HBASE-9469) Synchronous replication
[ https://issues.apache.org/jira/browse/HBASE-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915412#comment-13915412 ] terry zhang commented on HBASE-9469: Hi Feng Honghua, what about the MySQL SemiSyncReplication solution? (https://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplicationDesign) We need to make sure a client write to both sides is one transaction if we want the data to be consistent. Synchronous replication --- Key: HBASE-9469 URL: https://issues.apache.org/jira/browse/HBASE-9469 Project: HBase Issue Type: New Feature Reporter: Feng Honghua Scenario: A/B clusters with master-master replication; the client writes to cluster A, A pushes all writes to cluster B, and when cluster A is down the client switches its writes to cluster B. But the client's write switch is unsafe because replication between A and B is asynchronous: a delete sent to cluster B that aims to remove a put written earlier can fail because that put was written to cluster A and was not successfully pushed to B before A went down. It can be worse: if this delete is collected (a flush and then a major compaction occur) before cluster A comes back up and that put is eventually pushed to B, the put will never be deleted. Can we provide per-table/per-peer synchronous replication which ships the corresponding hlog entry of a write before responding with write success to the client? With this we can guarantee to the client that all write requests for which it got a success response when writing to cluster A are already in cluster B as well. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
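For illustration, a minimal sketch of the synchronous shipping idea described in this issue, assuming a hypothetical PeerShipper interface and WalEntry type (neither is an existing HBase API): success is reported to the client only after the corresponding WAL entry has been accepted by the peer cluster.
{code:title=SyncReplicationSketch.java|borderStyle=solid}
// Illustrative sketch only; PeerShipper and WalEntry are hypothetical types.
import java.io.IOException;

interface PeerShipper {
  /** Ship a WAL entry to the peer cluster and block until the peer has persisted it. */
  void shipSync(WalEntry entry) throws IOException;
}

final class WalEntry {
  final byte[] row;
  final byte[] payload;
  WalEntry(byte[] row, byte[] payload) { this.row = row; this.payload = payload; }
}

final class SyncReplicatingWriter {
  private final PeerShipper peer;
  SyncReplicatingWriter(PeerShipper peer) { this.peer = peer; }

  /** Returns only after both the local WAL append and the peer shipment succeed. */
  void write(WalEntry entry) throws IOException {
    appendToLocalWal(entry); // normal local durability
    peer.shipSync(entry);    // the synchronous step proposed in this issue
    // only now is success reported back to the client
  }

  private void appendToLocalWal(WalEntry entry) { /* local WAL append elided */ }
}
{code}
With such a contract, a client that got a success response from cluster A could safely issue its later delete against cluster B.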
[jira] [Created] (HBASE-9214) CatalogJanitor delete region info in Meta during Restore snapshot
terry zhang created HBASE-9214: -- Summary: CatalogJanitor delete region info in Meta during Restore snapshot Key: HBASE-9214 URL: https://issues.apache.org/jira/browse/HBASE-9214 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.10 Reporter: terry zhang -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-9214) CatalogJanitor delete region info in Meta during Restore snapshot
[ https://issues.apache.org/jira/browse/HBASE-9214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-9214: --- Description: Today I met an issue during restore snapshot. It can be reproduced with the steps below: 1. Create snapshot s1 of table t1 successfully 2. Region r1 in t1 splits 3. The CatalogJanitor chore starts working and finds the daughters no longer hold references, so r1 can be deleted 4. Restore snapshot s1; RestoreSnapshotHelper adds region r1 back to the meta table 5. CatalogJanitor deletes the r1 region info in meta which RestoreSnapshotHelper just inserted 6. The restore snapshot finishes. Afterwards we can find there is a hole in t1: data loss. CatalogJanitor delete region info in Meta during Restore snapshot - Key: HBASE-9214 URL: https://issues.apache.org/jira/browse/HBASE-9214 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.10 Reporter: terry zhang Today I met an issue during restore snapshot. It can be reproduced with the steps below: 1. Create snapshot s1 of table t1 successfully 2. Region r1 in t1 splits 3. The CatalogJanitor chore starts working and finds the daughters no longer hold references, so r1 can be deleted 4. Restore snapshot s1; RestoreSnapshotHelper adds region r1 back to the meta table 5. CatalogJanitor deletes the r1 region info in meta which RestoreSnapshotHelper just inserted 6. The restore snapshot finishes. Afterwards we can find there is a hole in t1: data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-9214) CatalogJanitor delete region info in Meta during Restore snapshot
[ https://issues.apache.org/jira/browse/HBASE-9214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739503#comment-13739503 ] terry zhang commented on HBASE-9214: Below is some log during testing : 1. CatalogJanitor decide to delete region 2013-08-13 16:19:34,268 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Daughter regiondir does not exist: hdfs://dw77.kgb.sqa.c m4:9900/hbase-test3-snap/writetest/b95023663816ecf208f7ee3d69d8fb9c 2013-08-13 16:19:34,268 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Daughter regiondir does not exist: hdfs://dw77.kgb.sqa.c m4:9900/hbase-test3-snap/writetest/1c47cf89406349c159f36ae2e0c55582 2013-08-13 16:19:34,268 DEBUG org.apache.hadoop.hbase.master.CatalogJanitor: Deleting region writetest,P 3,1376370447956.bce523f334ca2c78f2c4ad360dc2a6a4. because daughter splits no longer hold references 2 . RestoreSnapshotHelper insert region info to META 2013-08-13 16:19:34,273 DEBUG org.apache.hadoop.hbase.catalog.MetaEditor: Add to META, regions: [{NAME = 'writetest,HZZ ZZX,1376370447956.05a9a252b41e8e41aa3a6a8a797380ab.', STARTKEY = 'H X', ENDKEY = 'LI', ENCODED = 05a9a252b41e8e41aa3a6a8a797380ab,}, { NAME = 'writetest,3L,1376370447956.5689d3dbc9dc97106c28945ab7bb5104.', STARTKEY = '3L', ENDKEY = '76', ENCODED = 568 9d3dbc9dc97106c28945ab7bb5104,}, {NAME = 'writetest,76,1376370447956.5b5a7331c26ead c41ea2944939fd2cee.', STARTKEY = '76', ENDKEY = 'A R', ENCODED = 5b5a7331c26eadc41ea2944939fd2cee,}, {NAME = 'writetest,A R,1376370447956.60226c1fbedc9d98f1192e37d7d3b6af.', STARTKEY = 'AR', ENDKEY = 'EC', ENCODED = 60226c1fbedc9d98f1192e37d7d3b6af,}, {NAME = 'writetest,LLL LLI,1376370447956.70523ec9e600da52b18caa45cf76cf66.', STARTKEY = 'L I', ENDKEY = 'P3', ENCODED = 70523ec9e600da52b18caa45cf76cf66, }, {NAME = 'writetest,SO,1376370447956.74f9eef1cbb3e4f6627ee39a0996dec8.', STARTKEY = 'SO', ENDKEY = 'W9', ENCODED = 74f9eef1cbb3e4f6627ee39a0996dec8,}, {NAME = 'writetest,W9,1376370447956.82cf586c79 382a89508c3325304904f3.', STARTKEY = 'W9', ENDKEY = '', ENCODED = 82cf586c79382a8 9508c3325304904f3,}, {NAME = 'writetest,P3,1 376370447956.bce523f334ca2c78f2c4ad360d c2a6a4.', STARTKEY = 'P3', ENDKEY = 'S O', ENCODED = bce523f334ca2c78f2c4ad360dc2a6a4,}, {NAME = 'writetest,,1376370447944.d3eb975033a94730c88bf6696a413e9e.', STARTK EY = '', ENDKEY = '3L', ENCODED = d3eb975033a94730c88bf6696a413e9e,}, {NAME = 'w ritetest,EC,1376370447956.f03a2b6805852c00e4fe029f3a9e7261.', STARTKEY = 'E C', ENDKEY = 'HX', ENCODED = f03a2b6805852 c00e4fe029f3a9e7261,}] 3. region info is delete by CatalogJanitor 2013-08-13 16:19:34,333 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Deleted region writetest,P7 7773,1376370447956.bce523f334ca2c78f2c4ad360dc2a6a4. from META CatalogJanitor delete region info in Meta during Restore snapshot - Key: HBASE-9214 URL: https://issues.apache.org/jira/browse/HBASE-9214 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.10 Reporter: terry zhang Today I meet a issue during restore snapshot. It can be reproduce in step below: 1. Table t1 create a Snapshot s1 successfully 2. region r1 in t1 split 3. CatalogJanitor Chore begin to work and found daughter do not have reference , so r1 can be deleted 4. restore snapshot s1 . RestoreSnapshotHelper add region r1 to meta table 5. CatalogJanitor
[jira] [Commented] (HBASE-9214) CatalogJanitor delete region info in Meta during Restore snapshot
[ https://issues.apache.org/jira/browse/HBASE-9214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739505#comment-13739505 ] terry zhang commented on HBASE-9214: I wonder if we could change CatalogJanitor to not remove split region info while the table is disabled, to avoid this issue? CatalogJanitor delete region info in Meta during Restore snapshot - Key: HBASE-9214 URL: https://issues.apache.org/jira/browse/HBASE-9214 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.10 Reporter: terry zhang Today I met an issue during restore snapshot. It can be reproduced with the steps below: 1. Create snapshot s1 of table t1 successfully 2. Region r1 in t1 splits 3. The CatalogJanitor chore starts working and finds the daughters no longer hold references, so r1 can be deleted 4. Restore snapshot s1; RestoreSnapshotHelper adds region r1 back to the meta table 5. CatalogJanitor deletes the r1 region info in meta which RestoreSnapshotHelper just inserted 6. The restore snapshot finishes. Afterwards we can find there is a hole in t1: data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
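For illustration, a minimal sketch of the guard suggested in the comment above, with isTableDisabled and mayCleanParent as hypothetical stand-ins rather than the real CatalogJanitor code: split-parent cleanup is skipped while the owning table is disabled, as it is during a restore.
{code:title=CatalogJanitorGuardSketch.java|borderStyle=solid}
// Illustrative sketch only; not the real CatalogJanitor code.
final class CatalogJanitorGuard {
  interface TableState {
    boolean isTableDisabled(String tableName);
  }

  private final TableState state;
  CatalogJanitorGuard(TableState state) { this.state = state; }

  /** Returns true if the split parent row may be cleaned from meta right now. */
  boolean mayCleanParent(String tableName) {
    if (state.isTableDisabled(tableName)) {
      // Restore snapshot runs against a disabled table; leave its meta rows alone.
      return false;
    }
    return true;
  }
}
{code}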
[jira] [Created] (HBASE-9201) Hfile will be deleted after deleteColumn instead be archived
terry zhang created HBASE-9201: -- Summary: Hfile will be deleted after deleteColumn instead be archived Key: HBASE-9201 URL: https://issues.apache.org/jira/browse/HBASE-9201 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.10 Reporter: terry zhang -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-9201) Hfile will be deleted after deleteColumn instead be archived
[ https://issues.apache.org/jira/browse/HBASE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737756#comment-13737756 ] terry zhang commented on HBASE-9201: Now we can see the hfile will be deleted in the master file system instead of being archived.
{code:title=MasterFileSystem.java|borderStyle=solid}
public void deleteFamilyFromFS(HRegionInfo region, byte[] familyName)
    throws IOException {
  // archive family store files
  Path tableDir = FSUtils.getTableDir(rootdir, region.getTableName());
  HFileArchiver.archiveFamily(fs, conf, region, tableDir, familyName);

  // delete the family folder
  Path familyDir = new Path(tableDir,
      new Path(region.getEncodedName(), Bytes.toString(familyName)));
  if (fs.delete(familyDir, true) == false) {
    throw new IOException("Could not delete family " + Bytes.toString(familyName)
        + " from FileSystem for region " + region.getRegionNameAsString() + "("
        + region.getEncodedName() + ")");
  }
}
{code}
Should we use the archiveStoreFiles interface instead of fs.delete? Hfile will be deleted after deleteColumn instead be archived Key: HBASE-9201 URL: https://issues.apache.org/jira/browse/HBASE-9201 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.10 Reporter: terry zhang -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
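For illustration only, a small self-contained sketch of the archive-then-delete idea (this is not the real HFileArchiver code; the archive layout is an assumption): store files are first moved under an archive directory, and only then is the now-empty family directory removed.
{code:title=ArchiveThenDeleteSketch.java|borderStyle=solid}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class ArchiveThenDelete {
  /** Moves every store file under archiveDir before removing the family directory. */
  static void deleteFamilySafely(FileSystem fs, Path familyDir, Path archiveDir)
      throws IOException {
    fs.mkdirs(archiveDir);
    for (FileStatus f : fs.listStatus(familyDir)) {
      Path target = new Path(archiveDir, f.getPath().getName());
      if (!fs.rename(f.getPath(), target)) {
        throw new IOException("Could not archive " + f.getPath());
      }
    }
    // Only after archiving do we remove the (now empty) family directory,
    // so snapshot references / HFileLinks can still resolve the old files.
    if (!fs.delete(familyDir, true)) {
      throw new IOException("Could not delete " + familyDir);
    }
  }
}
{code}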
[jira] [Commented] (HBASE-9201) Hfile will be deleted after deleteColumn instead be archived
[ https://issues.apache.org/jira/browse/HBASE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737759#comment-13737759 ] terry zhang commented on HBASE-9201: Now we have found this issue is serious when we call deleteColumn after creating a snapshot. If we then restore the snapshot, the HFileLink files cannot be found and the table cannot be enabled. Hfile will be deleted after deleteColumn instead be archived Key: HBASE-9201 URL: https://issues.apache.org/jira/browse/HBASE-9201 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.10 Reporter: terry zhang -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-9201) Hfile will be deleted after deleteColumn instead be archived
[ https://issues.apache.org/jira/browse/HBASE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737768#comment-13737768 ] terry zhang commented on HBASE-9201: deleteTable still has the same problem:
{code:title=MasterFileSystem.java|borderStyle=solid}
public void deleteTable(byte[] tableName) throws IOException {
  fs.delete(new Path(rootdir, Bytes.toString(tableName)), true);
}
{code}
Hfile will be deleted after deleteColumn instead be archived Key: HBASE-9201 URL: https://issues.apache.org/jira/browse/HBASE-9201 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.10 Reporter: terry zhang -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (HBASE-9201) Hfile will be deleted after deleteColumn instead be archived
[ https://issues.apache.org/jira/browse/HBASE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang resolved HBASE-9201. Resolution: Not A Problem Hfile will be deleted after deleteColumn instead be archived Key: HBASE-9201 URL: https://issues.apache.org/jira/browse/HBASE-9201 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.10 Reporter: terry zhang -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7735) Prevent regions from moving during online snapshot.
[ https://issues.apache.org/jira/browse/HBASE-7735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13606149#comment-13606149 ] terry zhang commented on HBASE-7735: Hi Jonathan Hsieh, I mean a region that is moving between region servers before the region servers call getOnlineRegions. Such a moving region does not belong to any region server at that moment. Could this case lose data in the snapshot without the user knowing about it? Prevent regions from moving during online snapshot. --- Key: HBASE-7735 URL: https://issues.apache.org/jira/browse/HBASE-7735 Project: HBase Issue Type: Sub-task Reporter: Jonathan Hsieh To increase the probability of snapshots succeeding, we should attempt to prevent splits and region moves from happening. Currently we take region locks, but this could be too late and results in an aborted snapshot. We should probably take the table lock (0.96) when starting a snapshot, and for a 0.94 backport we should probably disable the balancer. This will probably not be tackled until after the trunk merge. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7735) Prevent regions from moving during online snapshot.
[ https://issues.apache.org/jira/browse/HBASE-7735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607173#comment-13607173 ] terry zhang commented on HBASE-7735: Hi Jonathan Hsieh, can we use the method below to increase the probability of snapshots succeeding? 1. Get the complete region list in the master, including moving regions, online regions and splitting regions (parent and child). 2. Check the region list to make sure there is no hole in it. 3. Generate a snapshot task assignment map file in the target snapshot folder (also assign the moving regions and splitting parent regions to some region server). 4. When the region server starts buildSubprocedure, it compares its online regions with the assigned region list in the task file the master generated. If a region is online it becomes a FlushSnapshotSubprocedure; if it is not online we can treat it as a closed region, so we do not need to flush the cache, only to create the reference files (empty files). If the region is already in the snapshot folder we can just skip it. I think this would be helpful because in a large cluster moving and splitting regions are a normal situation, so the snapshot may otherwise always fail during verification. What do you think? Prevent regions from moving during online snapshot. --- Key: HBASE-7735 URL: https://issues.apache.org/jira/browse/HBASE-7735 Project: HBase Issue Type: Sub-task Reporter: Jonathan Hsieh To increase the probability of snapshots succeeding, we should attempt to prevent splits and region moves from happening. Currently we take region locks, but this could be too late and results in an aborted snapshot. We should probably take the table lock (0.96) when starting a snapshot, and for a 0.94 backport we should probably disable the balancer. This will probably not be tackled until after the trunk merge. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
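To make step 2 of the proposal above concrete, here is a small self-contained sketch (illustrative only, not existing HBase code) of checking that a region list sorted by start key has no holes: each region must start where the previous one ended, the first start key must be empty, and the last end key must be empty.
{code:title=RegionListCheckSketch.java|borderStyle=solid}
import java.util.Arrays;
import java.util.List;

final class RegionRange {
  final byte[] startKey;
  final byte[] endKey;
  RegionRange(byte[] startKey, byte[] endKey) { this.startKey = startKey; this.endKey = endKey; }
}

final class RegionListChecker {
  /** Returns true if the start-key-sorted regions cover the whole key space with no hole. */
  static boolean hasNoHoles(List<RegionRange> sorted) {
    if (sorted.isEmpty()) return false;
    byte[] expectedStart = new byte[0];                             // a table starts at the empty key
    for (RegionRange r : sorted) {
      if (!Arrays.equals(r.startKey, expectedStart)) return false;  // gap or overlap
      expectedStart = r.endKey;
    }
    return expectedStart.length == 0;                               // last region must end at the empty key
  }
}
{code}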
[jira] [Commented] (HBASE-7735) Prevent regions from moving during online snapshot.
[ https://issues.apache.org/jira/browse/HBASE-7735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13606003#comment-13606003 ] terry zhang commented on HBASE-7735: Hi Jonathan, I wonder how we can make sure every region server's snapshot manager gets the correct set of regions when doing a snapshot. If some of the regions are moving or splitting, getOnlineRegions will miss them, the snapshot will lose data, and there will be a hole in meta after we restore the snapshot. Prevent regions from moving during online snapshot. --- Key: HBASE-7735 URL: https://issues.apache.org/jira/browse/HBASE-7735 Project: HBase Issue Type: Sub-task Reporter: Jonathan Hsieh To increase the probability of snapshots succeeding, we should attempt to prevent splits and region moves from happening. Currently we take region locks, but this could be too late and results in an aborted snapshot. We should probably take the table lock (0.96) when starting a snapshot, and for a 0.94 backport we should probably disable the balancer. This will probably not be tackled until after the trunk merge. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-7886) [replication] hlog zk node will not delete if client roll hlog
terry zhang created HBASE-7886: -- Summary: [replication] hlog zk node will not delete if client roll hlog Key: HBASE-7886 URL: https://issues.apache.org/jira/browse/HBASE-7886 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.4 Reporter: terry zhang Assignee: terry zhang If we use the hbase shell command hlog_roll on a regionserver which has replication configured, the hlog zk node under /hbase/replication/rs/1 cannot be deleted. This issue is caused by HBASE-6758. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7886) [replication] hlog zk node will not delete if client roll hlog
[ https://issues.apache.org/jira/browse/HBASE-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13581990#comment-13581990 ] terry zhang commented on HBASE-7886: The hlog zk node is deleted in shipEdits() or here:
{code:title=ReplicationSource.java|borderStyle=solid}
if (this.isActive() && (gotIOE || currentNbEntries == 0)) {
  if (this.lastLoggedPosition != this.position) {
    this.manager.logPositionAndCleanOldLogs(this.currentPath,
        this.peerClusterZnode, this.position, queueRecovered, currentWALisBeingWrittenTo);
    this.lastLoggedPosition = this.position;
  }
  if (sleepForRetries("Nothing to replicate", sleepMultiplier)) {
    sleepMultiplier++;
  }
  continue;
}
{code}
But after the HBASE-6758 patch, logPositionAndCleanOldLogs cannot delete the hlog zk node when currentWALisBeingWrittenTo is true. When the log is switched we can see:
{code:title=ReplicationSource.java|borderStyle=solid}
// If we didn't get anything and the queue has an object, it means we
// hit the end of the file for sure
return seenEntries == 0 && processEndOfFile(); // seenEntries is 0 when we run 'hlog_roll' in the shell
{code}
So ReplicationSource will continue and the hlog zk node cannot be deleted:
{code:title=ReplicationSource.java|borderStyle=solid}
if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo)) {
  continue;
}
{code}
[replication] hlog zk node will not delete if client roll hlog -- Key: HBASE-7886 URL: https://issues.apache.org/jira/browse/HBASE-7886 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.4 Reporter: terry zhang Assignee: terry zhang If we use the hbase shell command hlog_roll on a regionserver which has replication configured, the hlog zk node under /hbase/replication/rs/1 cannot be deleted. This issue is caused by HBASE-6758. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
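A minimal sketch of the fix direction discussed here, with WalQueue as a hypothetical stand-in for the real ReplicationSource internals: when a read pass returns zero entries because the WAL was rolled, treat it like end-of-file and still record the position and clean the old hlog znode before moving on.
{code:title=ReplicationStepSketch.java|borderStyle=solid}
// Illustrative sketch; WalQueue is a stand-in for the real ReplicationSource internals.
interface WalQueue {
  int readEntries();                     // how many edits were read from the current WAL
  boolean isCurrentLogRolled();          // has the WAL been rolled under us?
  void recordPositionAndCleanOldLogs();  // persist position and delete finished hlog znodes
  void advanceToNextLog();               // switch to the next WAL in the queue
  void shipEntries();                    // ship the edits that were read
}

final class ReplicationStep {
  static void replicateOnce(WalQueue q) {
    if (q.readEntries() == 0 && q.isCurrentLogRolled()) {
      // Nothing left in this WAL (e.g. hlog_roll with no traffic): treat it as end-of-file
      // so the finished hlog znode is cleaned up instead of being skipped forever.
      q.recordPositionAndCleanOldLogs();
      q.advanceToNextLog();
      return;
    }
    q.shipEntries();
  }
}
{code}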
[jira] [Commented] (HBASE-7886) [replication] hlog zk node will not delete if client roll hlog
[ https://issues.apache.org/jira/browse/HBASE-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13581993#comment-13581993 ] terry zhang commented on HBASE-7886: This issue can also be reproduced when no data is written to the cluster, which is the same situation as running 'hlog_roll' in the shell. [replication] hlog zk node will not delete if client roll hlog -- Key: HBASE-7886 URL: https://issues.apache.org/jira/browse/HBASE-7886 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.4 Reporter: terry zhang Assignee: terry zhang If we use the hbase shell command hlog_roll on a regionserver which has replication configured, the hlog zk node under /hbase/replication/rs/1 cannot be deleted. This issue is caused by HBASE-6758. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-7886) [replication] hlog zk node will not delete if client roll hlog
[ https://issues.apache.org/jira/browse/HBASE-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-7886: --- Status: Patch Available (was: Open) [replication] hlog zk node will not delete if client roll hlog -- Key: HBASE-7886 URL: https://issues.apache.org/jira/browse/HBASE-7886 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.4 Reporter: terry zhang Assignee: terry zhang if we use the hbase shell command hlog_roll on a regionserver which is configured replication. the Hlog zk node under /hbase/replication/rs/1 can not be deleted. this issue is caused by HBASE-6758. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-7886) [replication] hlog zk node will not delete if client roll hlog
[ https://issues.apache.org/jira/browse/HBASE-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-7886: --- Status: Open (was: Patch Available) [replication] hlog zk node will not delete if client roll hlog -- Key: HBASE-7886 URL: https://issues.apache.org/jira/browse/HBASE-7886 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.4 Reporter: terry zhang Assignee: terry zhang if we use the hbase shell command hlog_roll on a regionserver which is configured replication. the Hlog zk node under /hbase/replication/rs/1 can not be deleted. this issue is caused by HBASE-6758. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-7886) [replication] hlog zk node will not delete if client roll hlog
[ https://issues.apache.org/jira/browse/HBASE-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-7886: --- Attachment: HBASE-7886.patch [replication] hlog zk node will not delete if client roll hlog -- Key: HBASE-7886 URL: https://issues.apache.org/jira/browse/HBASE-7886 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.4 Reporter: terry zhang Assignee: terry zhang Attachments: HBASE-7886.patch if we use the hbase shell command hlog_roll on a regionserver which is configured replication. the Hlog zk node under /hbase/replication/rs/1 can not be deleted. this issue is caused by HBASE-6758. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-7886) [replication] hlog zk node will not be deleted if client roll hlog
[ https://issues.apache.org/jira/browse/HBASE-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-7886: --- Summary: [replication] hlog zk node will not be deleted if client roll hlog (was: [replication] hlog zk node will not delete if client roll hlog) [replication] hlog zk node will not be deleted if client roll hlog -- Key: HBASE-7886 URL: https://issues.apache.org/jira/browse/HBASE-7886 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.4 Reporter: terry zhang Assignee: terry zhang Attachments: HBASE-7886.patch if we use the hbase shell command hlog_roll on a regionserver which is configured replication. the Hlog zk node under /hbase/replication/rs/1 can not be deleted. this issue is caused by HBASE-6758. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6770) Allow scanner setCaching to specify size instead of number of rows
[ https://issues.apache.org/jira/browse/HBASE-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562557#comment-13562557 ] terry zhang commented on HBASE-6770: Hi Karthik Ranganathan, I saw this patch was checked in to the fb branch 0.89-fb last October. When are we going to check it in to trunk? This is a good feature to avoid RS OOM. Allow scanner setCaching to specify size instead of number of rows -- Key: HBASE-6770 URL: https://issues.apache.org/jira/browse/HBASE-6770 Project: HBase Issue Type: Sub-task Components: Client, regionserver Reporter: Karthik Ranganathan Assignee: Chen Jin Currently, we have the following APIs to customize the behavior of scans: setCaching() - how many rows to cache on the client to speed up scans; setBatch() - max columns to return per row to prevent a very large response. Ideally, we should be able to specify a memory buffer size because: 1. that would take care of both of these use cases. 2. it does not need any knowledge of the size of the rows or cells, as the final thing we are worried about is the available memory. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
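For illustration, a small self-contained sketch of the size-based idea (not the actual 0.89-fb patch): keep draining rows until the accumulated response size crosses a configured byte budget, instead of counting a fixed number of rows.
{code:title=SizeBoundedFetchSketch.java|borderStyle=solid}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class SizeBoundedFetcher {
  /** Drains rows from the iterator until maxResponseBytes is reached (always at least one row). */
  static List<byte[]> nextBatch(Iterator<byte[]> rows, long maxResponseBytes) {
    List<byte[]> batch = new ArrayList<>();
    long bytes = 0;
    while (rows.hasNext() && (batch.isEmpty() || bytes < maxResponseBytes)) {
      byte[] row = rows.next();
      batch.add(row);
      bytes += row.length; // track accumulated response size rather than row count
    }
    return batch;
  }
}
{code}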
[jira] [Created] (HBASE-7451) [snapshot] regionserver will be deadlock when GlobalSnapshotOperation timeout happen
terry zhang created HBASE-7451: -- Summary: [snapshot] regionserver will be deadlock when GlobalSnapshotOperation timeout happen Key: HBASE-7451 URL: https://issues.apache.org/jira/browse/HBASE-7451 Project: HBase Issue Type: Bug Components: snapshots Reporter: terry zhang Assignee: terry zhang Hi Matteo Bertozzi and Jesse Yates, My observation is base on code in github : https://github.com/matteobertozzi/hbase/ If we create a snapshot and meet regionserver timeout. Rs will be lock and can not put any data. Please take a look at log below : // regionserver snapshot timeout org.apache.hadoop.hbase.server.commit.distributed.DistributedCommitException: org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException: Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms at org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.wrap(DistributedThreePhaseCommitErrorDispatcher.java:135) at org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.operationTimeout(DistributedThreePhaseCommitErrorDispatcher.java:71) at org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:92) at org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:89) at org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer$1.run(OperationAttemptTimer.java:71) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) Caused by: org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException: Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms ... 3 more 2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase. 2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: Cleanup snapshot - handled in sub-tasks on error 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.serv //Waiting for 'commit allowed' latch and do not exist 2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase. 2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: Cleanup snapshot - handled in sub-tasks on error 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running finish phase. 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: Finish snapshot - handling in subtasks on error 2012-12-26 18:44:57,212 WARN org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer: Timer already marked completed, ignoring! 2012-12-26 18:45:01,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms) 2012-12-26 18:45:06,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms) 2012-12-26 18:45:11,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms) 2012-12-26 18:45:16,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. 
(sleep:5000 ms) 2012-12-26 18:45:17,002 INFO org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController: Received children changed event:/hbase-TERRY-73/online-snapshot/prepare 2012-12-26 18:45:17,002 INFO org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController: Recieved start event. 2012-12-26 18:45:17,002 DEBUG org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController: Looking for new operations under znode:/hbase-TERRY-73/online-snapshot/prepare 2012-12-26 18:45:17,003 INFO org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController: Received children changed event:/hbase-TERRY-73/online-snapshot/abort 2012-12-26 18:45:17,003 INFO org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController: Recieved abort event. 2012-12-26 18:45:17,003 DEBUG org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController: Checking for aborted operations on node:/hbase-TERRY-73/online-snapshot/abort 2012-12-26 18:45:21,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms) 2012-12-26 18:45:26,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms) 2012-12-26 18:45:31,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms) 2012-12-26 18:45:36,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit
[jira] [Commented] (HBASE-7451) [snapshot] regionserver will be deadlock when GlobalSnapshotOperation timeout happen
[ https://issues.apache.org/jira/browse/HBASE-7451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540414#comment-13540414 ] terry zhang commented on HBASE-7451: This is because GloballyConsistentRegionLockTask is extends from TwoPhaseCommit (not ThreePhaseCommit). So it do not have OperationAttemptTimer to handle timeout exception. So when GlobalSnapshotOperation meet timeout,below code will not be executed. and AllowCommit will not released. {code:title=GlobalSnapshotOperation.java|borderStyle=solid} @Override public void commit() throws DistributedCommitException { // Release all the locks taken on the involved regions if (ops == null || ops.size() == 0) { LOG.debug(No region operations to release from the snapshot because we didn't get a chance + to create them.); return; } LOG.debug(Releasing commit barrier for globally consistent snapshot.); for (RegionSnapshotOperation op : ops) { ((GloballyConsistentRegionLockTask) op).getAllowCommitLatch().countDown(); } // wait for all the outstanding tasks waitUntilDone(); } {code} So GloballyConsistentRegionLockTask will wait for ever. [snapshot] regionserver will be deadlock when GlobalSnapshotOperation timeout happen Key: HBASE-7451 URL: https://issues.apache.org/jira/browse/HBASE-7451 Project: HBase Issue Type: Bug Components: snapshots Reporter: terry zhang Assignee: terry zhang Hi Matteo Bertozzi and Jesse Yates, My observation is base on code in github : https://github.com/matteobertozzi/hbase/ If we create a snapshot and meet regionserver timeout. Rs will be lock and can not put any data. Please take a look at log below : // regionserver snapshot timeout org.apache.hadoop.hbase.server.commit.distributed.DistributedCommitException: org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException: Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms at org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.wrap(DistributedThreePhaseCommitErrorDispatcher.java:135) at org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.operationTimeout(DistributedThreePhaseCommitErrorDispatcher.java:71) at org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:92) at org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:89) at org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer$1.run(OperationAttemptTimer.java:71) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) Caused by: org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException: Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms ... 3 more 2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase. 2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: Cleanup snapshot - handled in sub-tasks on error 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.serv //Waiting for 'commit allowed' latch and do not exist 2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase. 
2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: Cleanup snapshot - handled in sub-tasks on error 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running finish phase. 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: Finish snapshot - handling in subtasks on error 2012-12-26 18:44:57,212 WARN org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer: Timer already marked completed, ignoring! 2012-12-26 18:45:01,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms) 2012-12-26 18:45:06,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms) 2012-12-26 18:45:11,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms) 2012-12-26 18:45:16,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms) 2012-12-26 18:45:17,002 INFO org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController: Received children changed event:/hbase-TERRY-73/online-snapshot/prepare 2012-12-26 18:45:17,002 INFO
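A minimal illustration of why a bounded wait would avoid the hang analyzed in the comment above: java.util.concurrent.CountDownLatch#await(timeout, unit) returns false on timeout, so the waiting task can abort and clean up instead of blocking forever. This is only a sketch of the idea, not the snapshot code itself.
{code:title=CommitBarrierSketch.java|borderStyle=solid}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class CommitBarrier {
  private final CountDownLatch allowCommit = new CountDownLatch(1);

  void allow() { allowCommit.countDown(); }

  /** Waits for the commit barrier, but gives up after timeoutMs instead of hanging the RS. */
  boolean awaitCommitAllowed(long timeoutMs) throws InterruptedException {
    boolean released = allowCommit.await(timeoutMs, TimeUnit.MILLISECONDS);
    // If 'released' is false the operation should abort/cleanup rather than wait forever.
    return released;
  }
}
{code}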
[jira] [Updated] (HBASE-7451) [snapshot] regionserver will be deadlock when GlobalSnapshotOperation timeout happened
[ https://issues.apache.org/jira/browse/HBASE-7451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-7451: --- Summary: [snapshot] regionserver will be deadlock when GlobalSnapshotOperation timeout happened (was: [snapshot] regionserver will be deadlock when GlobalSnapshotOperation timeout happen) [snapshot] regionserver will be deadlock when GlobalSnapshotOperation timeout happened -- Key: HBASE-7451 URL: https://issues.apache.org/jira/browse/HBASE-7451 Project: HBase Issue Type: Bug Components: snapshots Reporter: terry zhang Assignee: terry zhang Hi Matteo Bertozzi and Jesse Yates, My observation is base on code in github : https://github.com/matteobertozzi/hbase/ If we create a snapshot and meet regionserver timeout. Rs will be lock and can not put any data. Please take a look at log below : // regionserver snapshot timeout org.apache.hadoop.hbase.server.commit.distributed.DistributedCommitException: org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException: Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms at org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.wrap(DistributedThreePhaseCommitErrorDispatcher.java:135) at org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.operationTimeout(DistributedThreePhaseCommitErrorDispatcher.java:71) at org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:92) at org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:89) at org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer$1.run(OperationAttemptTimer.java:71) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) Caused by: org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException: Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms ... 3 more 2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase. 2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: Cleanup snapshot - handled in sub-tasks on error 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.serv //Waiting for 'commit allowed' latch and do not exist 2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase. 2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: Cleanup snapshot - handled in sub-tasks on error 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running finish phase. 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: Finish snapshot - handling in subtasks on error 2012-12-26 18:44:57,212 WARN org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer: Timer already marked completed, ignoring! 2012-12-26 18:45:01,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms) 2012-12-26 18:45:06,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms) 2012-12-26 18:45:11,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms) 2012-12-26 18:45:16,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. 
(sleep:5000 ms) 2012-12-26 18:45:17,002 INFO org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController: Received children changed event:/hbase-TERRY-73/online-snapshot/prepare 2012-12-26 18:45:17,002 INFO org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController: Recieved start event. 2012-12-26 18:45:17,002 DEBUG org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController: Looking for new operations under znode:/hbase-TERRY-73/online-snapshot/prepare 2012-12-26 18:45:17,003 INFO org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController: Received children changed event:/hbase-TERRY-73/online-snapshot/abort 2012-12-26 18:45:17,003 INFO org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController: Recieved abort event. 2012-12-26 18:45:17,003 DEBUG org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController: Checking for aborted operations on
[jira] [Commented] (HBASE-6802) Export Snapshot
[ https://issues.apache.org/jira/browse/HBASE-6802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501693#comment-13501693 ] terry zhang commented on HBASE-6802: Hi Jesse, where is your snapshot branch? Could you please give us the URL to check out the project? Thanks so much! Export Snapshot --- Key: HBASE-6802 URL: https://issues.apache.org/jira/browse/HBASE-6802 Project: HBase Issue Type: Sub-task Components: snapshots Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Fix For: hbase-6055, 0.96.0 Attachments: HBASE-6802-v1.patch Export a snapshot to another cluster. - Copy the .snapshot/name folder with all the references - Copy the hfiles/hlogs needed by the snapshot Once the other cluster has the files and the snapshot information it can restore the snapshot. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6695) [Replication] Data will lose if RegionServer down during transferqueue
[ https://issues.apache.org/jira/browse/HBASE-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-6695: --- Attachment: HBASE-6695-4trunk_v3.patch The trunk version of createEphemeralNodeAndWatch does not throw NodeExistsException when the node already exists. This patch adds a new function, createNoneExistEphemeralNodeAndWatch, that throws NodeExistsException when creating a node that already exists. [Replication] Data will lose if RegionServer down during transferqueue -- Key: HBASE-6695 URL: https://issues.apache.org/jira/browse/HBASE-6695 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang Priority: Critical Fix For: 0.96.0, 0.94.3 Attachments: HBASE-6695-4trunk.patch, HBASE-6695-4trunk_v2.patch, HBASE-6695-4trunk_v3.patch, HBASE-6695.patch When we were testing the replication failover feature, we found that if we kill a regionserver while it is transferring its queue, only part of the hlog znodes are copied to the right path because the failover process is interrupted. Log: 2012-08-29 12:20:05,660 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Moving dw92.kgb.sqa.cm4,60020,1346210789716's hlogs to my queue 2012-08-29 12:20:05,765 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213720708 with data 210508162 2012-08-29 12:20:05,850 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213886800 with data 2012-08-29 12:20:05,938 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213830559 with data 2012-08-29 12:20:06,055 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213775146 with data 2012-08-29 12:20:06,277 WARN org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Failed all from region=.META.,,1.1028785192, hostname=dw93.kgb.sqa.cm4, port=60020 java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) at java.util.concurrent.FutureTask.get(FutureTask.java:83) at .. This server is down. ZK node status: [zk: 10.232.98.77:2181(CONNECTED) 6] ls /hbase-test3-repl/replication/rs/dw92.kgb.sqa.cm4,60020,1346210789716 [lock, 1, 1-dw89.kgb.sqa.cm4,60020,1346202436268] dw92 is down, but node dw92.kgb.sqa.cm4,60020,1346210789716 can't be deleted -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
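For illustration, creating an ephemeral znode directly with the plain ZooKeeper client already surfaces KeeperException.NodeExistsException when the node is there, which is the behavior the new helper is meant to expose through the HBase ZK layer; the sketch below uses only the stock ZooKeeper API, not the HBase ZKUtil wrapper.
{code:title=EphemeralNodeSketch.java|borderStyle=solid}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

final class EphemeralNodes {
  /**
   * Creates an ephemeral znode and fails loudly if it already exists,
   * so a failover worker knows someone else already claimed the queue.
   */
  static void createOrFail(ZooKeeper zk, String path, byte[] data)
      throws KeeperException, InterruptedException {
    // ZooKeeper.create throws KeeperException.NodeExistsException if the path exists.
    zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
  }
}
{code}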
[jira] [Commented] (HBASE-6533) [replication] replication will block if WAL compress set differently in master and slave configuration
[ https://issues.apache.org/jira/browse/HBASE-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453788#comment-13453788 ] terry zhang commented on HBASE-6533: Yes, Daniel and Stack. Right now replication can't work in hlog compress mode, because compress mode needs to read the hlog sequentially to construct the compressionContext dictionary. But replication doesn't read the hlog entries one by one (it uses seek), so it can only get a tag (dictIdx) from the hlog; the original data does not exist in the compressionContext. Usually we get the error below: java.lang.IndexOutOfBoundsException: index (2) must be less than size (1) at com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:301) at com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:280) at org.apache.hadoop.hbase.regionserver.wal.LRUDictionary$BidirectionalLRUMap.get(LRUDictionary.java:122) at org.apache.hadoop.hbase.regionserver.wal.LRUDictionary$BidirectionalLRUMap.access$000(LRUDictionary.java:69) at org.apache.hadoop.hbase.regionserver.wal.LRUDictionary.getEntry(LRUDictionary.java:40) at org.apache.hadoop.hbase.regionserver.wal.Compressor.readCompressed(Compressor.java:111) at org.apache.hadoop.hbase.regionserver.wal.HLogKey.readFields(HLogKey.java:321) at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1851) at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1891) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:235) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:206) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:435) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:311) [replication] replication will block if WAL compress set differently in master and slave configuration -- Key: HBASE-6533 URL: https://issues.apache.org/jira/browse/HBASE-6533 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.0 Reporter: terry zhang Assignee: terry zhang Priority: Critical Fix For: 0.94.3 Attachments: hbase-6533.patch As we know, in hbase 0.94.0 we have the configuration below: <property> <name>hbase.regionserver.wal.enablecompression</name> <value>true</value> </property> If we enable it in the master cluster and disable it in the slave cluster, then replication will not work. It will throw unwrapRemoteException again and again in the master cluster.
2012-08-09 12:49:55,892 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Can't replicate because of an error on the remote cluster: java.io.IOException: IPC server unable to read call parameters: Error in readFields at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:635) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:365) Caused by: org.apache.hadoop.ipc.RemoteException: IPC server unable to read call parameters: Error in readFields at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:921) at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:151) at $Proxy13.replicateLogEntries(Unknown Source) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:616) ... 1 more This is because Slave cluster can not parse the hlog entry . 2012-08-09 14:46:05,891 WARN org.apache.hadoop.ipc.HBaseServer: Unable to read call parameters for client 10.232.98.89 java.io.IOException: Error in readFields at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:685) at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:586) at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:635)
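A tiny self-contained illustration of why seeking breaks dictionary-based WAL compression as described in the comment above: later entries are written as indexes into a dictionary built up by reading all earlier entries, so a reader that skips ahead cannot resolve the index. The class below is a toy, not the HBase LRUDictionary.
{code:title=ToyDictionaryCodec.java|borderStyle=solid}
import java.util.ArrayList;
import java.util.List;

/** Toy codec: first occurrence stores the literal value, later ones store only an index. */
final class ToyDictionaryCodec {
  private final List<String> dict = new ArrayList<>();

  String encode(String value) {
    int idx = dict.indexOf(value);
    if (idx >= 0) return "#" + idx;  // already seen: emit only the dictionary index
    dict.add(value);
    return value;                    // first time: emit the literal and remember it
  }

  String decode(String token) {
    if (token.startsWith("#")) {
      // Throws (or returns the wrong value) if earlier literals were skipped, which is
      // what happens when a replication reader seeks into the middle of a compressed hlog.
      return dict.get(Integer.parseInt(token.substring(1)));
    }
    dict.add(token);                 // literals rebuild the dictionary on the read side
    return token;
  }
}
{code}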
[jira] [Commented] (HBASE-6533) [replication] replication will block if WAL compress set differently in master and slave configuration
[ https://issues.apache.org/jira/browse/HBASE-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453791#comment-13453791 ] terry zhang commented on HBASE-6533: Can we disable hlog compress mode when we start replication?
{code:title=HRegionServer.java|borderStyle=solid}
if (!conf.getBoolean(HConstants.REPLICATION_ENABLE_KEY, false)) {
  return;
}
+if (conf.getBoolean(HConstants.ENABLE_WAL_COMPRESSION, false)) {
+  throw new RegionServerRunningException("Region server master cluster doesn't support "
+      + "Hlog working in compression mode!");
+}
// read in the name of the source replication class from the config file.
String sourceClassname = conf.get(HConstants.REPLICATION_SOURCE_SERVICE_CLASSNAME,
{code}
Or we need to change replication to not use seek when reading the hlog, and to not close the hlog again and again when we meet an EOF exception. Which one is better? [replication] replication will block if WAL compress set differently in master and slave configuration -- Key: HBASE-6533 URL: https://issues.apache.org/jira/browse/HBASE-6533 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.0 Reporter: terry zhang Assignee: terry zhang Priority: Critical Fix For: 0.94.3 Attachments: hbase-6533.patch As we know, in hbase 0.94.0 we have the configuration below: <property> <name>hbase.regionserver.wal.enablecompression</name> <value>true</value> </property> If we enable it in the master cluster and disable it in the slave cluster, then replication will not work. It will throw unwrapRemoteException again and again in the master cluster. 2012-08-09 12:49:55,892 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Can't replicate because of an error on the remote cluster: java.io.IOException: IPC server unable to read call parameters: Error in readFields at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:635) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:365) Caused by: org.apache.hadoop.ipc.RemoteException: IPC server unable to read call parameters: Error in readFields at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:921) at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:151) at $Proxy13.replicateLogEntries(Unknown Source) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:616) ... 1 more This is because the slave cluster cannot parse the hlog entry.
2012-08-09 14:46:05,891 WARN org.apache.hadoop.ipc.HBaseServer: Unable to read call parameters for client 10.232.98.89 java.io.IOException: Error in readFields at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:685) at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:586) at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:635) at org.apache.hadoop.hbase.ipc.Invocation.readFields(Invocation.java:125) at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:1292) at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:1207) at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:735) at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.doRunLoop(HBaseServer.java:524) at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:499) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180)
[jira] [Commented] (HBASE-6533) [replication] replication will be block if WAL compress set differently in master and slave configuration
[ https://issues.apache.org/jira/browse/HBASE-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13448537#comment-13448537 ] terry zhang commented on HBASE-6533: This happens because the master sends the hlog entries in compressed form, but the slave knows nothing about the compression, so when the slave's IPC HBaseServer deserializes the buffer and reads the hlog entry fields the error below occurs. We can let the master send the buffer uncompressed; then the slave works fine whether or not the master uses hlog compression.

[replication] replication will be blocked if WAL compression is set differently in the master and slave configuration - Key: HBASE-6533 URL: https://issues.apache.org/jira/browse/HBASE-6533 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.0 Reporter: terry zhang Priority: Critical

As we know, in HBase 0.94.0 there is the configuration property below:

<property>
  <name>hbase.regionserver.wal.enablecompression</name>
  <value>true</value>
</property>

If we enable it in the master cluster and disable it in the slave cluster, replication will not work. The master cluster throws unwrapRemoteException again and again:

2012-08-09 12:49:55,892 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Can't replicate because of an error on the remote cluster:
java.io.IOException: IPC server unable to read call parameters: Error in readFields
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
  at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
  at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
  at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:635)
  at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:365)
Caused by: org.apache.hadoop.ipc.RemoteException: IPC server unable to read call parameters: Error in readFields
  at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:921)
  at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:151)
  at $Proxy13.replicateLogEntries(Unknown Source)
  at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:616)
  ... 1 more

This is because the slave cluster cannot parse the hlog entry:

2012-08-09 14:46:05,891 WARN org.apache.hadoop.ipc.HBaseServer: Unable to read call parameters for client 10.232.98.89
java.io.IOException: Error in readFields
  at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:685)
  at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:586)
  at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:635)
  at org.apache.hadoop.hbase.ipc.Invocation.readFields(Invocation.java:125)
  at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:1292)
  at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:1207)
  at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:735)
  at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.doRunLoop(HBaseServer.java:524)
  at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:499)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException
  at java.io.DataInputStream.readFully(DataInputStream.java:180)
  at org.apache.hadoop.hbase.KeyValue.readFields(KeyValue.java:2254)
  at org.apache.hadoop.hbase.regionserver.wal.WALEdit.readFields(WALEdit.java:146)
  at org.apache.hadoop.hbase.regionserver.wal.HLog$Entry.readFields(HLog.java:1767)
  at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:682)
  ... 11 more

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
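The fix direction described in the comment above could look roughly like the sketch below. It is a sketch only, assuming the 0.94-era HLog.Entry/WALEdit Writable API (getKey()/getEdit()/getKeyValues()/add() and an (HLogKey, WALEdit) Entry constructor); a real patch would also have to make sure the HLogKey itself is serialized without a compression dictionary. The idea: before calling replicateLogEntries, rebuild every entry with a WALEdit that has no compression context attached, so the payload on the wire is plain Writable data any slave can read.

{code:title=UncompressedShipping.java (sketch)|borderStyle=solid}
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.regionserver.wal.HLog;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

public class UncompressedShipping {
  /**
   * Copy the KeyValues of every entry into a fresh WALEdit that carries no
   * compression context, so the serialized RPC payload does not depend on the
   * master's WAL compression setting.
   */
  public static HLog.Entry[] toUncompressed(HLog.Entry[] entries) {
    HLog.Entry[] out = new HLog.Entry[entries.length];
    for (int i = 0; i < entries.length; i++) {
      WALEdit plainEdit = new WALEdit();
      List<KeyValue> kvs = entries[i].getEdit().getKeyValues();
      for (KeyValue kv : kvs) {
        plainEdit.add(kv); // the KeyValues themselves are not dictionary-encoded here
      }
      out[i] = new HLog.Entry(entries[i].getKey(), plainEdit);
    }
    return out;
  }
}
{code}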
[jira] [Updated] (HBASE-6533) [replication] replication will be blocked if WAL compression is set differently in the master and slave configuration
[ https://issues.apache.org/jira/browse/HBASE-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-6533: --- Priority: Critical (was: Major)

[replication] replication will be blocked if WAL compression is set differently in the master and slave configuration - Key: HBASE-6533 URL: https://issues.apache.org/jira/browse/HBASE-6533 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.0 Reporter: terry zhang Priority: Critical

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6533) [replication] replication will be blocked if WAL compression is set differently in the master and slave configuration
[ https://issues.apache.org/jira/browse/HBASE-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-6533: --- Attachment: hbase-6533.patch

[replication] replication will be blocked if WAL compression is set differently in the master and slave configuration - Key: HBASE-6533 URL: https://issues.apache.org/jira/browse/HBASE-6533 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.0 Reporter: terry zhang Priority: Critical Attachments: hbase-6533.patch

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6533) [replication] replication will be blocked if WAL compression is set differently in the master and slave configuration
[ https://issues.apache.org/jira/browse/HBASE-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-6533: --- Fix Version/s: 0.94.3

[replication] replication will be blocked if WAL compression is set differently in the master and slave configuration - Key: HBASE-6533 URL: https://issues.apache.org/jira/browse/HBASE-6533 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.0 Reporter: terry zhang Priority: Critical Fix For: 0.94.3 Attachments: hbase-6533.patch

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6533) [replication] replication will be blocked if WAL compression is set differently in the master and slave configuration
[ https://issues.apache.org/jira/browse/HBASE-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-6533: --- Assignee: terry zhang

[replication] replication will be blocked if WAL compression is set differently in the master and slave configuration - Key: HBASE-6533 URL: https://issues.apache.org/jira/browse/HBASE-6533 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.0 Reporter: terry zhang Assignee: terry zhang Priority: Critical Fix For: 0.94.3 Attachments: hbase-6533.patch

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-6719) [replication] Data will be lost if opening an HLog fails more than maxRetriesMultiplier times
terry zhang created HBASE-6719: -- Summary: [replication] Data will be lost if opening an HLog fails more than maxRetriesMultiplier times Key: HBASE-6719 URL: https://issues.apache.org/jira/browse/HBASE-6719 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang Assignee: terry zhang Priority: Critical Fix For: 0.94.2

Please take a look at the code below:

{code:title=ReplicationSource.java|borderStyle=solid}
protected boolean openReader(int sleepMultiplier) {
  ...
  } catch (IOException ioe) {
    LOG.warn(peerClusterZnode + " Got: ", ioe);
    // TODO Need a better way to determine if a file is really gone but
    // TODO without scanning all logs dir
    if (sleepMultiplier == this.maxRetriesMultiplier) {
      LOG.warn("Waited too long for this file, considering dumping");
      return !processEndOfFile(); // opening the file failed more than maxRetriesMultiplier (default 10) times
    }
  }
  return true;
  ...
}

protected boolean processEndOfFile() {
  if (this.queue.size() != 0) { // the HLog is skipped here: data loss
    this.currentPath = null;
    this.position = 0;
    return true;
  } else if (this.queueRecovered) { // the failover replication source thread terminates here: data loss
    this.manager.closeRecoveredQueue(this);
    LOG.info("Finished recovering the queue");
    this.running = false;
    return true;
  }
  return false;
}
{code}

Sometimes HDFS runs into a problem while the HLog file itself is actually fine, so after HDFS comes back some data has been dropped and can never be found on the slave cluster.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
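One possible direction for the TODO in the snippet above (a sketch only; fs and currentPath are the existing ReplicationSource fields in 0.94): only fall through to processEndOfFile() once HDFS confirms the file is really gone, and otherwise keep retrying instead of dumping a live HLog.

{code:title=openReader catch block (sketch)|borderStyle=solid}
} catch (IOException ioe) {
  LOG.warn(peerClusterZnode + " Got: ", ioe);
  if (sleepMultiplier == this.maxRetriesMultiplier) {
    boolean reallyGone = false;
    try {
      reallyGone = !this.fs.exists(this.currentPath); // ask HDFS instead of guessing
    } catch (IOException e) {
      LOG.warn("Cannot check " + this.currentPath + ", will keep retrying", e);
    }
    if (reallyGone) {
      LOG.warn("Waited too long for this file, considering dumping");
      return !processEndOfFile();
    }
    LOG.error(this.currentPath + " still exists but cannot be opened, not skipping it");
  }
}
return true;
{code}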
[jira] [Updated] (HBASE-6719) [replication] Data will be lost if opening an HLog fails more than maxRetriesMultiplier times
[ https://issues.apache.org/jira/browse/HBASE-6719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-6719: --- Attachment: hbase-6719.patch

[replication] Data will be lost if opening an HLog fails more than maxRetriesMultiplier times - Key: HBASE-6719 URL: https://issues.apache.org/jira/browse/HBASE-6719 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang Assignee: terry zhang Priority: Critical Fix For: 0.94.2 Attachments: hbase-6719.patch

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6719) [replication] Data will be lost if opening an HLog fails more than maxRetriesMultiplier times
[ https://issues.apache.org/jira/browse/HBASE-6719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13448578#comment-13448578 ] terry zhang commented on HBASE-6719: I think we need to handle the IOException carefully and not skip the HLog unless it is really corrupted. We can log the failure as FATAL in the log and, only if we really have to, skip the HLog manually (by deleting its hlog zk node).

[replication] Data will be lost if opening an HLog fails more than maxRetriesMultiplier times - Key: HBASE-6719 URL: https://issues.apache.org/jira/browse/HBASE-6719 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang Assignee: terry zhang Priority: Critical Fix For: 0.94.2 Attachments: hbase-6719.patch

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6719) [replication] Data will be lost if opening an HLog fails more than maxRetriesMultiplier times
[ https://issues.apache.org/jira/browse/HBASE-6719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13448584#comment-13448584 ] terry zhang commented on HBASE-6719: Now we can handle it as below:

hlog size = 0, hlog queue = 0, recovery thread = yes: terminate the recovery thread (return !processEndOfFile())
hlog size = 0, hlog queue = 0, recovery thread = no: continue the loop (return !processEndOfFile())
hlog size = 0, hlog queue != 0, recovery thread = yes: skip the hlog (return !processEndOfFile())
hlog size = 0, hlog queue != 0, recovery thread = no: skip the hlog (return !processEndOfFile())
hlog size = 1, hlog queue = 0, recovery thread = yes: log it as a FATAL mistake in the regionserver's log
hlog size = 1, hlog queue = 0, recovery thread = no: log it as a FATAL mistake in the regionserver's log
hlog size = 1, hlog queue != 0, recovery thread = yes: log it as a FATAL mistake in the regionserver's log
hlog size = 1, hlog queue != 0, recovery thread = no: log it as a FATAL mistake in the regionserver's log

[replication] Data will be lost if opening an HLog fails more than maxRetriesMultiplier times - Key: HBASE-6719 URL: https://issues.apache.org/jira/browse/HBASE-6719 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang Assignee: terry zhang Priority: Critical Fix For: 0.94.2 Attachments: hbase-6719.patch

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
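A sketch of what that table could look like inside openReader()'s catch block. This is illustrative only: it reuses the existing fs/currentPath fields from the 0.94 ReplicationSource and leaves the queue-size/queueRecovered distinction to processEndOfFile(), which already implements it.

{code:title=openReader catch block implementing the table above (sketch)|borderStyle=solid}
if (sleepMultiplier == this.maxRetriesMultiplier) {
  long hlogSize = -1; // -1 = unknown (HDFS not answering)
  try {
    hlogSize = this.fs.getFileStatus(this.currentPath).getLen();
  } catch (FileNotFoundException fnfe) {
    hlogSize = 0; // really gone: same handling as an empty hlog
  } catch (IOException e) {
    LOG.warn("Cannot stat " + this.currentPath + ", will keep retrying", e);
  }
  if (hlogSize == 0) {
    // hlog size = 0: skip it, continue the loop, or terminate the recovered queue,
    // depending on queue size and queueRecovered (processEndOfFile() decides this).
    return !processEndOfFile();
  }
  if (hlogSize > 0) {
    // hlog size != 0: never drop it silently, flag it for manual handling instead.
    LOG.fatal("Cannot open non-empty hlog " + this.currentPath
        + "; replication for this queue is stuck until the hlog znode is handled manually");
  }
}
return true;
{code}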
[jira] [Commented] (HBASE-6719) [replication] Data will be lost if opening an HLog fails more than maxRetriesMultiplier times
[ https://issues.apache.org/jira/browse/HBASE-6719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13448586#comment-13448586 ] terry zhang commented on HBASE-6719: "hlog size = 1" above means the hlog size is not 0 (hlog size != 0).

[replication] Data will be lost if opening an HLog fails more than maxRetriesMultiplier times - Key: HBASE-6719 URL: https://issues.apache.org/jira/browse/HBASE-6719 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang Assignee: terry zhang Priority: Critical Fix For: 0.94.2 Attachments: hbase-6719.patch

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6695) [Replication] Data will be lost if a RegionServer goes down during transferQueues
[ https://issues.apache.org/jira/browse/HBASE-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-6695: --- Attachment: HBASE-6695-4trunk_v2.patch Check the regionserver stopper during the loop.

[Replication] Data will be lost if a RegionServer goes down during transferQueues -- Key: HBASE-6695 URL: https://issues.apache.org/jira/browse/HBASE-6695 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang Priority: Critical Fix For: 0.96.0, 0.94.3 Attachments: HBASE-6695-4trunk.patch, HBASE-6695-4trunk_v2.patch, HBASE-6695.patch

When we were testing the replication failover feature we found that, if we kill a regionserver while it is transferring its queue, only part of the hlog znodes are copied to the right path because the failover process is interrupted. Log:

2012-08-29 12:20:05,660 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Moving dw92.kgb.sqa.cm4,60020,1346210789716's hlogs to my queue
2012-08-29 12:20:05,765 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213720708 with data 210508162
2012-08-29 12:20:05,850 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213886800 with data
2012-08-29 12:20:05,938 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213830559 with data
2012-08-29 12:20:06,055 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213775146 with data
2012-08-29 12:20:06,277 WARN org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Failed all from region=.META.,,1.1028785192, hostname=dw93.kgb.sqa.cm4, port=60020
java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused
  at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
  at java.util.concurrent.FutureTask.get(FutureTask.java:83)
  at ..

This server is down.

ZK node status:
[zk: 10.232.98.77:2181(CONNECTED) 6] ls /hbase-test3-repl/replication/rs/dw92.kgb.sqa.cm4,60020,1346210789716
[lock, 1, 1-dw89.kgb.sqa.cm4,60020,1346202436268]

dw92 is down, but node dw92.kgb.sqa.cm4,60020,1346210789716 can't be deleted.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
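A rough sketch of the idea behind the v2 patch above, not the patch itself: bail out of the znode-copy loop as soon as this region server is itself stopping, instead of leaving a half-transferred queue behind. The stopper parameter and the listChildren/copyHlogZnode helpers are hypothetical; Stoppable is the existing HBase interface.

{code:title=copyQueuesFromRS with a stopper check (sketch)|borderStyle=solid}
SortedMap<String, SortedSet<String>> copyQueuesFromRS(String deadRsZnode, Stoppable stopper) {
  SortedMap<String, SortedSet<String>> queues = new TreeMap<String, SortedSet<String>>();
  for (String peerQueue : listChildren(deadRsZnode)) {            // listChildren(): hypothetical helper
    for (String hlog : listChildren(deadRsZnode + "/" + peerQueue)) {
      if (stopper.isStopped()) {
        // This region server is going down too: stop now rather than leaving a
        // partially copied queue; the next failover worker will retake it whole.
        return null;
      }
      copyHlogZnode(deadRsZnode, peerQueue, hlog, queues);        // copyHlogZnode(): hypothetical helper
    }
  }
  return queues;
}
{code}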
[jira] [Updated] (HBASE-6695) [Replication] Data will be lost if a RegionServer goes down during transferQueues
[ https://issues.apache.org/jira/browse/HBASE-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-6695: --- Attachment: HBASE-6695-4trunk.patch Add patch for trunk.

[Replication] Data will be lost if a RegionServer goes down during transferQueues -- Key: HBASE-6695 URL: https://issues.apache.org/jira/browse/HBASE-6695 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang Priority: Critical Fix For: 0.96.0, 0.94.3 Attachments: HBASE-6695-4trunk.patch, HBASE-6695.patch

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-6700) [replication] replication node will never be deleted if the copied newQueues size is 0
terry zhang created HBASE-6700: -- Summary: [replication] replication node will never be deleted if the copied newQueues size is 0 Key: HBASE-6700 URL: https://issues.apache.org/jira/browse/HBASE-6700 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang

Please check the code below:

{code:title=ReplicationSourceManager.java|borderStyle=solid}
// NodeFailoverWorker class
public void run() {
  ...
  LOG.info("Moving " + rsZnode + "'s hlogs to my queue");
  SortedMap<String, SortedSet<String>> newQueues =
      zkHelper.copyQueuesFromRS(rsZnode);   // *node created here*
  zkHelper.deleteRsQueues(rsZnode);
  if (newQueues == null || newQueues.size() == 0) {
    return;
  }
  ...
}

public void closeRecoveredQueue(ReplicationSourceInterface src) {
  LOG.info("Done with the recovered queue " + src.getPeerClusterZnode());
  this.oldsources.remove(src);
  this.zkHelper.deleteSource(src.getPeerClusterZnode(), false);   // *node deleted here*
}
{code}

So from the code we can see that if newQueues == null or newQueues.size() == 0, the failover replication source will never start and the failover zk node will never be deleted. E.g. the failover node below will never be deleted:

[zk: 10.232.98.77:2181(CONNECTED) 16] ls /hbase-test3-repl/replication/rs/dw93.kgb.sqa.cm4,60020,1346337383956/1-dw93.kgb.sqa.cm4,60020,1346309263932-dw91.kgb.sqa.cm4,60020,1346307150041-dw89.kgb.sqa.cm4,60020,1346307911711-dw93.kgb.sqa.cm4,60020,1346312019213-dw88.kgb.sqa.cm4,60020,1346311774939-dw89.kgb.sqa.cm4,60020,1346312314229-dw93.kgb.sqa.cm4,60020,1346312524307-dw88.kgb.sqa.cm4,60020,1346313203367-dw89.kgb.sqa.cm4,60020,1346313944402-dw88.kgb.sqa.cm4,60020,1346314214286-dw91.kgb.sqa.cm4,60020,1346315119613-dw93.kgb.sqa.cm4,60020,1346314186436-dw88.kgb.sqa.cm4,60020,1346315594396-dw89.kgb.sqa.cm4,60020,1346315909491-dw92.kgb.sqa.cm4,60020,1346315315634-dw89.kgb.sqa.cm4,60020,1346316742242-dw93.kgb.sqa.cm4,60020,1346317604055-dw92.kgb.sqa.cm4,60020,1346318098972-dw91.kgb.sqa.cm4,60020,1346317855650-dw93.kgb.sqa.cm4,60020,1346318532530-dw92.kgb.sqa.cm4,60020,1346318573238-dw89.kgb.sqa.cm4,60020,1346321299040-dw91.kgb.sqa.cm4,60020,1346321304393-dw92.kgb.sqa.cm4,60020,1346325755894-dw89.kgb.sqa.cm4,60020,1346326520895-dw91.kgb.sqa.cm4,60020,1346328246992-dw92.kgb.sqa.cm4,60020,1346327290653-dw93.kgb.sqa.cm4,60020,1346337303018-dw91.kgb.sqa.cm4,60020,1346337318929
[]
*= empty node*

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6700) [replication] replication node will never be deleted if the copied newQueues size is 0
[ https://issues.apache.org/jira/browse/HBASE-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-6700: ---

[replication] replication node will never be deleted if the copied newQueues size is 0 Key: HBASE-6700 URL: https://issues.apache.org/jira/browse/HBASE-6700 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang
[jira] [Updated] (HBASE-6700) [replication] replication node will never be deleted if the copied newQueues size is 0
[ https://issues.apache.org/jira/browse/HBASE-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-6700: ---

[replication] replication node will never be deleted if the copied newQueues size is 0 Key: HBASE-6700 URL: https://issues.apache.org/jira/browse/HBASE-6700 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang
[jira] [Commented] (HBASE-6695) [Replication] Data will be lost if a RegionServer goes down during transferQueues
[ https://issues.apache.org/jira/browse/HBASE-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13445727#comment-13445727 ] terry zhang commented on HBASE-6695: [~lhofhansl]
{noformat}
   logQueue.add(hlog);
+  ZKUtil.deleteNodeRecursively(this.zookeeper, z);
 }
{noformat}
If we delete the hlog zk node right after it is copied to the new RS, then one HLog won't be replayed by two region servers.

[Replication] Data will be lost if a RegionServer goes down during transferQueues -- Key: HBASE-6695 URL: https://issues.apache.org/jira/browse/HBASE-6695 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang Priority: Critical Fix For: 0.96.0, 0.94.3 Attachments: HBASE-6695.patch

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
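A minimal sketch of that ordering (readPosition/createUnderMyQueue are hypothetical helpers; ZKUtil.deleteNodeRecursively is the call from the diff above): each source hlog znode is deleted immediately after it has been copied under the new owner, so a crash in the middle of the transfer cannot leave the same HLog queued on two region servers.

{code:title=transfer ordering (sketch)|borderStyle=solid}
for (String hlog : hlogsOfDeadRs) {
  String z = deadRsZnode + "/" + peerId + "/" + hlog;
  long position = readPosition(z);               // last replicated position stored in the znode
  createUnderMyQueue(peerId, hlog, position);    // copy the hlog entry under this RS's queue first
  logQueue.add(hlog);
  // Delete the source znode immediately after the copy: from this point on only
  // the new owner of the queue can ever replay this HLog.
  ZKUtil.deleteNodeRecursively(this.zookeeper, z);
}
{code}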
[jira] [Updated] (HBASE-6700) [replication] replication node will never be deleted if the copied newQueues size is 0
[ https://issues.apache.org/jira/browse/HBASE-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-6700: --- Attachment: HBASE-6700.patch

[replication] replication node will never be deleted if the copied newQueues size is 0 Key: HBASE-6700 URL: https://issues.apache.org/jira/browse/HBASE-6700 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang Attachments: HBASE-6700.patch

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6700) [replication] replication node will never be deleted if the copied newQueues size is 0
[ https://issues.apache.org/jira/browse/HBASE-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13445734#comment-13445734 ] terry zhang commented on HBASE-6700: We can let the NodeFailoverWorker create the newClusterZnode only after checking the hlog size.

[replication] replication node will never be deleted if the copied newQueues size is 0 Key: HBASE-6700 URL: https://issues.apache.org/jira/browse/HBASE-6700 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang Attachments: HBASE-6700.patch

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
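A sketch of the reordering that comment suggests, assuming the queue-copy step is split so that the znodes under this region server are only created once the copied queue is known to be non-empty; readQueuesFromRS and createRecoveredQueues are hypothetical helpers.

{code:title=NodeFailoverWorker.run (sketch of the proposed ordering)|borderStyle=solid}
public void run() {
  ...
  LOG.info("Moving " + rsZnode + "'s hlogs to my queue");
  // 1. Read the dead RS's queues without creating anything under this RS yet.
  SortedMap<String, SortedSet<String>> newQueues = zkHelper.readQueuesFromRS(rsZnode);
  zkHelper.deleteRsQueues(rsZnode);
  if (newQueues == null || newQueues.size() == 0) {
    return; // nothing to recover, and no orphan znode was created
  }
  // 2. Only now create the recovered-queue znodes and start the failover sources,
  //    so closeRecoveredQueue() always has a znode to delete later.
  zkHelper.createRecoveredQueues(newQueues);
  ...
}
{code}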
[jira] [Updated] (HBASE-6695) [Replication] Data will be lost if a RegionServer goes down during transferQueues
[ https://issues.apache.org/jira/browse/HBASE-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-6695: --- Description: When we ware testing Replication failover feature we found if we kill a regionserver during it transferqueue ,we found only part of the hlog znode copy to the right path because failover process is interrupted. Log: 2012-08-29 12:20:05,660 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Moving dw92.kgb.sqa.cm4,60020,1346210789716's hlogs to my queue 2012-08-29 12:20:05,765 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C13462107 89716.1346213720708 with data 210508162 2012-08-29 12:20:05,850 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C13462107 89716.1346213886800 with data 2012-08-29 12:20:05,938 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213830559 with data 2012-08-29 12:20:06,055 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213775146 with data 2012-08-29 12:20:06,277 WARN org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Failed all from region=.ME TA.,,1.1028785192, hostname=dw93.kgb.sqa.cm4, port=60020 java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) at java.util.concurrent.FutureTask.get(FutureTask.java:83) at .. {color:red} This server is down . {color} ZK node status: [zk: 10.232.98.77:2181(CONNECTED) 6] ls /hbase-test3-repl/replication/rs/dw92.kgb.sqa.cm4,60020,1346210789716 [lock, 1, 1-dw89.kgb.sqa.cm4,60020,1346202436268] {color:red} dw92 is down , but Node dw92.kgb.sqa.cm4,60020,1346210789716 can't be deleted {color} was: When we ware testing Replication failover feature we found if we kill a regionserver during it transferqueue ,we found only part of the hlog znode copy to the right path because failover process is interrupted. Log: 2012-08-29 12:20:05,660 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Moving dw92.kgb.sqa.cm4,60020,1346210789716's hlogs to my queue 2012-08-29 12:20:05,765 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C13462107 89716.1346213720708 with data 210508162 2012-08-29 12:20:05,850 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C13462107 89716.1346213886800 with data 2012-08-29 12:20:05,938 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213830559 with data 2012-08-29 12:20:06,055 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213775146 with data 2012-08-29 12:20:06,277 WARN org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Failed all from region=.ME TA.,,1.1028785192, hostname=dw93.kgb.sqa.cm4, port=60020 java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) at java.util.concurrent.FutureTask.get(FutureTask.java:83) at .. {color:red} This server is down . 
{color} ZK node status: [zk: 10.232.98.77:2181(CONNECTED) 6] ls /hbase-test3-repl/replication/rs/dw92.kgb.sqa.cm4,60020,1346210789716 [lock, 1, 1-dw89.kgb.sqa.cm4,60020,1346202436268] {color:red} dw92 is down, but node dw92.kgb.sqa.cm4,60020,1346210789716 can't be deleted {color} [Replication] Data will lose if RegionServer down during transferqueue -- Key: HBASE-6695 URL: https://issues.apache.org/jira/browse/HBASE-6695 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang Priority: Critical When we were testing the replication failover feature we found that if we kill a regionserver while it is transferring a queue, only part of the hlog znodes are copied to the right path because the failover process is interrupted. Log: 2012-08-29 12:20:05,660 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Moving dw92.kgb.sqa.cm4,60020,1346210789716's hlogs to my queue 2012-08-29 12:20:05,765 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213720708 with data 210508162 2012-08-29 12:20:05,850 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating
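To make the window in the description concrete: copyQueuesFromRS and deleteRsQueues are two separate, non-atomic ZooKeeper steps, so a crash in between strands a half-copied queue. Below is a rough copy-verify-delete sketch under that assumption; it is not the HBASE-6695 patch (later HBase versions move this transfer to an atomic ZooKeeper multi, as far as I recall), and allHlogZnodesCopied() is a hypothetical verification helper.
{code:title=ReplicationSourceManager.java (sketch)|borderStyle=solid}
SortedMap<String, SortedSet<String>> newQueues = zkHelper.copyQueuesFromRS(rsZnode);
if (newQueues == null || !allHlogZnodesCopied(rsZnode, newQueues)) {  // hypothetical check
  // Copy was interrupted or incomplete: leave the dead RS's znode in place so a
  // later failover attempt can retry instead of silently dropping hlogs.
  return;
}
// Only remove the source queue once every hlog znode has been safely copied.
zkHelper.deleteRsQueues(rsZnode);
{code}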
[jira] [Commented] (HBASE-6652) [replication]replicationQueueSizeCapacity and replicationQueueNbCapacity default value is too big, Slave regionserver maybe outmemory after master start replication
[ https://issues.apache.org/jira/browse/HBASE-6652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13440981#comment-13440981 ] terry zhang commented on HBASE-6652: if we use the HBASE-6165 patch and don't set a custom queue size, replication will use the IPC call queue. So if hbase.regionserver.handler.count is set too high, the slave cluster region servers may run out of memory while replication is running. So can we change the replicationQueueSizeCapacity default value to 4M? [replication]replicationQueueSizeCapacity and replicationQueueNbCapacity default value is too big, Slave regionserver maybe outmemory after master start replication Key: HBASE-6652 URL: https://issues.apache.org/jira/browse/HBASE-6652 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang Assignee: terry zhang Currently replicationQueueSizeCapacity defaults to 64M and replicationQueueNbCapacity to 25000. So when a master cluster with many regionservers replicates to a small slave cluster, the slave RPC queue fills up and the regionserver runs out of memory. java.util.concurrent.ExecutionException: java.io.IOException: Call queue is full, is ipc.server.max.callqueue.size too small? at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) at java.util.concurrent.FutureTask.get(FutureTask.java:83) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1524) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1376) at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:700) at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.batch(HTablePool.java:361) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:172) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.replicateEntries(ReplicationSink.java:129) at org.apache.hadoop.hbase.replication.regionserver.Replication.replicateLogEntries(Replication.java:139) at org.apache.hadoop.hbase.regionserver.HRegionServer.replicateLogEntries(HRegionServer.java:4018) at sun.reflect.GeneratedMethodAccessor41.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:361) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1414) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
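If you want to try smaller values without a code change, the queue capacities are ordinary configuration knobs on the master cluster's regionservers. To the best of my recollection the 0.94 property names are replication.source.size.capacity and replication.source.nb.capacity (please verify them against ReplicationSource in your build before relying on this); a hedged hbase-site.xml sketch:
{code:xml}
<!-- Assumed 0.94 property names; verify against ReplicationSource in your build. -->
<property>
  <name>replication.source.size.capacity</name>
  <value>4194304</value><!-- 4MB per shipped batch instead of the 64MB default -->
</property>
<property>
  <name>replication.source.nb.capacity</name>
  <value>5000</value><!-- fewer entries per batch than the 25000 default -->
</property>
{code}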
[jira] [Commented] (HBASE-6652) [replication]replicationQueueSizeCapacity and replicationQueueNbCapacity default value is too big, Slave regionserver maybe outmemory after master start replication
[ https://issues.apache.org/jira/browse/HBASE-6652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13440985#comment-13440985 ] terry zhang commented on HBASE-6652: Another case that will cause the slave region server to OOM is when the master disables replication and restarts many times. When we re-enable replication, the master region servers will start many recovery threads (many zk nodes under replication/rs/xxx/), which still puts the slave RS under very heavy load. [replication]replicationQueueSizeCapacity and replicationQueueNbCapacity default value is too big, Slave regionserver maybe outmemory after master start replication Key: HBASE-6652 URL: https://issues.apache.org/jira/browse/HBASE-6652 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang Assignee: terry zhang Currently replicationQueueSizeCapacity defaults to 64M and replicationQueueNbCapacity to 25000. So when a master cluster with many regionservers replicates to a small slave cluster, the slave RPC queue fills up and the regionserver runs out of memory. java.util.concurrent.ExecutionException: java.io.IOException: Call queue is full, is ipc.server.max.callqueue.size too small? at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) at java.util.concurrent.FutureTask.get(FutureTask.java:83) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1524) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1376) at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:700) at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.batch(HTablePool.java:361) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:172) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.replicateEntries(ReplicationSink.java:129) at org.apache.hadoop.hbase.replication.regionserver.Replication.replicateLogEntries(Replication.java:139) at org.apache.hadoop.hbase.regionserver.HRegionServer.replicateLogEntries(HRegionServer.java:4018) at sun.reflect.GeneratedMethodAccessor41.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:361) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1414) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-6652) [replication]replicationQueueSizeCapacity and replicationQueueNbCapacity default value is too big, Slave regionserver maybe outmemory after master start replication
terry zhang created HBASE-6652: -- Summary: [replication]replicationQueueSizeCapacity and replicationQueueNbCapacity default value is too big, Slave regionserver maybe outmemory after master start replication Key: HBASE-6652 URL: https://issues.apache.org/jira/browse/HBASE-6652 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang Assignee: terry zhang Currently replicationQueueSizeCapacity defaults to 64M and replicationQueueNbCapacity to 25000. So when a master cluster with many regionservers replicates to a small slave cluster, the slave RPC queue fills up and the regionserver runs out of memory. java.util.concurrent.ExecutionException: java.io.IOException: Call queue is full, is ipc.server.max.callqueue.size too small? at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) at java.util.concurrent.FutureTask.get(FutureTask.java:83) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1524) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1376) at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:700) at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.batch(HTablePool.java:361) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:172) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.replicateEntries(ReplicationSink.java:129) at org.apache.hadoop.hbase.replication.regionserver.Replication.replicateLogEntries(Replication.java:139) at org.apache.hadoop.hbase.regionserver.HRegionServer.replicateLogEntries(HRegionServer.java:4018) at sun.reflect.GeneratedMethodAccessor41.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:361) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1414) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-6624) [Replication]currentNbOperations should set to 0 after update the shippedOpsRate
terry zhang created HBASE-6624: -- Summary: [Replication]currentNbOperations should set to 0 after update the shippedOpsRate Key: HBASE-6624 URL: https://issues.apache.org/jira/browse/HBASE-6624 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.0 Reporter: terry zhang Assignee: terry zhang Currently currentNbOperations is never reset to 0 and keeps increasing after replication starts. This value is used to calculate shippedOpsRate; if it is not reset to 0, shippedOpsRate is not correct. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
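A minimal sketch of the reset being proposed, assuming the rate is recorded at the end of shipEdits() as in the 0.94 code paths; this is illustrative rather than the attached jira-6624.patch, and updateShippedOpsRate() is a hypothetical stand-in for the existing metrics update.
{code:title=ReplicationSource.java (sketch)|borderStyle=solid}
protected void shipEdits() {
  // ... ship the current batch of edits to the slave cluster ...
  updateShippedOpsRate(this.currentNbOperations);  // hypothetical stand-in for the real metrics call
  this.currentNbOperations = 0;  // reset so the next batch's rate is not inflated by earlier batches
}
{code}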
[jira] [Updated] (HBASE-6624) [Replication]currentNbOperations should set to 0 after update the shippedOpsRate
[ https://issues.apache.org/jira/browse/HBASE-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-6624: --- Attachment: jira-6624.patch [Replication]currentNbOperations should set to 0 after update the shippedOpsRate Key: HBASE-6624 URL: https://issues.apache.org/jira/browse/HBASE-6624 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.0 Reporter: terry zhang Assignee: terry zhang Attachments: jira-6624.patch Currently currentNbOperations is never reset to 0 and keeps increasing after replication starts. This value is used to calculate shippedOpsRate; if it is not reset to 0, shippedOpsRate is not correct. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-6623) [replication] replication metrics value AgeOfLastShippedOp is not set correctly
terry zhang created HBASE-6623: -- Summary: [replication] replication metrics value AgeOfLastShippedOp is not set correctly Key: HBASE-6623 URL: https://issues.apache.org/jira/browse/HBASE-6623 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang Assignee: terry zhang Priority: Minor From the code below we can see that AgeOfLastShippedOp is not set correctly:
{code:title=ReplicationSource.java|borderStyle=solid}
// entriesArray init
public void init() {
  this.entriesArray = new HLog.Entry[this.replicationQueueNbCapacity];
  for (int i = 0; i < this.replicationQueueNbCapacity; i++) {
    this.entriesArray[i] = new HLog.Entry();
  }
}

// setting the metrics value should not use the array length
protected void shipEdits() {
  ...
  this.metrics.setAgeOfLastShippedOp(
      this.entriesArray[this.entriesArray.length - 1].getKey().getWriteTime());
  ...
}
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6623) [replication] replication metrics value AgeOfLastShippedOp is not set correctly
[ https://issues.apache.org/jira/browse/HBASE-6623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13438461#comment-13438461 ] terry zhang commented on HBASE-6623: we can use currentNbEntries instead of this.entriesArray.length [replication] replication metrics value AgeOfLastShippedOp is not set correctly --- Key: HBASE-6623 URL: https://issues.apache.org/jira/browse/HBASE-6623 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang Assignee: terry zhang Priority: Minor From the code below we can see that AgeOfLastShippedOp is not set correctly:
{code:title=ReplicationSource.java|borderStyle=solid}
// entriesArray init
public void init() {
  this.entriesArray = new HLog.Entry[this.replicationQueueNbCapacity];
  for (int i = 0; i < this.replicationQueueNbCapacity; i++) {
    this.entriesArray[i] = new HLog.Entry();
  }
}

// setting the metrics value should not use the array length
protected void shipEdits() {
  ...
  this.metrics.setAgeOfLastShippedOp(
      this.entriesArray[this.entriesArray.length - 1].getKey().getWriteTime());
  ...
}
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
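A sketch of that suggestion: index the last entry actually filled for this batch (currentNbEntries) instead of the last slot of the pre-allocated array, whose tail only holds the empty placeholder HLog.Entry objects created in init(). Illustrative only, not the attached jira-6623.patch.
{code:title=ReplicationSource.java (sketch)|borderStyle=solid}
protected void shipEdits() {
  ...
  if (this.currentNbEntries > 0) {
    // use the last entry read in this batch, not the last slot of the array
    this.metrics.setAgeOfLastShippedOp(
        this.entriesArray[this.currentNbEntries - 1].getKey().getWriteTime());
  }
  ...
}
{code}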
[jira] [Updated] (HBASE-6623) [replication] replication metrics value AgeOfLastShippedOp is not set correctly
[ https://issues.apache.org/jira/browse/HBASE-6623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-6623: --- Attachment: jira-6623.patch [replication] replication metrics value AgeOfLastShippedOp is not set correctly --- Key: HBASE-6623 URL: https://issues.apache.org/jira/browse/HBASE-6623 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.1 Reporter: terry zhang Assignee: terry zhang Priority: Minor Attachments: jira-6623.patch From the code below we can see that AgeOfLastShippedOp is not set correctly:
{code:title=ReplicationSource.java|borderStyle=solid}
// entriesArray init
public void init() {
  this.entriesArray = new HLog.Entry[this.replicationQueueNbCapacity];
  for (int i = 0; i < this.replicationQueueNbCapacity; i++) {
    this.entriesArray[i] = new HLog.Entry();
  }
}

// setting the metrics value should not use the array length
protected void shipEdits() {
  ...
  this.metrics.setAgeOfLastShippedOp(
      this.entriesArray[this.entriesArray.length - 1].getKey().getWriteTime());
  ...
}
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6533) [replication] replication will be block if WAL compress set differently in master and slave configuration
[ https://issues.apache.org/jira/browse/HBASE-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431640#comment-13431640 ] terry zhang commented on HBASE-6533: So sorry for creating so many duplicate issues because of an IE problem. Could anyone help me delete them? [replication] replication will be block if WAL compress set differently in master and slave configuration - Key: HBASE-6533 URL: https://issues.apache.org/jira/browse/HBASE-6533 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.0 Reporter: terry zhang As we know, in hbase 0.94.0 we have the configuration below: <property> <name>hbase.regionserver.wal.enablecompression</name> <value>true</value> </property> If we enable it on the master cluster and disable it on the slave cluster, replication will not work. It throws unwrapRemoteException again and again on the master cluster. 2012-08-09 12:49:55,892 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Can't replicate because of an error on the remote cluster: java.io.IOException: IPC server unable to read call parameters: Error in readFields at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:635) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:365) Caused by: org.apache.hadoop.ipc.RemoteException: IPC server unable to read call parameters: Error in readFields at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:921) at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:151) at $Proxy13.replicateLogEntries(Unknown Source) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:616) ... 1 more This is because the slave cluster cannot parse the hlog entry.
2012-08-09 14:46:05,891 WARN org.apache.hadoop.ipc.HBaseServer: Unable to read call parameters for client 10.232.98.89 java.io.IOException: Error in readFields at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:685) at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:586) at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:635) at org.apache.hadoop.hbase.ipc.Invocation.readFields(Invocation.java:125) at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:1292) at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:1207) at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:735) at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.doRunLoop(HBaseServer.java:524) at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:499) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at org.apache.hadoop.hbase.KeyValue.readFields(KeyValue.java:2254) at org.apache.hadoop.hbase.regionserver.wal.WALEdit.readFields(WALEdit.java:146) at org.apache.hadoop.hbase.regionserver.wal.HLog$Entry.readFields(HLog.java:1767) at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:682) ... 11 more -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6533) [replication] replication will be block if WAL compress set differently in master and slave configuration
[ https://issues.apache.org/jira/browse/HBASE-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431643#comment-13431643 ] terry zhang commented on HBASE-6533: Right now the only way to work around this issue is to set the master back to uncompressed mode, delete the zk node replication/rs, and restart the master cluster, because the replication slave does not support reading compressed hlogs. But if we have multiple master clusters and some of them have hlog compression enabled, then we cannot handle this situation. [replication] replication will be block if WAL compress set differently in master and slave configuration - Key: HBASE-6533 URL: https://issues.apache.org/jira/browse/HBASE-6533 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.0 Reporter: terry zhang As we know, in hbase 0.94.0 we have the configuration below: <property> <name>hbase.regionserver.wal.enablecompression</name> <value>true</value> </property> If we enable it on the master cluster and disable it on the slave cluster, replication will not work. It throws unwrapRemoteException again and again on the master cluster. 2012-08-09 12:49:55,892 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Can't replicate because of an error on the remote cluster: java.io.IOException: IPC server unable to read call parameters: Error in readFields at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:635) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:365) Caused by: org.apache.hadoop.ipc.RemoteException: IPC server unable to read call parameters: Error in readFields at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:921) at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:151) at $Proxy13.replicateLogEntries(Unknown Source) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:616) ... 1 more This is because the slave cluster cannot parse the hlog entry.
2012-08-09 14:46:05,891 WARN org.apache.hadoop.ipc.HBaseServer: Unable to read call parameters for client 10.232.98.89 java.io.IOException: Error in readFields at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:685) at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:586) at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:635) at org.apache.hadoop.hbase.ipc.Invocation.readFields(Invocation.java:125) at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:1292) at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:1207) at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:735) at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.doRunLoop(HBaseServer.java:524) at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:499) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at org.apache.hadoop.hbase.KeyValue.readFields(KeyValue.java:2254) at org.apache.hadoop.hbase.regionserver.wal.WALEdit.readFields(WALEdit.java:146) at org.apache.hadoop.hbase.regionserver.wal.HLog$Entry.readFields(HLog.java:1767) at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:682) ... 11 more -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators:
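Until mixed-compression replication is supported, the practical guard is simply to keep the flag identical on every cluster that participates in replication, using the property quoted in the description (either value works as long as it matches on both sides):
{code:xml}
<!-- Set the same value on the master cluster and on every slave cluster; a mismatch
     leaves the slave unable to parse the shipped hlog entries, as described above. -->
<property>
  <name>hbase.regionserver.wal.enablecompression</name>
  <value>false</value>
</property>
{code}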
[jira] [Commented] (HBASE-6453) Hbase Replication point in time feature
[ https://issues.apache.org/jira/browse/HBASE-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13422872#comment-13422872 ] terry zhang commented on HBASE-6453: Hi Stack, I think we can use the point-in-time feature together with the snapshots feature (HBASE-6055) in the case below. 1. The master cluster in the China data center takes a snapshot at time A. 2. Copy the snapshot to the slave cluster in the US data center and set the replication timestamp to A. 3. Restore the snapshot (HBASE-6230) on the slave cluster and start replication. Then the slave cluster's data will be the same as the master cluster's. The data is safer, and US users can get or scan from the slave cluster to reduce the load on the China data center. Enabling replication alone cannot control the exact cut-over time, so it may lose some data or replicate some useless data. MySQL also has a point-in-time/position feature in its replication framework; it is very convenient for data center administrators to use. We can find a better name for this operation since I am not good at naming ... Hbase Replication point in time feature --- Key: HBASE-6453 URL: https://issues.apache.org/jira/browse/HBASE-6453 Project: HBase Issue Type: New Feature Components: replication Affects Versions: 0.94.0 Reporter: terry zhang Assignee: terry zhang Attachments: hbase-6453-v1.patch Currently we cannot control when hbase replication starts to take effect. This patch lets us set a timestamp filter: all rows below this timestamp will not be replicated. We can also delete and show this timestamp in the hbase shell if we want to change it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-6453) Hbase Replication point in time feature
terry zhang created HBASE-6453: -- Summary: Hbase Replication point in time feature Key: HBASE-6453 URL: https://issues.apache.org/jira/browse/HBASE-6453 Project: HBase Issue Type: New Feature Components: replication Affects Versions: 0.94.0 Reporter: terry zhang Assignee: terry zhang Currently we cannot control when hbase replication starts to take effect. This patch lets us set a timestamp filter: all rows below this timestamp will not be replicated. We can also delete and show this timestamp in the hbase shell if we want to change it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6453) Hbase Replication point in time feature
[ https://issues.apache.org/jira/browse/HBASE-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-6453: --- Attachment: hbase-6453-v1.patch Hbase Replication point in time feature --- Key: HBASE-6453 URL: https://issues.apache.org/jira/browse/HBASE-6453 Project: HBase Issue Type: New Feature Components: replication Affects Versions: 0.94.0 Reporter: terry zhang Assignee: terry zhang Attachments: hbase-6453-v1.patch Currently we cannot control when hbase replication starts to take effect. This patch lets us set a timestamp filter: all rows below this timestamp will not be replicated. We can also delete and show this timestamp in the hbase shell if we want to change it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6453) Hbase Replication point in time feature
[ https://issues.apache.org/jira/browse/HBASE-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13422147#comment-13422147 ] terry zhang commented on HBASE-6453:
{code}
hbase(main):001:0> set_timefilter

ERROR: wrong number of arguments (0 for 2)

Here is some help for this command:
Set a peer cluster time filter to replicate to, the row which time stamp is before the timestamp will be filtered.
Examples:
  hbase> set_timefilter '1', 1329896850047
  hbase> set_timefilter '2', 1329896850047

hbase(main):002:0> set_timefilter '1', 1329896850047
0 row(s) in 0.3000 seconds
{code}
This sets the timestamp to 1329896850047. Then all the KVs earlier than 1329896850047 will be filtered. Hbase Replication point in time feature --- Key: HBASE-6453 URL: https://issues.apache.org/jira/browse/HBASE-6453 Project: HBase Issue Type: New Feature Components: replication Affects Versions: 0.94.0 Reporter: terry zhang Assignee: terry zhang Attachments: hbase-6453-v1.patch Currently we cannot control when hbase replication starts to take effect. This patch lets us set a timestamp filter: all rows below this timestamp will not be replicated. We can also delete and show this timestamp in the hbase shell if we want to change it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6453) Hbase Replication point in time feature
[ https://issues.apache.org/jira/browse/HBASE-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13422148#comment-13422148 ] terry zhang commented on HBASE-6453:
{code}
hbase(main):003:0> get_timefilter '1'
PEER_ID    TIME_FILTER
1          1329896850047
{code}
We can show the timestamp by cluster id and check whether it is set correctly. Hbase Replication point in time feature --- Key: HBASE-6453 URL: https://issues.apache.org/jira/browse/HBASE-6453 Project: HBase Issue Type: New Feature Components: replication Affects Versions: 0.94.0 Reporter: terry zhang Assignee: terry zhang Attachments: hbase-6453-v1.patch Currently we cannot control when hbase replication starts to take effect. This patch lets us set a timestamp filter: all rows below this timestamp will not be replicated. We can also delete and show this timestamp in the hbase shell if we want to change it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6453) Hbase Replication point in time feature
[ https://issues.apache.org/jira/browse/HBASE-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13422149#comment-13422149 ] terry zhang commented on HBASE-6453: We can also drop the time filter. After we drop it, the timestamp changes to zero and all the KVs will be replicated.
{code}
hbase(main):004:0> drop_timefilter '1'
0 row(s) in 0.0030 seconds

hbase(main):005:0> get_timefilter '1'
PEER_ID    TIME_FILTER
1          0
{code}
Hbase Replication point in time feature --- Key: HBASE-6453 URL: https://issues.apache.org/jira/browse/HBASE-6453 Project: HBase Issue Type: New Feature Components: replication Affects Versions: 0.94.0 Reporter: terry zhang Assignee: terry zhang Attachments: hbase-6453-v1.patch Currently we cannot control when hbase replication starts to take effect. This patch lets us set a timestamp filter: all rows below this timestamp will not be replicated. We can also delete and show this timestamp in the hbase shell if we want to change it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
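Conceptually the time filter just drops any KeyValue older than the per-peer cutoff before edits are shipped. The fragment below is a hypothetical illustration of that check on the source side (getPeerTimeFilter() is an assumed helper), not the actual hbase-6453-v1.patch, which also wires in the shell commands shown above:
{code:title=ReplicationSource.java (sketch)|borderStyle=solid}
long timeFilter = getPeerTimeFilter(peerId);  // hypothetical helper; 0 means replicate everything
if (timeFilter > 0) {
  for (HLog.Entry entry : entriesArray) {
    Iterator<KeyValue> it = entry.getEdit().getKeyValues().iterator();
    while (it.hasNext()) {
      if (it.next().getTimestamp() < timeFilter) {
        it.remove();  // KVs older than the cutoff are not shipped to this peer
      }
    }
  }
}
{code}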
[jira] [Created] (HBASE-6446) Replication source will throw EOF exception when hlog size is 0
terry zhang created HBASE-6446: -- Summary: Replication source will throw EOF exception when hlog size is 0 Key: HBASE-6446 URL: https://issues.apache.org/jira/browse/HBASE-6446 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.0 Reporter: terry zhang When the master cluster starts up, new hlogs with size 0 are created. If we then start replication, the replication source prints many EOF exceptions from openReader. I think we need to ignore this case and not print so many exception warning logs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6446) Replication source will throw EOF exception when hlog size is 0
[ https://issues.apache.org/jira/browse/HBASE-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13421310#comment-13421310 ] terry zhang commented on HBASE-6446: []$ hadoop dfs -ls /hbase-73/.logs/dw73.kgb.sqa.cm4,60020,1343114427581 Found 3 items -rw-r--r-- 3 wuting supergroup 578 2012-07-24 15:20 /hbase-73/.logs/dw73.kgb.sqa.cm4,60020,1343114427581/dw73.kgb.sqa.cm4%2C60020%2C1343114427581.1343114427921 -rw-r--r-- 3 wuting supergroup 399 2012-07-24 15:20 /hbase-73/.logs/dw73.kgb.sqa.cm4,60020,1343114427581/dw73.kgb.sqa.cm4%2C60020%2C1343114427581.1343114433385 {color:red} -rw-r--r-- 3 wuting supergroup 0 2012-07-24 15:20 /hbase-73/.logs/dw73.kgb.sqa.cm4,60020,1343114427581/dw73.kgb.sqa.cm4%2C60020%2C1343114427581.1343114434732 {color} 2012-07-24 15:24:55,516 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication dw73.kgb.sqa.cm4%2C60020%2C1343114427581.1343114434732 at 0 2012-07-24 15:24:55,521 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at java.io.DataInputStream.readFully(DataInputStream.java:152) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1465) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1437) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1424) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1419) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.init(SequenceFileLogReader.java:55) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175) at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:721) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:475) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:286) Replication source will throw EOF exception when hlog size is 0 --- Key: HBASE-6446 URL: https://issues.apache.org/jira/browse/HBASE-6446 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.0 Reporter: terry zhang When the master cluster starts up, new hlogs with size 0 are created. If we then start replication, the replication source prints many EOF exceptions from openReader. I think we need to ignore this case and not print so many exception warning logs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
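One defensive option, sketched under the assumption that openReader() can simply postpone a freshly rolled, still-empty hlog, is to check the file length before building the reader. Note this is not either of the attached patches, which gate and then lower the log level instead; the fields and calls below mirror the 0.94 code referenced in the stack trace above.
{code:title=ReplicationSource.java (sketch)|borderStyle=solid}
// Check the length first so a zero-byte hlog does not surface as an EOFException.
FileStatus stat = this.fs.getFileStatus(this.currentPath);
if (stat.getLen() == 0) {
  LOG.debug("Skipping zero-length hlog " + this.currentPath + " for now");
  return false;  // let the outer loop retry this path on its next iteration
}
this.reader = HLog.getReader(this.fs, this.currentPath, this.conf);
{code}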
[jira] [Updated] (HBASE-6446) Replication source will throw EOF exception when hlog size is 0
[ https://issues.apache.org/jira/browse/HBASE-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-6446: --- Attachment: hbase-6446.patch Replication source will throw EOF exception when hlog size is 0 --- Key: HBASE-6446 URL: https://issues.apache.org/jira/browse/HBASE-6446 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.0 Reporter: terry zhang Attachments: hbase-6446.patch When the master cluster starts up, new hlogs with size 0 are created. If we then start replication, the replication source prints many EOF exceptions from openReader. I think we need to ignore this case and not print so many exception warning logs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6446) Replication source will throw EOF exception when hlog size is 0
[ https://issues.apache.org/jira/browse/HBASE-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13421963#comment-13421963 ] terry zhang commented on HBASE-6446: Hi Daniel, what about checking the hlog queue length?
{code:title=ReplicationSource.java|borderStyle=solid}
catch (IOException ioe) {
  if (this.queue.size() != 0) {
    LOG.warn(peerClusterZnode + " Got: ", ioe);
  }
  // TODO Need a better way to determine if a file is really gone but
  // TODO without scanning all logs dir
  if (sleepMultiplier == this.maxRetriesMultiplier) {
    LOG.warn("Waited too long for this file, considering dumping");
    return !processEndOfFile();
  }
}
{code}
This can prevent the warning exception from being printed again and again when the master RS starts up and no data has been written yet. Replication source will throw EOF exception when hlog size is 0 --- Key: HBASE-6446 URL: https://issues.apache.org/jira/browse/HBASE-6446 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.0 Reporter: terry zhang Attachments: hbase-6446.patch When the master cluster starts up, new hlogs with size 0 are created. If we then start replication, the replication source prints many EOF exceptions from openReader. I think we need to ignore this case and not print so many exception warning logs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6446) Replication source will throw EOF exception when hlog size is 0
[ https://issues.apache.org/jira/browse/HBASE-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] terry zhang updated HBASE-6446: --- Attachment: hbase-6446-v2.patch OK, Daniel. Let's change it from warning level to debug level to prevent the warnings in an online cluster. Replication source will throw EOF exception when hlog size is 0 --- Key: HBASE-6446 URL: https://issues.apache.org/jira/browse/HBASE-6446 Project: HBase Issue Type: Bug Components: replication Affects Versions: 0.94.0 Reporter: terry zhang Attachments: hbase-6446-v2.patch, hbase-6446.patch When the master cluster starts up, new hlogs with size 0 are created. If we then start replication, the replication source prints many EOF exceptions from openReader. I think we need to ignore this case and not print so many exception warning logs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira