[jira] [Commented] (HBASE-6430) Few modifications in section 2.4.2.1 of Apache HBase Reference Guide
[ https://issues.apache.org/jira/browse/HBASE-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418749#comment-13418749 ] Mohammad Tariq Iqbal commented on HBASE-6430: - Thanks a lot for the support, stack. I'll go through the link you provided. I have made the following changes, in case the attachment was ambiguous (I should have done it beforehand. My bad):
1. Addition of the 'core-site.xml' file to point out how to give the value of the 'hbase.rootdir' property so that the HMaster can contact the NameNode properly.
2. /etc/hosts file modification to avoid the loopback problem (proper DNS resolution is very important for HBase to work properly).
3. Modification of the hbase-env.sh file to enable the use of HBase's ZooKeeper.
4. Addition of the 'hbase.cluster.distributed' and 'hbase.zookeeper.property.clientPort' properties in conf/hbase-site.xml.
5. Copying hadoop-core-*.jar and commons-collections-3.2.1.jar from the HADOOP_HOME/lib folder into the HBASE_HOME/lib folder to avoid any compatibility issues between Hadoop and HBase.
Apologies for my ignorance. Many thanks.
Few modifications in section 2.4.2.1 of Apache HBase Reference Guide
Key: HBASE-6430 URL: https://issues.apache.org/jira/browse/HBASE-6430 Project: HBase Issue Type: Improvement Reporter: Mohammad Tariq Iqbal Priority: Minor Attachments: HBASE-6430.txt
Quite often, newbies face some issues while configuring HBase in pseudo-distributed mode. I was no exception. I would like to propose some solutions for these problems which worked for me. If the community finds it appropriate, I would like to apply the patch for the same. This is the first time I am trying to do something like this, so please pardon me if I have put it in an inappropriate manner. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
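The hbase-site.xml changes described in points 1 and 4 of the comment above typically look roughly like the following. This is an illustrative sketch only; the hostname, port, and property values here are assumptions for a single-node pseudo-distributed setup, not taken from the HBASE-6430 attachment:

```xml
<!-- conf/hbase-site.xml: illustrative pseudo-distributed settings -->
<configuration>
  <!-- hbase.rootdir must match the NameNode address configured in
       Hadoop's core-site.xml (fs.default.name); hostname/port here
       are examples only. -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <!-- Run HBase against HDFS rather than the local filesystem. -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!-- Port that clients use to reach the ZooKeeper ensemble. -->
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
```

The /etc/hosts change from point 2 usually amounts to making sure the machine's hostname does not resolve to 127.0.1.1, which trips up HBase's DNS checks on some distributions.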
[jira] [Commented] (HBASE-6389) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
[ https://issues.apache.org/jira/browse/HBASE-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418750#comment-13418750 ] Aditya Kishore commented on HBASE-6389: --- @Stack No, the current patch does not modify the way a live RS is evaluated, but it ensures that the dying RS's thread is actually dead before moving forward.
{quote}
What is the below change doing?
- conf.setInt("hbase.master.wait.on.regionservers.mintostart", numSlaves);
- conf.setInt("hbase.master.wait.on.regionservers.maxtostart", numSlaves);
+ String count = String.valueOf(numSlaves);
+ conf.setIfUnset("hbase.master.wait.on.regionservers.mintostart", count);
+ conf.setIfUnset("hbase.master.wait.on.regionservers.maxtostart", count);
{quote}
This change was to preserve the values of 'mintostart' and 'maxtostart' in the configuration if the caller of HBaseTestingUtility.startMiniHBaseCluster(int, int) has set them (which was the case with the TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS failure).
Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
Key: HBASE-6389 URL: https://issues.apache.org/jira/browse/HBASE-6389 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.0, 0.96.0 Reporter: Aditya Kishore Assignee: Aditya Kishore Priority: Critical Fix For: 0.96.0, 0.94.2 Attachments: HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, org.apache.hadoop.hbase.TestZooKeeper-output.txt
Continuing from HBASE-6375. It seems I was mistaken in my assumption that changing the value of hbase.master.wait.on.regionservers.mintostart to a sufficient number (from the default of 1) can help prevent assignment of all regions to one (or a small number of) region server(s). While this was the case in 0.90.x and 0.92.x, the behavior has changed from 0.94.0 onwards to address HBASE-4993. 
From 0.94.0 onwards, the Master will proceed immediately after the timeout has lapsed, even if hbase.master.wait.on.regionservers.mintostart has not been reached. Reading the current conditions of waitForRegionServers() clarifies it:
{code:title=ServerManager.java (trunk rev:1360470)}
581   /**
582    * Wait for the region servers to report in.
583    * We will wait until one of this condition is met:
584    *  - the master is stopped
585    *  - the 'hbase.master.wait.on.regionservers.timeout' is reached
586    *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
587    *    region servers is reached
588    *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
589    *    there have been no new region server in for
590    *    'hbase.master.wait.on.regionservers.interval' time
591    *
592    * @throws InterruptedException
593    */
594   public void waitForRegionServers(MonitoredTask status)
595   throws InterruptedException {
612     while (
613       !this.master.isStopped() &&
614       slept < timeout &&
615       count < maxToStart &&
616       (lastCountChange+interval > now || count < minToStart)
617     ){
{code}
So with the current conditions, the wait will end as soon as the timeout is reached even if a lesser number of RSes have checked in with the Master, and the master will proceed with the region assignment among these RSes alone. As mentioned in [HBASE-4993|https://issues.apache.org/jira/browse/HBASE-4993?focusedCommentId=13237196#comment-13237196], and I concur, this could have a disastrous effect in a large cluster, especially now that MSLAB is turned on. To enforce the required quorum as specified by hbase.master.wait.on.regionservers.mintostart irrespective of the timeout, these conditions need to be modified as follows:
{code:title=ServerManager.java}
..
  /**
   * Wait for the region servers to report in.
   * We will wait until one of this condition is met:
   *  - the master is stopped
   *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
   *    region servers is reached
   *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
   *    there have been no new region server in for
   *    'hbase.master.wait.on.regionservers.interval' time AND
   *    the 'hbase.master.wait.on.regionservers.timeout' is reached
   *
   * @throws InterruptedException
   */
  public void waitForRegionServers(MonitoredTask status)
..
..
    int minToStart = this.master.getConfiguration().
            getInt("hbase.master.wait.on.regionservers.mintostart", 1);
    int maxToStart = this.master.getConfiguration().
            getInt("hbase.master.wait.on.regionservers.maxtostart", Integer.MAX_VALUE);
    if (maxToStart < minToStart) {
      maxToStart = minToStart;
    }
..
..
    while (
      !this.master.isStopped() &&
      count < maxToStart &&
      (lastCountChange+interval > now || timeout > slept || count < minToStart)
    ){
..
{code}
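The difference between the two loop conditions can be captured as pure predicates. The sketch below is for illustration only: the parameter names mirror the local variables in waitForRegionServers(), but this is not the HBase code itself.

```java
// Sketch: the master keeps waiting for region servers while the
// predicate returns true.
public class WaitPredicates {

    // Current (0.94+) condition: the timeout alone can end the wait,
    // even if fewer than minToStart region servers have checked in.
    static boolean currentKeepWaiting(boolean stopped, long slept, long timeout,
            int count, int minToStart, int maxToStart,
            long lastCountChange, long interval, long now) {
        return !stopped
                && slept < timeout
                && count < maxToStart
                && (lastCountChange + interval > now || count < minToStart);
    }

    // Proposed condition: minToStart is enforced irrespective of the
    // timeout; the timeout only ends the wait once the minimum quorum
    // has been reached and no new RS has checked in recently.
    static boolean proposedKeepWaiting(boolean stopped, long slept, long timeout,
            int count, int minToStart, int maxToStart,
            long lastCountChange, long interval, long now) {
        return !stopped
                && count < maxToStart
                && (lastCountChange + interval > now
                        || timeout > slept
                        || count < minToStart);
    }

    public static void main(String[] args) {
        // Timeout lapsed, only 1 of a required 3 RSes checked in, and no
        // recent check-in: the current condition gives up, the proposed
        // condition keeps waiting for the quorum.
        System.out.println(currentKeepWaiting(false, 5000, 4500, 1, 3, 10, 0, 1500, 100000));
        System.out.println(proposedKeepWaiting(false, 5000, 4500, 1, 3, 10, 0, 1500, 100000));
    }
}
```

With the proposed predicate, a lapsed timeout no longer ends the wait while `count < minToStart` still holds, which is exactly the quorum guarantee the issue asks for.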
[jira] [Resolved] (HBASE-6325) [replication] Race in ReplicationSourceManager.init can initiate a failover even if the node is alive
[ https://issues.apache.org/jira/browse/HBASE-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans resolved HBASE-6325. --- Resolution: Fixed Fix Version/s: (was: 0.90.8) Hadoop Flags: Reviewed Committed to 0.92, 0.94 and trunk. Not caring about 0.90 either.
[replication] Race in ReplicationSourceManager.init can initiate a failover even if the node is alive
-
Key: HBASE-6325 URL: https://issues.apache.org/jira/browse/HBASE-6325 Project: HBase Issue Type: Bug Affects Versions: 0.90.6, 0.92.1, 0.94.0 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Fix For: 0.92.2, 0.96.0, 0.94.1 Attachments: HBASE-6325-0.92-v2.patch, HBASE-6325-0.92.patch
Yet another bug found during the leap second madness: it's possible to miss the registration of new region servers so that in ReplicationSourceManager.init we start the failover of a live and replicating region server. I don't think there's data loss, but the RS that's being failed over will die on:
{noformat}
2012-07-01 06:25:15,604 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server sv4r23s48,10304,1341112194623: Writing replication status
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/sv4r23s48,10304,1341112194623/4/sv4r23s48%2C10304%2C1341112194623.1341112195369
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1246)
	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
	at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:655)
	at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:697)
	at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:470)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:607)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:368)
{noformat}
It seems to me that just refreshing {{otherRegionServers}} after getting the list of {{currentReplicators}} would be enough to fix this.
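The one-line fix suggested above — refresh the live-server list after reading the replicator list — can be sketched as follows. This is purely illustrative: `deadReplicators` is a hypothetical stand-in for the set difference that ReplicationSourceManager.init computes from its two ZooKeeper reads, not actual HBase code.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the failover decision race in ReplicationSourceManager.init.
// A replicator with no matching live server is considered dead and gets
// failed over. If the live-server list is stale (read *before* the
// replicator list), a freshly registered, live RS can be failed over.
public class FailoverSketch {

    static List<String> deadReplicators(List<String> replicators,
                                        List<String> liveServers) {
        List<String> dead = new ArrayList<>(replicators);
        dead.removeAll(liveServers);  // anything not in the live list is "dead"
        return dead;
    }

    public static void main(String[] args) {
        List<String> staleLive = List.of("rs1");          // snapshot taken too early
        List<String> replicators = List.of("rs1", "rs2"); // rs2 registered in between
        List<String> freshLive = List.of("rs1", "rs2");   // re-read after the replicators

        // Stale ordering: live rs2 is wrongly selected for failover.
        System.out.println(deadReplicators(replicators, staleLive));
        // Fixed ordering: refresh the live list after reading the replicators.
        System.out.println(deadReplicators(replicators, freshLive));
    }
}
```

The fix is thus an ordering constraint, not new logic: the live-server snapshot must be at least as fresh as the replicator snapshot it is diffed against.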
[jira] [Commented] (HBASE-6389) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
[ https://issues.apache.org/jira/browse/HBASE-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418752#comment-13418752 ] Zhihong Ted Yu commented on HBASE-6389: --- Looking at https://builds.apache.org/job/PreCommit-HBASE-Build/2406/console, there was still some hanging test, although I wasn't able to find which test hung.
[jira] [Commented] (HBASE-4470) ServerNotRunningException coming out of assignRootAndMeta kills the Master
[ https://issues.apache.org/jira/browse/HBASE-4470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418753#comment-13418753 ] Hadoop QA commented on HBASE-4470: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12537246/HBASE-4470-v2-trunk.patch against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings).
-1 findbugs. The patch appears to introduce 12 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2413//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2413//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2413//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2413//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2413//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2413//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2413//console
This message is automatically generated. 
ServerNotRunningException coming out of assignRootAndMeta kills the Master
--
Key: HBASE-4470 URL: https://issues.apache.org/jira/browse/HBASE-4470 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: Gregory Chanan Priority: Critical Fix For: 0.90.7 Attachments: HBASE-4470-90.patch, HBASE-4470-v2-90.patch, HBASE-4470-v2-92_94.patch, HBASE-4470-v2-trunk.patch
I'm surprised we still have issues like that, and I didn't get a hit while googling, so forgive me if there's already a jira about it. When the master starts it verifies the locations of root and meta before assigning them; if the server is started but not running you'll get this:
{quote}
2011-09-23 04:47:44,859 WARN org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: RemoteException connecting to RS
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.ipc.ServerNotRunningException: Server is not running yet
	at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1038)
	at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
	at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
	at $Proxy6.getProtocolVersion(Unknown Source)
	at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:419)
	at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:393)
	at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:444)
	at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:349)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:969)
	at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:388)
	at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:287)
	at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyMetaRegionLocation(CatalogTracker.java:484)
	at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:441)
	at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:388)
	at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:282)
{quote}
I hit that 3-4 times this week while debugging something else. The worst is that when you restart the master it sees that as a failover, but none of the regions are assigned, so it takes an eternity to get back fully online.
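A common way to harden a startup path against this kind of transient "server up but not yet serving" state is to retry the verification for a bounded period instead of letting the exception kill the master. A minimal sketch, with the caveat that the exception type, retry count, and back-off below are illustrative assumptions, not the actual HBASE-4470 patch:

```java
import java.util.concurrent.Callable;

// Sketch: retry a verification call a bounded number of times when the
// remote server is up but not yet serving, instead of aborting.
public class RetryVerify {

    // Hypothetical stand-in for ServerNotRunningException.
    static class ServerNotRunningYet extends Exception {}

    static <T> T withRetries(Callable<T> call, int attempts, long sleepMs)
            throws Exception {
        Exception last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return call.call();
            } catch (ServerNotRunningYet e) {
                last = e;               // transient: server still starting up
                Thread.sleep(sleepMs);  // back off before retrying
            }
        }
        throw last;  // give up only after exhausting the retry budget
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails twice with the transient exception, then succeeds.
        boolean ok = withRetries(() -> {
            if (calls[0]++ < 2) throw new ServerNotRunningYet();
            return true;
        }, 5, 10L);
        System.out.println(ok);
    }
}
```

The key design point is that only the known-transient exception is retried; anything else still propagates, so real failures are not masked.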
[jira] [Resolved] (HBASE-6319) ReplicationSource can call terminate on itself and deadlock
[ https://issues.apache.org/jira/browse/HBASE-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans resolved HBASE-6319. --- Resolution: Fixed Fix Version/s: (was: 0.90.8) Hadoop Flags: Reviewed Committed to 0.92 and 0.94, skipping 0.90 like HBASE-6325. Trunk was already fixed.
ReplicationSource can call terminate on itself and deadlock
---
Key: HBASE-6319 URL: https://issues.apache.org/jira/browse/HBASE-6319 Project: HBase Issue Type: Bug Affects Versions: 0.90.6, 0.92.1, 0.94.0 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Fix For: 0.92.2, 0.94.1 Attachments: HBASE-6319-0.92.patch
In a few places the ReplicationSource code calls terminate() on itself, which is a problem since in terminate() we wait on that thread to die.
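The deadlock pattern here is the classic self-join: a thread calls a shutdown method which in turn join()s that same thread, and the join never returns. A guard on Thread.currentThread() sidesteps it. The sketch below illustrates the pattern and the guard; it is not the HBASE-6319 patch itself:

```java
// Sketch: terminate() must not join() the thread it is called from.
public class SelfTerminate extends Thread {

    private volatile boolean stopping = false;

    @Override
    public void run() {
        // Simulate the source hitting a fatal condition and shutting
        // itself down from its own run loop.
        terminate();
    }

    public void terminate() {
        stopping = true;
        interrupt();
        // Without this guard, a self-call would join() the current
        // thread and block forever, since run() can never finish
        // while it is stuck inside join().
        if (Thread.currentThread() != this) {
            try {
                join(5000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SelfTerminate t = new SelfTerminate();
        t.start();
        t.join(5000);  // returns promptly because run() never self-joins
        System.out.println(!t.isAlive());
    }
}
```

An equivalent fix is to never call the blocking shutdown from the worker thread at all and instead set a stop flag; either way the invariant is the same: a thread must not wait on its own death.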
[jira] [Updated] (HBASE-5966) MapReduce based tests broken on Hadoop 2.0.0-alpha
[ https://issues.apache.org/jira/browse/HBASE-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gregory Chanan updated HBASE-5966: -- Attachment: HBASE-5966-94.patch
Attached patch for 0.94. Ran TestTableMapReduce against both the 1.0 and 2.0 hadoop profiles; both passed:
{noformat}
mvn test -PlocalTests -Dtest=org.apache.hadoop.hbase.mapreduce.TestTableMapReduce
--- T E S T S ---
Running org.apache.hadoop.hbase.mapreduce.TestTableMapReduce
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 188.087 sec
Results :
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

mvn test -PlocalTests -Dhadoop.profile=2.0 -Dtest=org.apache.hadoop.hbase.mapreduce.TestTableMapReduce
--- T E S T S ---
Running org.apache.hadoop.hbase.mapreduce.TestTableMapReduce
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 167.49 sec
Results :
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
{noformat}
MapReduce based tests broken on Hadoop 2.0.0-alpha
--
Key: HBASE-5966 URL: https://issues.apache.org/jira/browse/HBASE-5966 Project: HBase Issue Type: Bug Components: mapred, mapreduce, test Affects Versions: 0.94.0, 0.96.0 Environment: Hadoop 2.0.0-alpha-SNAPSHOT, HBase 0.94.0-SNAPSHOT, Ubuntu 12.04 LTS (GNU/Linux 3.2.0-24-generic x86_64) Reporter: Andrew Purtell Assignee: Jimmy Xiang Fix For: 0.96.0, 0.94.1 Attachments: HBASE-5966-1.patch, HBASE-5966-94.patch, HBASE-5966.patch, hbase-5966.patch
Some fairly recent change in Hadoop 2.0.0-alpha has broken our MapReduce test rigging. Below is a representative error, which can be easily reproduced with:
{noformat}
mvn -PlocalTests -Psecurity \
  -Dhadoop.profile=23 -Dhadoop.version=2.0.0-SNAPSHOT \
  clean test \
  -Dtest=org.apache.hadoop.hbase.mapreduce.TestTableMapReduce
{noformat}
And the result:
{noformat}
--- T E S T S ---
Running org.apache.hadoop.hbase.mapreduce.TestTableMapReduce
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 54.292 sec FAILURE!
--- Test set: org.apache.hadoop.hbase.mapreduce.TestTableMapReduce ---
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 54.292 sec FAILURE!
testMultiRegionTable(org.apache.hadoop.hbase.mapreduce.TestTableMapReduce) Time elapsed: 21.935 sec ERROR!
java.lang.reflect.UndeclaredThrowableException
	at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:135)
	at org.apache.hadoop.yarn.api.impl.pb.client.ClientRMProtocolPBClientImpl.getNewApplication(ClientRMProtocolPBClientImpl.java:134)
	at org.apache.hadoop.mapred.ResourceMgrDelegate.getNewJobID(ResourceMgrDelegate.java:183)
	at org.apache.hadoop.mapred.YARNRunner.getNewJobID(YARNRunner.java:216)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:339)
	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1226)
	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1223)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:416)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1223)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1244)
	at org.apache.hadoop.hbase.mapreduce.TestTableMapReduce.runTestOnTable(TestTableMapReduce.java:151)
	at org.apache.hadoop.hbase.mapreduce.TestTableMapReduce.testMultiRegionTable(TestTableMapReduce.java:129)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:616)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
	at
{noformat}
[jira] [Commented] (HBASE-6389) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
[ https://issues.apache.org/jira/browse/HBASE-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418766#comment-13418766 ] stack commented on HBASE-6389: -- @Aditya Makes sense. You got what you needed from Ted? Let us know. Thanks.
[jira] [Commented] (HBASE-6389) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
[ https://issues.apache.org/jira/browse/HBASE-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418767#comment-13418767 ] Hadoop QA commented on HBASE-6389: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12537258/org.apache.hadoop.hbase.TestZooKeeper-output.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 10 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2415//console This message is automatically generated. Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments Key: HBASE-6389 URL: https://issues.apache.org/jira/browse/HBASE-6389 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.0, 0.96.0 Reporter: Aditya Kishore Assignee: Aditya Kishore Priority: Critical Fix For: 0.96.0, 0.94.2 Attachments: HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, org.apache.hadoop.hbase.TestZooKeeper-output.txt Continuing from HBASE-6375. It seems I was mistaken in my assumption that changing the value of hbase.master.wait.on.regionservers.mintostart to a sufficient number (from the default of 1) can help prevent assignment of all regions to one (or a small number of) region server(s). While this was the case in 0.90.x and 0.92.x, the behavior has changed from 0.94.0 onwards to address HBASE-4993. From 0.94.0 onwards, the Master will proceed immediately after the timeout has lapsed, even if hbase.master.wait.on.regionservers.mintostart has not been reached. Reading the current conditions of waitForRegionServers() clarifies this:
{code:title=ServerManager.java (trunk rev:1360470)}
581   /**
582    * Wait for the region servers to report in.
583    * We will wait until one of this condition is met:
584    *  - the master is stopped
585    *  - the 'hbase.master.wait.on.regionservers.timeout' is reached
586    *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
587    *    region servers is reached
588    *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
589    *    there have been no new region server in for
590    *    'hbase.master.wait.on.regionservers.interval' time
591    *
592    * @throws InterruptedException
593    */
594   public void waitForRegionServers(MonitoredTask status)
595   throws InterruptedException {
612     while (
613       !this.master.isStopped() &&
614       slept < timeout &&
615       count < maxToStart &&
616       (lastCountChange + interval > now || count < minToStart)
617       ){
{code}
So with the current conditions, the wait will end as soon as the timeout is reached even if fewer region servers have checked in with the Master, and the Master will proceed with the region assignment among these region servers alone. As mentioned in [HBASE-4993|https://issues.apache.org/jira/browse/HBASE-4993?focusedCommentId=13237196#comment-13237196], and I concur, this could have a disastrous effect in a large cluster, especially now that MSLAB is turned on. To enforce the required quorum as specified by hbase.master.wait.on.regionservers.mintostart irrespective of the timeout, these conditions need to be modified as follows:
{code:title=ServerManager.java}
..
  /**
   * Wait for the region servers to report in.
   * We will wait until one of this condition is met:
   *  - the master is stopped
   *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
   *    region servers is reached
   *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
   *    there have been no new region server in for
   *    'hbase.master.wait.on.regionservers.interval' time AND
   *    the 'hbase.master.wait.on.regionservers.timeout' is reached
   *
   * @throws InterruptedException
   */
  public void waitForRegionServers(MonitoredTask status)
..
..
  int minToStart = this.master.getConfiguration().
      getInt("hbase.master.wait.on.regionservers.mintostart", 1);
  int maxToStart = this.master.getConfiguration().
      getInt("hbase.master.wait.on.regionservers.maxtostart", Integer.MAX_VALUE);
  if (maxToStart < minToStart) {
    maxToStart = minToStart;
  }
..
..
  while (
    !this.master.isStopped() &&
    count < maxToStart &&
    (lastCountChange + interval > now || timeout > slept || count < minToStart)
    ){
..
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
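The difference between the current and proposed loop predicates can be sketched as plain boolean functions. This is a hypothetical illustration with made-up parameter values, not the actual ServerManager code:

```java
// Hypothetical sketch of the two wait-loop predicates discussed above;
// method names and the toy values in main() are illustrative only.
public class WaitConditionSketch {

    // Current (0.94+) condition: the wait also ends once 'slept' reaches
    // 'timeout', even if fewer than minToStart region servers checked in.
    static boolean keepWaitingCurrent(boolean stopped, long slept, long timeout,
            int count, int minToStart, int maxToStart,
            long lastCountChange, long interval, long now) {
        return !stopped
                && slept < timeout
                && count < maxToStart
                && (lastCountChange + interval > now || count < minToStart);
    }

    // Proposed condition: minToStart is enforced even after the timeout,
    // so the master keeps waiting until the quorum has checked in.
    static boolean keepWaitingProposed(boolean stopped, long slept, long timeout,
            int count, int minToStart, int maxToStart,
            long lastCountChange, long interval, long now) {
        return !stopped
                && count < maxToStart
                && (lastCountChange + interval > now
                        || timeout > slept
                        || count < minToStart);
    }

    public static void main(String[] args) {
        // Timeout already lapsed (slept=5000 >= timeout=4500), but only
        // 1 of the required 3 region servers has checked in:
        System.out.println(keepWaitingCurrent(false, 5000, 4500, 1, 3, 10, 0, 1500, 10000));  // false: gives up
        System.out.println(keepWaitingProposed(false, 5000, 4500, 1, 3, 10, 0, 1500, 10000)); // true: keeps waiting
    }
}
```

With the proposed predicate, the timeout alone can no longer end the wait while the check-in count is below the configured minimum.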
[jira] [Commented] (HBASE-6393) Decouple audit event creation from storage in AccessController
[ https://issues.apache.org/jira/browse/HBASE-6393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418768#comment-13418768 ] Hadoop QA commented on HBASE-6393: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12537256/hbase-6393-v1.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings). -1 findbugs. The patch appears to introduce 15 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2414//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2414//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2414//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2414//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2414//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2414//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2414//console This message is automatically generated. 
Decouple audit event creation from storage in AccessController -- Key: HBASE-6393 URL: https://issues.apache.org/jira/browse/HBASE-6393 Project: HBase Issue Type: Brainstorming Components: security Affects Versions: 0.96.0 Reporter: Marcelo Vanzin Attachments: hbase-6393-v1.patch Currently, AccessController takes care of both generating audit events (by performing access checks) and storing them (by creating a log message and writing it to the AUDITLOG logger). This makes the logging system the only way to catch audit events. It means that if someone wants to do something fancier (like writing these records to a database somewhere), they need to hack through the logging system and parse the messages generated by AccessController, which is not optimal. The attached patch decouples generation and storage by introducing a new interface, used by AccessController, to log the audit events. The current, log-based storage is kept in place so that current users won't be affected by the change. I'm filing this as an RFC at this point, so the patch is not totally clean; it's on top of HBase 0.92 (which is easier for me to test) and doesn't have any unit tests, for starters. But the changes should be very similar on trunk - I don't remember changes in this particular area of the code between those versions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
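The decoupling the patch proposes can be illustrated with a small sketch. The names (AuditEventSink, LogAuditSink, AccessChecker) and the toy policy below are assumptions for illustration, not the interface from the attached hbase-6393-v1.patch:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: all names and the toy access policy here are
// hypothetical, not the actual interface introduced by the patch.
public class AuditDecouplingSketch {

    // Audit events are handed to a pluggable sink instead of being
    // formatted straight into the AUDITLOG logger.
    interface AuditEventSink {
        void audit(String user, String action, String table, boolean allowed);
    }

    // Default sink preserving log-style behavior (a List stands in for
    // the logger so the sketch stays self-contained and testable).
    static class LogAuditSink implements AuditEventSink {
        final List<String> lines = new ArrayList<>();
        public void audit(String user, String action, String table, boolean allowed) {
            lines.add((allowed ? "granted" : "denied") + " " + user
                    + " " + action + " on " + table);
        }
    }

    // The access-check side only generates events; where they go is the
    // sink's concern, so a database-backed sink could be swapped in
    // without parsing log messages.
    static class AccessChecker {
        private final AuditEventSink sink;
        AccessChecker(AuditEventSink sink) { this.sink = sink; }
        boolean check(String user, String action, String table) {
            boolean allowed = "admin".equals(user); // toy policy for the sketch
            sink.audit(user, action, table, allowed);
            return allowed;
        }
    }

    public static void main(String[] args) {
        LogAuditSink sink = new LogAuditSink();
        AccessChecker checker = new AccessChecker(sink);
        checker.check("admin", "scan", "t1");
        checker.check("alice", "put", "t1");
        sink.lines.forEach(System.out::println);
    }
}
```

The design point is the same as in the issue: generation and storage meet only at the interface, so downstream consumers no longer depend on the logging system's message format.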
[jira] [Commented] (HBASE-6417) hbck merges .META. regions if there's an old leftover
[ https://issues.apache.org/jira/browse/HBASE-6417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418769#comment-13418769 ] Jonathan Hsieh commented on HBASE-6417: --- Did you keep a copy of the hbck details before you ran the -repair option? hbck merges .META. regions if there's an old leftover - Key: HBASE-6417 URL: https://issues.apache.org/jira/browse/HBASE-6417 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Fix For: 0.96.0, 0.94.2 Attachments: hbck.log Trying to see what caused HBASE-6310, one of the things I figured is that the bad .META. row is actually one from the time that we were permitting meta splitting and that folder had just been staying there for a while. So I tried to recreate the issue with -repair and it merged my good .META. region with the one that's 3 years old that also has the same start key. I ended up with a brand new .META. region! I'll be attaching the full log in a separate file. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6389) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
[ https://issues.apache.org/jira/browse/HBASE-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418771#comment-13418771 ] Lars Hofhansl commented on HBASE-6389: -- I'd like to leave this with 0.94.2. Unless you think this must go into 0.94.1 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments Key: HBASE-6389 URL: https://issues.apache.org/jira/browse/HBASE-6389 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.0, 0.96.0 Reporter: Aditya Kishore Assignee: Aditya Kishore Priority: Critical Fix For: 0.96.0, 0.94.2 Attachments: HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, org.apache.hadoop.hbase.TestZooKeeper-output.txt -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5966) MapReduce based tests broken on Hadoop 2.0.0-alpha
[ https://issues.apache.org/jira/browse/HBASE-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418772#comment-13418772 ] Jimmy Xiang commented on HBASE-5966: looks good to me, will commit to 0.94 tonight if no objection. MapReduce based tests broken on Hadoop 2.0.0-alpha -- Key: HBASE-5966 URL: https://issues.apache.org/jira/browse/HBASE-5966 Project: HBase Issue Type: Bug Components: mapred, mapreduce, test Affects Versions: 0.94.0, 0.96.0 Environment: Hadoop 2.0.0-alpha-SNAPSHOT, HBase 0.94.0-SNAPSHOT, Ubuntu 12.04 LTS (GNU/Linux 3.2.0-24-generic x86_64) Reporter: Andrew Purtell Assignee: Jimmy Xiang Fix For: 0.96.0, 0.94.1 Attachments: HBASE-5966-1.patch, HBASE-5966-94.patch, HBASE-5966.patch, hbase-5966.patch Some fairly recent change in Hadoop 2.0.0-alpha has broken our MapReduce test rigging. Below is a representative error, can be easily reproduced with: {noformat} mvn -PlocalTests -Psecurity \ -Dhadoop.profile=23 -Dhadoop.version=2.0.0-SNAPSHOT \ clean test \ -Dtest=org.apache.hadoop.hbase.mapreduce.TestTableMapReduce {noformat} And the result: {noformat} --- T E S T S --- Running org.apache.hadoop.hbase.mapreduce.TestTableMapReduce Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 54.292 sec FAILURE! --- Test set: org.apache.hadoop.hbase.mapreduce.TestTableMapReduce --- Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 54.292 sec FAILURE! testMultiRegionTable(org.apache.hadoop.hbase.mapreduce.TestTableMapReduce) Time elapsed: 21.935 sec ERROR! 
java.lang.reflect.UndeclaredThrowableException at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:135) at org.apache.hadoop.yarn.api.impl.pb.client.ClientRMProtocolPBClientImpl.getNewApplication(ClientRMProtocolPBClientImpl.java:134) at org.apache.hadoop.mapred.ResourceMgrDelegate.getNewJobID(ResourceMgrDelegate.java:183) at org.apache.hadoop.mapred.YARNRunner.getNewJobID(YARNRunner.java:216) at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:339) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1226) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1223) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapreduce.Job.submit(Job.java:1223) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1244) at org.apache.hadoop.hbase.mapreduce.TestTableMapReduce.runTestOnTable(TestTableMapReduce.java:151) at org.apache.hadoop.hbase.mapreduce.TestTableMapReduce.testMultiRegionTable(TestTableMapReduce.java:129) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:47) at org.junit.rules.RunRules.evaluate(RunRules.java:18) at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at
[jira] [Commented] (HBASE-6417) hbck merges .META. regions if there's an old leftover
[ https://issues.apache.org/jira/browse/HBASE-6417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418774#comment-13418774 ] Jean-Daniel Cryans commented on HBASE-6417: --- No, but I can reproduce. hbck merges .META. regions if there's an old leftover - Key: HBASE-6417 URL: https://issues.apache.org/jira/browse/HBASE-6417 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Fix For: 0.96.0, 0.94.2 Attachments: hbck.log -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6310) -ROOT- corruption when .META. is using the old encoding scheme
[ https://issues.apache.org/jira/browse/HBASE-6310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418777#comment-13418777 ] Jonathan Hsieh commented on HBASE-6310: --- hbck writes directly to .META. but I don't think it ever writes to root unless you put the -metaonly flag on. It may be possible that if there were two .META. region dirs, hbck tried to pull in the old .META. dir. This would probably write something goofy to .META though. If you just used the -repair option, it would have first tried to merge regions before modifying meta. (but also would likely have not modified ROOT). -ROOT- corruption when .META. is using the old encoding scheme -- Key: HBASE-6310 URL: https://issues.apache.org/jira/browse/HBASE-6310 Project: HBase Issue Type: Improvement Affects Versions: 0.94.0 Reporter: Jean-Daniel Cryans Priority: Blocker Fix For: 0.96.0, 0.94.2 We're still working the on the root cause here, but after the leap second armageddon we had a hard time getting our 0.94 cluster back up. 
This is what we saw in the logs until the master died by itself: {noformat} 2012-07-01 23:01:52,149 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: locateRegionInMeta parentTable=-ROOT-, metaLocation={region=-ROOT-,,0.70236052, hostname=sfor3s28, port=10304}, attempt=16 of 100 failed; retrying after sleep of 32000 because: HRegionInfo was null or empty in -ROOT-, row=keyvalues={.META.,,1259448304806/info:server/1341124914705/Put/vlen=14/ts=0, .META.,,1259448304806/info:serverstartcode/1341124914705/Put/vlen=8/ts=0} {noformat} (it's strange that we retry this) This was really misleading because I could see the regioninfo in a scan:
{noformat}
hbase(main):002:0> scan '-ROOT-'
ROW                    COLUMN+CELL
 .META.,,1             column=info:regioninfo, timestamp=1331755381142, value={NAME => '.META.,,1', STARTKEY => '', ENDKEY => '', ENCODED => 1028785192,}
 .META.,,1             column=info:server, timestamp=1341183448693, value=sfor3s40:10304
 .META.,,1             column=info:serverstartcode, timestamp=1341183448693, value=1341183444689
 .META.,,1             column=info:v, timestamp=1331755419291, value=\x00\x00
 .META.,,1259448304806 column=info:server, timestamp=1341124914705, value=sfor3s24:10304
 .META.,,1259448304806 column=info:serverstartcode, timestamp=1341124914705, value=1341124455863
{noformat}
Except that the devil is in the details: .META.,,1 is not .META.,,1259448304806. Basically something writes to .META. by directly creating the row key without caring if the row is in the old format. I did a deleteall in the shell and it fixed the issue... until some time later it was stuck again because the edits reappeared (still not sure why). This time the PostOpenDeployTasksThread were stuck in the RS trying to update .META. but there was no logging (saw it with a jstack). I deleted the row again to make it work.
I'm marking this as a blocker against 0.94.2 since we're trying to get 0.94.1 out, but I wouldn't recommend upgrading to 0.94 if your cluster was created before 0.89 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
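The confusing part of the scan output above is that the two .META. row keys are distinct rows even though they render almost identically. A minimal, dependency-free sketch of the comparison (illustrative only, not code from the issue):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: the new-style and legacy .META. row keys from the
// scan above are different byte arrays, so a writer that builds the new
// key directly never touches (or cleans up) the legacy row.
public class MetaRowKeys {
    public static void main(String[] args) {
        byte[] current = ".META.,,1".getBytes(StandardCharsets.UTF_8);
        byte[] legacy = ".META.,,1259448304806".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(current, legacy)); // false: two distinct rows
    }
}
```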
[jira] [Commented] (HBASE-6389) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
[ https://issues.apache.org/jira/browse/HBASE-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418789#comment-13418789 ] Aditya Kishore commented on HBASE-6389: --- My vote was for its inclusion for 2 reasons. # This was a behavior change in 0.94.0 and I am not sure we have completely understood its impact. # In a large MSLAB-enabled cluster, I have repeatedly seen all the regions (in excess of 5K, with Σ(i=1..n)(Ri × CFi) > 8K; with MSLAB on, an RS needs 16G just to open them) being assigned to a single region server, leading it to an OOM crash and creating quite a few HBCK inconsistencies on subsequent recovery. Lastly, so far all the test failures seem to be due to errors in the test code unmasked by this change. Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments Key: HBASE-6389 URL: https://issues.apache.org/jira/browse/HBASE-6389 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.0, 0.96.0 Reporter: Aditya Kishore Assignee: Aditya Kishore Priority: Critical Fix For: 0.96.0, 0.94.2 Attachments: HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, org.apache.hadoop.hbase.TestZooKeeper-output.txt -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3432) [hbck] Add remove table switch
[ https://issues.apache.org/jira/browse/HBASE-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418790#comment-13418790 ] Jonathan Hsieh commented on HBASE-3432: --- [~vamshi] root and meta are special regions, but regions nonetheless. They get assigned to arbitrary (possibly different) region servers, and are hit on every new client's read and write path. [~juneng603] /hbase/unassigned is where regions-in-transition information is kept. These nodes are modified as regions are being assigned to particular region servers. They coordinate the state between the master doing the assigning and the RS assignee. [hbck] Add remove table switch Key: HBASE-3432 URL: https://issues.apache.org/jira/browse/HBASE-3432 Project: HBase Issue Type: New Feature Components: util Affects Versions: 0.89.20100924 Reporter: Lars George Priority: Minor This happened before and I am not sure how the new Master improves on it (this stuff is only available between the lines or buried in some people's heads - one other thing I wish for is a better place to communicate what each patch improves). Just so we do not miss it, there is an issue where sometimes disabling large tables simply times out and the table gets stuck in limbo. From the CDH User list: {quote} On Fri, Jan 7, 2011 at 1:57 PM, Sean Sechrist ssechr...@gmail.com wrote: To get them out of META, you can just scan '.META.' for that table name, and delete those rows. We had to do that a few months ago. -Sean That did it. For the benefit of others, here's code. Beware the literal table names, run at your own peril.
{quote}
{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.MetaScanner;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CleanFromMeta {

  public static class Cleaner implements MetaScanner.MetaScannerVisitor {
    public HTable meta = null;

    public Cleaner(Configuration conf) throws IOException {
      meta = new HTable(conf, ".META.");
    }

    public boolean processRow(Result rowResult) throws IOException {
      String r = new String(rowResult.getRow());
      if (r.startsWith("webtable,")) {
        meta.delete(new Delete(rowResult.getRow()));
        System.out.println("Deleting row " + rowResult);
      }
      return true;
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    MetaScanner.metaScan(conf, new Cleaner(conf), Bytes.toBytes("webtable"));
  }
}
{code}
I suggest to move this into HBaseFsck. I personally do not like to have these JRuby scripts floating around that may or may not help. This should be available if a user gets stuck and knows what he is doing (they can delete from .META. anyways). Maybe a --disable-table tablename --force or so? But since disable is already in the shell, we could add a --force there? Or add a --delete-table tablename to hbck? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3432) [hbck] Add remove table switch
[ https://issues.apache.org/jira/browse/HBASE-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418792#comment-13418792 ] Jonathan Hsieh commented on HBASE-3432: --- [~juneng603] eventually, after region assignments are completed and the region is opened on the target RS, information is updated in the META table so that other clients can go to the proper RS. [hbck] Add remove table switch Key: HBASE-3432 URL: https://issues.apache.org/jira/browse/HBASE-3432 Project: HBase Issue Type: New Feature Components: util Affects Versions: 0.89.20100924 Reporter: Lars George Priority: Minor -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5966) MapReduce based tests broken on Hadoop 2.0.0-alpha
[ https://issues.apache.org/jira/browse/HBASE-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418793#comment-13418793 ] Lars Hofhansl commented on HBASE-5966: -- +1 MapReduce based tests broken on Hadoop 2.0.0-alpha -- Key: HBASE-5966 URL: https://issues.apache.org/jira/browse/HBASE-5966 Project: HBase Issue Type: Bug Components: mapred, mapreduce, test Affects Versions: 0.94.0, 0.96.0 Environment: Hadoop 2.0.0-alpha-SNAPSHOT, HBase 0.94.0-SNAPSHOT, Ubuntu 12.04 LTS (GNU/Linux 3.2.0-24-generic x86_64) Reporter: Andrew Purtell Assignee: Jimmy Xiang Fix For: 0.96.0, 0.94.1 Attachments: HBASE-5966-1.patch, HBASE-5966-94.patch, HBASE-5966.patch, hbase-5966.patch
[jira] [Commented] (HBASE-4956) Control direct memory buffer consumption by HBaseClient
[ https://issues.apache.org/jira/browse/HBASE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418797#comment-13418797 ] Hudson commented on HBASE-4956: --- Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #100 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/100/]) HBASE-4956 Control direct memory buffer consumption by HBaseClient (Bob Copeland) (Revision 1363526) Result = FAILURE tedyu : Files : * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/client/Result.java

Control direct memory buffer consumption by HBaseClient --- Key: HBASE-4956 URL: https://issues.apache.org/jira/browse/HBASE-4956 Project: HBase Issue Type: New Feature Reporter: Ted Yu Assignee: Bob Copeland Fix For: 0.96.0, 0.94.1 Attachments: 4956.txt, thread_get.rb

As Jonathan explained here https://groups.google.com/group/asynchbase/browse_thread/thread/c45bc7ba788b2357?pli=1 , the standard hbase client inadvertently consumes large amounts of direct memory. We should consider using netty for NIO-related tasks.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6312) Make BlockCache eviction thresholds configurable
[ https://issues.apache.org/jira/browse/HBASE-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418798#comment-13418798 ] Hudson commented on HBASE-6312: --- Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #100 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/100/]) HBASE-6312 Make BlockCache eviction thresholds configurable (Jie Huang) (Revision 1363468) Result = FAILURE tedyu : Files : * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/CacheConfig.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/DoubleBlockCache.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestLruBlockCache.java Make BlockCache eviction thresholds configurable Key: HBASE-6312 URL: https://issues.apache.org/jira/browse/HBASE-6312 Project: HBase Issue Type: Improvement Components: io Affects Versions: 0.94.0 Reporter: Jie Huang Assignee: Jie Huang Priority: Minor Fix For: 0.96.0 Attachments: hbase-6312.patch, hbase-6312_v2.patch, hbase-6312_v3.patch Some of our customers found that tuning the BlockCache eviction thresholds made test results different in their test environment. However, those thresholds are not configurable in the current implementation. The only way to change those values is to re-compile the HBase source code. We wonder if it is possible to make them configurable. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
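The pattern the patch applies (promoting hardcoded eviction thresholds to configuration-backed values, with the old constants as the defaults) can be sketched in plain Java. This is an illustrative sketch only: the property names and default values below are hypothetical stand-ins, not the actual keys introduced by the HBase patch.

```java
import java.util.Properties;

// Sketch: eviction thresholds that were compile-time constants become
// defaults that a configuration object may override at startup.
public class ConfigurableThresholdsSketch {
    static final float DEFAULT_ACCEPTABLE_FACTOR = 0.85f;
    static final float DEFAULT_MIN_FACTOR = 0.75f;

    final float acceptableFactor;
    final float minFactor;

    ConfigurableThresholdsSketch(Properties conf) {
        // Fall back to the old constant when the key is absent.
        acceptableFactor = Float.parseFloat(
                conf.getProperty("blockcache.acceptable.factor",
                        String.valueOf(DEFAULT_ACCEPTABLE_FACTOR)));
        minFactor = Float.parseFloat(
                conf.getProperty("blockcache.min.factor",
                        String.valueOf(DEFAULT_MIN_FACTOR)));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("blockcache.min.factor", "0.60");  // tune without recompiling

        ConfigurableThresholdsSketch cache = new ConfigurableThresholdsSketch(conf);
        System.out.println(cache.acceptableFactor);  // default kept
        System.out.println(cache.minFactor);         // overridden
    }
}
```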
[jira] [Commented] (HBASE-6325) [replication] Race in ReplicationSourceManager.init can initiate a failover even if the node is alive
[ https://issues.apache.org/jira/browse/HBASE-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418799#comment-13418799 ] Hudson commented on HBASE-6325: --- Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #100 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/100/]) HBASE-6325 [replication] Race in ReplicationSourceManager.init can initiate a failover even if the node is alive (Revision 1363573) Result = FAILURE jdcryans : Files : * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java [replication] Race in ReplicationSourceManager.init can initiate a failover even if the node is alive - Key: HBASE-6325 URL: https://issues.apache.org/jira/browse/HBASE-6325 Project: HBase Issue Type: Bug Affects Versions: 0.90.6, 0.92.1, 0.94.0 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Fix For: 0.92.2, 0.96.0, 0.94.1 Attachments: HBASE-6325-0.92-v2.patch, HBASE-6325-0.92.patch Yet another bug found during the leap second madness, it's possible to miss the registration of new region servers so that in ReplicationSourceManager.init we start the failover of a live and replicating region server. 
I don't think there's data loss but the RS that's being failed over will die on: {noformat} 2012-07-01 06:25:15,604 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server sv4r23s48,10304,1341112194623: Writing replication status org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/sv4r23s48,10304,1341112194623/4/sv4r23s48%2C10304%2C1341112194623.1341112195369 at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1246) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372) at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:655) at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:697) at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:470) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:607) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:368) {noformat} It seems to me that just refreshing {{otherRegionServers}} after getting the list of {{currentReplicators}} would be enough to fix this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
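The race and the suggested fix can be modeled with a small standalone sketch (plain Java sets, not HBase code). A server counts as a failover candidate when it appears under the replication znodes but not in the cached live-server list; re-reading the live list after fetching the replicators removes the false positive.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the init-time race: a region server that registered between
// the two reads looks like a dead replicator against a stale live list.
public class ReplicationInitRaceSketch {

    // A replicator absent from the live-server list is queued for failover.
    static Set<String> failoverCandidates(List<String> currentReplicators,
                                          List<String> otherRegionServers) {
        Set<String> dead = new HashSet<>(currentReplicators);
        dead.removeAll(otherRegionServers);
        return dead;
    }

    public static void main(String[] args) {
        // Live-server snapshot taken BEFORE rs2 registered...
        List<String> staleLiveList = List.of("rs1");
        // ...but rs2 already shows up as a replicator in ZooKeeper.
        List<String> replicators = List.of("rs1", "rs2");

        // Race: live rs2 is wrongly treated as dead.
        System.out.println(failoverCandidates(replicators, staleLiveList));     // [rs2]

        // Fix from the report: refresh the live list AFTER reading replicators.
        List<String> refreshedLiveList = List.of("rs1", "rs2");
        System.out.println(failoverCandidates(replicators, refreshedLiveList)); // []
    }
}
```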
[jira] [Reopened] (HBASE-6276) TestClassLoading is racy
[ https://issues.apache.org/jira/browse/HBASE-6276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Purtell reopened HBASE-6276: --- Assignee: (was: Andrew Purtell) TestClassLoading is racy Key: HBASE-6276 URL: https://issues.apache.org/jira/browse/HBASE-6276 Project: HBase Issue Type: Bug Components: coprocessors, test Affects Versions: 0.92.2, 0.96.0, 0.94.1 Reporter: Andrew Purtell Priority: Minor Attachments: HBASE-6276-0.94.patch, HBASE-6276.patch
[jira] [Commented] (HBASE-6319) ReplicationSource can call terminate on itself and deadlock
[ https://issues.apache.org/jira/browse/HBASE-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418802#comment-13418802 ] Hudson commented on HBASE-6319: --- Integrated in HBase-0.94 #343 (See [https://builds.apache.org/job/HBase-0.94/343/]) HBASE-6319 ReplicationSource can call terminate on itself and deadlock HBASE-6325 [replication] Race in ReplicationSourceManager.init can initiate a failover even if the node is alive (Revision 1363570) Result = SUCCESS jdcryans : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java ReplicationSource can call terminate on itself and deadlock --- Key: HBASE-6319 URL: https://issues.apache.org/jira/browse/HBASE-6319 Project: HBase Issue Type: Bug Affects Versions: 0.90.6, 0.92.1, 0.94.0 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Fix For: 0.92.2, 0.94.1 Attachments: HBASE-6319-0.92.patch In a few places in the ReplicationSource code calls terminate on itself which is a problem since in terminate() we wait on that thread to die. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
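The deadlock shape described here (a thread's terminate() joining on the thread itself) is easy to reproduce and to guard against. Below is a minimal standalone sketch, not the actual ReplicationSource code: without the current-thread check, join() inside terminate() would wait forever when the worker stops itself from its own run loop.

```java
// Sketch of the self-terminate hazard and the usual guard: only join the
// worker when terminate() is invoked from a different thread.
public class SelfTerminateSketch {

    static class Worker extends Thread {
        volatile boolean running = true;

        @Override
        public void run() {
            while (running) {
                // ... ship edits ...
                terminate();          // the source decides to stop itself
            }
        }

        void terminate() {
            running = false;
            // Without this check, join() deadlocks when terminate() is
            // called from the worker's own run loop: a thread joining
            // itself waits for its own death, which never comes.
            if (Thread.currentThread() != this) {
                try {
                    join();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Worker w = new Worker();
        w.start();
        w.join(5000);                 // returns promptly thanks to the guard
        System.out.println("worker alive: " + w.isAlive());
    }
}
```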
[jira] [Commented] (HBASE-6325) [replication] Race in ReplicationSourceManager.init can initiate a failover even if the node is alive
[ https://issues.apache.org/jira/browse/HBASE-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418803#comment-13418803 ] Hudson commented on HBASE-6325: --- Integrated in HBase-0.94 #343 (See [https://builds.apache.org/job/HBase-0.94/343/]) HBASE-6319 ReplicationSource can call terminate on itself and deadlock HBASE-6325 [replication] Race in ReplicationSourceManager.init can initiate a failover even if the node is alive (Revision 1363570) Result = SUCCESS jdcryans : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java
[jira] [Resolved] (HBASE-5966) MapReduce based tests broken on Hadoop 2.0.0-alpha
[ https://issues.apache.org/jira/browse/HBASE-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang resolved HBASE-5966. Resolution: Fixed Integrated to 0.94. Thank Greg for the patch, Lars for the review.
[jira] [Commented] (HBASE-6389) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
[ https://issues.apache.org/jira/browse/HBASE-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418808#comment-13418808 ] Lars Hofhansl commented on HBASE-6389: -- @Aditya: I do agree. (see my comment about how I'm sure the logic of this change is correct). It now seems, though, that it is the default timeout that is too short (4.5s). Folks with 5k regions should know to increase the minToStart parameter and the timeout. We should document that better. I can also see changing the timeout to a failure condition (as discussed above). I'm not opposed. It's just that 0.94.1 needs to go out because of HBASE-6311, I do not want to risk delaying this further. It also seems this can use further discussion. (Sometimes it is amazing how much discussion a two line change can cause :) ) @Ted and @Stack: What do you guys think?

Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments Key: HBASE-6389 URL: https://issues.apache.org/jira/browse/HBASE-6389 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.0, 0.96.0 Reporter: Aditya Kishore Assignee: Aditya Kishore Priority: Critical Fix For: 0.96.0, 0.94.2 Attachments: HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, org.apache.hadoop.hbase.TestZooKeeper-output.txt

Continuing from HBASE-6375. It seems I was mistaken in my assumption that changing the value of hbase.master.wait.on.regionservers.mintostart to a sufficient number (from the default of 1) can help prevent assignment of all regions to one (or a small number of) region server(s). While this was the case in 0.90.x and 0.92.x, the behavior changed in 0.94.0 to address HBASE-4993. From 0.94.0 onwards, the Master will proceed as soon as the timeout has lapsed, even if hbase.master.wait.on.regionservers.mintostart has not been reached.

Reading the current conditions of waitForRegionServers() clarifies it:

{code:title=ServerManager.java (trunk rev:1360470)}
581   /**
582    * Wait for the region servers to report in.
583    * We will wait until one of this condition is met:
584    *  - the master is stopped
585    *  - the 'hbase.master.wait.on.regionservers.timeout' is reached
586    *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
587    *    region servers is reached
588    *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
589    *    there have been no new region server in for
590    *    'hbase.master.wait.on.regionservers.interval' time
591    *
592    * @throws InterruptedException
593    */
594   public void waitForRegionServers(MonitoredTask status)
595   throws InterruptedException {
...
612     while (
613       !this.master.isStopped() &&
614       slept < timeout &&
615       count < maxToStart &&
616       (lastCountChange+interval > now || count < minToStart)
617       ){
{code}

So with the current conditions, the wait will end as soon as the timeout is reached even if fewer RSes than required have checked in with the Master, and the Master will proceed with region assignment among these RSes alone. As mentioned in [HBASE-4993|https://issues.apache.org/jira/browse/HBASE-4993?focusedCommentId=13237196#comment-13237196], and I concur, this could have a disastrous effect in a large cluster, especially now that MSLAB is turned on.

To enforce the required quorum as specified by hbase.master.wait.on.regionservers.mintostart irrespective of the timeout, these conditions need to be modified as follows:

{code:title=ServerManager.java}
..
  /**
   * Wait for the region servers to report in.
   * We will wait until one of this condition is met:
   *  - the master is stopped
   *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
   *    region servers is reached
   *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
   *    there have been no new region server in for
   *    'hbase.master.wait.on.regionservers.interval' time AND
   *    the 'hbase.master.wait.on.regionservers.timeout' is reached
   *
   * @throws InterruptedException
   */
  public void waitForRegionServers(MonitoredTask status)
..
..
  int minToStart = this.master.getConfiguration().
      getInt("hbase.master.wait.on.regionservers.mintostart", 1);
  int maxToStart = this.master.getConfiguration().
      getInt("hbase.master.wait.on.regionservers.maxtostart", Integer.MAX_VALUE);
  if (maxToStart < minToStart) {
    maxToStart = minToStart;
  }
..
..
  while (
    !this.master.isStopped() &&
    count < maxToStart &&
    (lastCountChange+interval > now || timeout > slept || count < minToStart)
    ){
..
{code}
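The practical difference between the current and proposed conditions can be checked with a standalone sketch (plain Java, not HBase code; variable names follow the ServerManager locals quoted above). In a scenario where the timeout has lapsed but only one of three required region servers has checked in, the current condition stops waiting while the proposed one holds out for minToStart.

```java
// Sketch contrasting the two wait-loop conditions discussed in HBASE-6389.
public class WaitConditionSketch {

    // Current 0.94+ condition: the wait ends once 'slept' reaches the
    // timeout, even if fewer than minToStart region servers checked in.
    static boolean keepWaitingCurrent(boolean stopped, long slept, long timeout,
                                      int count, int minToStart, int maxToStart,
                                      long lastCountChange, long interval, long now) {
        return !stopped
                && slept < timeout
                && count < maxToStart
                && (lastCountChange + interval > now || count < minToStart);
    }

    // Proposed condition: the timeout can only end the wait once minToStart
    // is reached (it moves from the AND chain into the OR clause).
    static boolean keepWaitingProposed(boolean stopped, long slept, long timeout,
                                       int count, int minToStart, int maxToStart,
                                       long lastCountChange, long interval, long now) {
        return !stopped
                && count < maxToStart
                && (lastCountChange + interval > now || timeout > slept || count < minToStart);
    }

    public static void main(String[] args) {
        // Timeout lapsed (slept == timeout == 4500ms), 1 of 3 required
        // region servers checked in, no recent check-in activity.
        boolean current = keepWaitingCurrent(false, 4500, 4500, 1, 3,
                Integer.MAX_VALUE, 0, 1500, 10_000);
        boolean proposed = keepWaitingProposed(false, 4500, 4500, 1, 3,
                Integer.MAX_VALUE, 0, 1500, 10_000);
        System.out.println("current keeps waiting:  " + current);   // master proceeds short-handed
        System.out.println("proposed keeps waiting: " + proposed);  // master holds for minToStart
    }
}
```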
[jira] [Comment Edited] (HBASE-6389) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
[ https://issues.apache.org/jira/browse/HBASE-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418808#comment-13418808 ] Lars Hofhansl edited comment on HBASE-6389 at 7/19/12 11:47 PM: @Aditya: I do agree. (see my comment about how I'm sure the logic of this change is correct). It now seems, though, that it is the default timeout that is too short (4.5s). Folks with 5k regions should know to increase the minToStart parameter and the timeout. We should document that better. I can also see to change the timeout to failure condition (as discussed above). I'm not opposed. It's just that 0.94.1 needs to go out because of HBASE-6311, I do not want to risk delaying this further. It also seems this can use further discussion. (Sometimes it is amazing how much discussion a two line change can cause :) ) @Ted and @Stack: What do you guys think? Edit: Spelling. was (Author: lhofhansl): @Aditya: I do agree. (see my comment about how I'm sure the logic of this change is correct). It now seems, though, that it is the default timeout that is too short (4.5s). Folks with 5k regions should know to increase the minToStart parameter and the timeout. We should document that better. I can also see to change the timeout to failure condition (as discussed above). I'm not opposed. It's just that 0.94.1 needs to go out because of HBASE-6311, I do not want to risk delaying this further. It also seems this can use further discussion. (Sometimes it is amazing how much discussion as two change can cause :) ) @Ted and @Stack: What do you guys think? 
[jira] [Commented] (HBASE-6389) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
[ https://issues.apache.org/jira/browse/HBASE-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418812#comment-13418812 ] Aditya Kishore commented on HBASE-6389: --- @Lars Completely agree and definitely would not want to hold 0.94.1 for this. (That's why My vote *was*... :) ). Documentation can take care of this in 0.94.1
[jira] [Commented] (HBASE-6325) [replication] Race in ReplicationSourceManager.init can initiate a failover even if the node is alive
[ https://issues.apache.org/jira/browse/HBASE-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418813#comment-13418813 ] Hudson commented on HBASE-6325: --- Integrated in HBase-TRUNK #3154 (See [https://builds.apache.org/job/HBase-TRUNK/3154/]) HBASE-6325 [replication] Race in ReplicationSourceManager.init can initiate a failover even if the node is alive (Revision 1363573) Result = SUCCESS jdcryans : Files : * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java
[jira] [Commented] (HBASE-6389) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
[ https://issues.apache.org/jira/browse/HBASE-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418815#comment-13418815 ] Lars Hofhansl commented on HBASE-6389: -- :) didn't pick up on the was Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments Key: HBASE-6389 URL: https://issues.apache.org/jira/browse/HBASE-6389 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.0, 0.96.0 Reporter: Aditya Kishore Assignee: Aditya Kishore Priority: Critical Fix For: 0.96.0, 0.94.2 Attachments: HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, org.apache.hadoop.hbase.TestZooKeeper-output.txt Continuing from HBASE-6375. It seems I was mistaken in my assumption that changing the value of hbase.master.wait.on.regionservers.mintostart to a sufficient number (from default of 1) can help prevent assignment of all regions to one (or a small number of) region server(s). While this was the case in 0.90.x and 0.92.x, the behavior has changed in 0.94.0 onwards to address HBASE-4993. From 0.94.0 onwards, Master will proceed immediately after the timeout has lapsed, even if hbase.master.wait.on.regionservers.mintostart has not reached. Reading the current conditions of waitForRegionServers() clarifies it {code:title=ServerManager.java (trunk rev:1360470)} 581 /** 582 * Wait for the region servers to report in. 
583    * We will wait until one of this condition is met:
584    *  - the master is stopped
585    *  - the 'hbase.master.wait.on.regionservers.timeout' is reached
586    *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
587    *    region servers is reached
588    *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
589    *    there have been no new region server in for
590    *    'hbase.master.wait.on.regionservers.interval' time
591    *
592    * @throws InterruptedException
593    */
594   public void waitForRegionServers(MonitoredTask status)
595       throws InterruptedException {
...
612     while (
613       !this.master.isStopped() &&
614       slept < timeout &&
615       count < maxToStart &&
616       (lastCountChange + interval > now || count < minToStart)
617     ) {
{code}
So with the current conditions, the wait will end as soon as the timeout is reached, even if fewer RSes have checked in with the Master, and the master will proceed with region assignment among these RSes alone. As mentioned in -[HBASE-4993|https://issues.apache.org/jira/browse/HBASE-4993?focusedCommentId=13237196#comment-13237196]-, and I concur, this could have a disastrous effect in a large cluster, especially now that MSLAB is turned on. To enforce the required quorum specified by hbase.master.wait.on.regionservers.mintostart irrespective of the timeout, these conditions need to be modified as follows {code:title=ServerManager.java}
..
/**
 * Wait for the region servers to report in.
 * We will wait until one of this condition is met:
 *  - the master is stopped
 *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
 *    region servers is reached
 *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
 *    there have been no new region server in for
 *    'hbase.master.wait.on.regionservers.interval' time AND
 *    the 'hbase.master.wait.on.regionservers.timeout' is reached
 *
 * @throws InterruptedException
 */
public void waitForRegionServers(MonitoredTask status)
..
..
int minToStart = this.master.getConfiguration().
    getInt("hbase.master.wait.on.regionservers.mintostart", 1);
int maxToStart = this.master.getConfiguration().
    getInt("hbase.master.wait.on.regionservers.maxtostart", Integer.MAX_VALUE);
if (maxToStart < minToStart) {
  maxToStart = minToStart;
}
..
..
while (
  !this.master.isStopped() &&
  count < maxToStart &&
  (lastCountChange + interval > now || timeout > slept || count < minToStart)
) {
..
{code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
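The difference between the current and the proposed loop conditions can be isolated as pure boolean predicates. The helper class and parameter layout below are assumptions made for illustration, but the boolean logic mirrors the two quoted code blocks: under the proposed condition, reaching the timeout alone no longer ends the wait while fewer than minToStart region servers have checked in.

```java
public class WaitCondition {
    // Current trunk condition: exits as soon as the timeout lapses,
    // even if fewer than minToStart region servers have checked in.
    static boolean keepWaitingOld(boolean stopped, long slept, long timeout,
            int count, int minToStart, int maxToStart,
            long lastCountChange, long interval, long now) {
        return !stopped
            && slept < timeout
            && count < maxToStart
            && (lastCountChange + interval > now || count < minToStart);
    }

    // Proposed condition: the timeout is only one disjunct, so the wait
    // continues while count < minToStart regardless of elapsed time.
    static boolean keepWaitingNew(boolean stopped, long slept, long timeout,
            int count, int minToStart, int maxToStart,
            long lastCountChange, long interval, long now) {
        return !stopped
            && count < maxToStart
            && (lastCountChange + interval > now
                || timeout > slept
                || count < minToStart);
    }

    public static void main(String[] args) {
        // Timeout lapsed, only 1 of the 3 required RSes checked in, no recent
        // check-in: the old condition stops waiting, the new one keeps waiting.
        boolean old = keepWaitingOld(false, 5000, 4500, 1, 3,
                Integer.MAX_VALUE, 0, 1500, 10000);
        boolean proposed = keepWaitingNew(false, 5000, 4500, 1, 3,
                Integer.MAX_VALUE, 0, 1500, 10000);
        System.out.println(old + " " + proposed); // false true
    }
}
```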
[jira] [Updated] (HBASE-6405) Create Hadoop compatibility modules and Metrics2 implementation of replication metrics
[ https://issues.apache.org/jira/browse/HBASE-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Elliott Clark updated HBASE-6405: - Resolution: Fixed Status: Resolved (was: Patch Available) Create Hadoop compatibility modules and Metrics2 implementation of replication metrics - Key: HBASE-6405 URL: https://issues.apache.org/jira/browse/HBASE-6405 Project: HBase Issue Type: Sub-task Reporter: Zhihong Ted Yu Assignee: Elliott Clark Fix For: 0.96.0 Attachments: 6405.txt, HBASE-6405-ADD.patch, hbase-6405-addendum-2-v2.patch, hbase-6405-addendum-2.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-6431) Some FilterList Constructors break addFilter
Alex Newman created HBASE-6431: -- Summary: Some FilterList Constructors break addFilter Key: HBASE-6431 URL: https://issues.apache.org/jira/browse/HBASE-6431 Project: HBase Issue Type: Bug Reporter: Alex Newman Assignee: Alex Newman Some of the constructors for FilterList set the internal list of filters to list types which don't support the add operation. As a result FilterList(final List<Filter> rowFilters), FilterList(final Filter... rowFilters), FilterList(final Operator operator, final List<Filter> rowFilters) and FilterList(final Operator operator, final Filter... rowFilters) may init private List<Filter> filters = new ArrayList<Filter>(); incorrectly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
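The failure mode described above can be reproduced with plain JDK collections: Arrays.asList (the kind of fixed-size list a varargs constructor might plausibly store) returns a view whose add() throws, while copying into a fresh ArrayList restores a growable list. This is a generic illustration, not the actual FilterList code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FixedSizeListDemo {
    // Returns true if add() throws on the given list.
    static boolean addFails(List<String> list) {
        try {
            list.add("x");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Arrays.asList returns a fixed-size view backed by the array:
        // storing it directly breaks any later addFilter()-style call.
        System.out.println(addFails(Arrays.asList("a", "b"))); // true

        // The defensive fix: copy into a growable ArrayList first.
        List<String> safe = new ArrayList<String>(Arrays.asList("a", "b"));
        System.out.println(addFails(safe)); // false
    }
}
```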
[jira] [Updated] (HBASE-6431) Some FilterList Constructors break addFilter
[ https://issues.apache.org/jira/browse/HBASE-6431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Newman updated HBASE-6431: --- Attachment: 0001-HBASE-6431.-Some-FilterList-Constructors-break-addFi.patch Some FilterList Constructors break addFilter Key: HBASE-6431 URL: https://issues.apache.org/jira/browse/HBASE-6431 Project: HBase Issue Type: Bug Reporter: Alex Newman Assignee: Alex Newman Attachments: 0001-HBASE-6431.-Some-FilterList-Constructors-break-addFi.patch Some of the constructors for FilterList set the internal list of filters to list types which don't support the add operation. As a result FilterList(final List<Filter> rowFilters), FilterList(final Filter... rowFilters), FilterList(final Operator operator, final List<Filter> rowFilters) and FilterList(final Operator operator, final Filter... rowFilters) may init private List<Filter> filters = new ArrayList<Filter>(); incorrectly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6411) Move Master Metrics to metrics 2
[ https://issues.apache.org/jira/browse/HBASE-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Elliott Clark updated HBASE-6411: - Assignee: Elliott Clark (was: Alex Baranau) Status: Patch Available (was: Open) Move Master Metrics to metrics 2 Key: HBASE-6411 URL: https://issues.apache.org/jira/browse/HBASE-6411 Project: HBase Issue Type: Sub-task Reporter: Elliott Clark Assignee: Elliott Clark Attachments: HBASE-6411-0.patch, HBASE-6411_concept.patch Move Master Metrics to metrics 2 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6411) Move Master Metrics to metrics 2
[ https://issues.apache.org/jira/browse/HBASE-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Elliott Clark updated HBASE-6411: - Attachment: HBASE-6411-0.patch Here's a working implementation of master with metrics2. It includes some tests but not a whole lot. I plan to include a lot more once I am able to inject test metricsources (HBASE-6407). It doesn't include histograms of the split size (HBASE-6409). Move Master Metrics to metrics 2 Key: HBASE-6411 URL: https://issues.apache.org/jira/browse/HBASE-6411 Project: HBase Issue Type: Sub-task Reporter: Elliott Clark Assignee: Alex Baranau Attachments: HBASE-6411-0.patch, HBASE-6411_concept.patch Move Master Metrics to metrics 2 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6431) Some FilterList Constructors break addFilter
[ https://issues.apache.org/jira/browse/HBASE-6431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Newman updated HBASE-6431: --- Status: Patch Available (was: Open) Some FilterList Constructors break addFilter Key: HBASE-6431 URL: https://issues.apache.org/jira/browse/HBASE-6431 Project: HBase Issue Type: Bug Reporter: Alex Newman Assignee: Alex Newman Attachments: 0001-HBASE-6431.-Some-FilterList-Constructors-break-addFi.patch Some of the constructors for FilterList set the internal list of filters to list types which don't support the add operation. As a result FilterList(final List<Filter> rowFilters), FilterList(final Filter... rowFilters), FilterList(final Operator operator, final List<Filter> rowFilters) and FilterList(final Operator operator, final Filter... rowFilters) may init private List<Filter> filters = new ArrayList<Filter>(); incorrectly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6429) Filter with filterRow() returning true is also incompatible with scan with limit
[ https://issues.apache.org/jira/browse/HBASE-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418830#comment-13418830 ] Jie Huang commented on HBASE-6429: -- Oops. I will fix those 2 failures and regenerate the patch soon. Thanks Ted. Filter with filterRow() returning true is also incompatible with scan with limit Key: HBASE-6429 URL: https://issues.apache.org/jira/browse/HBASE-6429 Project: HBase Issue Type: Bug Components: filters Affects Versions: 0.96.0 Reporter: Jason Dai Attachments: hbase-6429_0_94_0.patch Currently if we scan with both a limit and a Filter with filterRow(List<KeyValue>) implemented, an IncompatibleFilterException will be thrown. The same exception should also be thrown if the filter has its filterRow() implemented. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
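A hedged sketch of the kind of guard the report asks for. The class and method names are hypothetical; the real check lives in HBase's scanner setup and throws IncompatibleFilterException, for which a stand-in exception is used here:

```java
public class FilterLimitCheck {
    // Illustrative stand-in for the scanner's compatibility check: a filter
    // that filters whole rows cannot be combined with a row limit, because
    // the limit may cut a row short before the filter sees all of it.
    static void checkCompatibility(boolean filterImplementsFilterRow, int limit) {
        if (limit > 0 && filterImplementsFilterRow) {
            // Stand-in for IncompatibleFilterException.
            throw new IllegalStateException(
                "Filter with filterRow() cannot be used with a scan limit");
        }
    }

    public static void main(String[] args) {
        checkCompatibility(true, 0);   // no limit: fine
        checkCompatibility(false, 5);  // limit but no row-level filtering: fine
        boolean threw = false;
        try {
            checkCompatibility(true, 5); // both set: must be rejected
        } catch (IllegalStateException e) {
            threw = true;
        }
        System.out.println(threw); // true
    }
}
```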
[jira] [Updated] (HBASE-6431) Some FilterList Constructors break addFilter
[ https://issues.apache.org/jira/browse/HBASE-6431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Newman updated HBASE-6431: --- Priority: Minor (was: Major) Some FilterList Constructors break addFilter Key: HBASE-6431 URL: https://issues.apache.org/jira/browse/HBASE-6431 Project: HBase Issue Type: Bug Reporter: Alex Newman Assignee: Alex Newman Priority: Minor Attachments: 0001-HBASE-6431.-Some-FilterList-Constructors-break-addFi.patch Some of the constructors for FilterList set the internal list of filters to list types which don't support the add operation. As a result FilterList(final List<Filter> rowFilters), FilterList(final Filter... rowFilters), FilterList(final Operator operator, final List<Filter> rowFilters) and FilterList(final Operator operator, final Filter... rowFilters) may init private List<Filter> filters = new ArrayList<Filter>(); incorrectly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6431) Some FilterList Constructors break addFilter
[ https://issues.apache.org/jira/browse/HBASE-6431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Newman updated HBASE-6431: --- Component/s: filters Affects Version/s: 0.92.1 0.94.0 Some FilterList Constructors break addFilter Key: HBASE-6431 URL: https://issues.apache.org/jira/browse/HBASE-6431 Project: HBase Issue Type: Bug Components: filters Affects Versions: 0.92.1, 0.94.0 Reporter: Alex Newman Assignee: Alex Newman Priority: Minor Attachments: 0001-HBASE-6431.-Some-FilterList-Constructors-break-addFi.patch Some of the constructors for FilterList set the internal list of filters to list types which don't support the add operation. As a result FilterList(final List<Filter> rowFilters), FilterList(final Filter... rowFilters), FilterList(final Operator operator, final List<Filter> rowFilters) and FilterList(final Operator operator, final Filter... rowFilters) may init private List<Filter> filters = new ArrayList<Filter>(); incorrectly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5966) MapReduce based tests broken on Hadoop 2.0.0-alpha
[ https://issues.apache.org/jira/browse/HBASE-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418842#comment-13418842 ] Hudson commented on HBASE-5966: --- Integrated in HBase-0.94 #344 (See [https://builds.apache.org/job/HBase-0.94/344/]) HBASE-5966 MapReduce based tests broken on Hadoop 2.0.0-alpha (Gregory Chanan) (Revision 1363586) Result = FAILURE jxiang : Files : * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java MapReduce based tests broken on Hadoop 2.0.0-alpha -- Key: HBASE-5966 URL: https://issues.apache.org/jira/browse/HBASE-5966 Project: HBase Issue Type: Bug Components: mapred, mapreduce, test Affects Versions: 0.94.0, 0.96.0 Environment: Hadoop 2.0.0-alpha-SNAPSHOT, HBase 0.94.0-SNAPSHOT, Ubuntu 12.04 LTS (GNU/Linux 3.2.0-24-generic x86_64) Reporter: Andrew Purtell Assignee: Jimmy Xiang Fix For: 0.96.0, 0.94.1 Attachments: HBASE-5966-1.patch, HBASE-5966-94.patch, HBASE-5966.patch, hbase-5966.patch Some fairly recent change in Hadoop 2.0.0-alpha has broken our MapReduce test rigging. Below is a representative error, can be easily reproduced with: {noformat} mvn -PlocalTests -Psecurity \ -Dhadoop.profile=23 -Dhadoop.version=2.0.0-SNAPSHOT \ clean test \ -Dtest=org.apache.hadoop.hbase.mapreduce.TestTableMapReduce {noformat} And the result: {noformat} --- T E S T S --- Running org.apache.hadoop.hbase.mapreduce.TestTableMapReduce Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 54.292 sec FAILURE! --- Test set: org.apache.hadoop.hbase.mapreduce.TestTableMapReduce --- Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 54.292 sec FAILURE! testMultiRegionTable(org.apache.hadoop.hbase.mapreduce.TestTableMapReduce) Time elapsed: 21.935 sec ERROR! 
java.lang.reflect.UndeclaredThrowableException at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:135) at org.apache.hadoop.yarn.api.impl.pb.client.ClientRMProtocolPBClientImpl.getNewApplication(ClientRMProtocolPBClientImpl.java:134) at org.apache.hadoop.mapred.ResourceMgrDelegate.getNewJobID(ResourceMgrDelegate.java:183) at org.apache.hadoop.mapred.YARNRunner.getNewJobID(YARNRunner.java:216) at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:339) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1226) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1223) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapreduce.Job.submit(Job.java:1223) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1244) at org.apache.hadoop.hbase.mapreduce.TestTableMapReduce.runTestOnTable(TestTableMapReduce.java:151) at org.apache.hadoop.hbase.mapreduce.TestTableMapReduce.testMultiRegionTable(TestTableMapReduce.java:129) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:47) at org.junit.rules.RunRules.evaluate(RunRules.java:18) at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at
[jira] [Commented] (HBASE-6386) Audit log messages do not include column family / qualifier information consistently
[ https://issues.apache.org/jira/browse/HBASE-6386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418844#comment-13418844 ] Marcelo Vanzin commented on HBASE-6386: --- Other methods also seem to suffer from similar issues; for example, preIncrementColumnValue does this: {code} requirePermission(TablePermission.Action.WRITE, c.getEnvironment(), Arrays.asList(new byte[][]{family})); {code} Even though there is a qualifier argument, the qualifier information never makes it to the audit log. It also kinda sucks that there's no standard family map type for all these operations, so to come up with one common type for auditing, you'd have to make copies of that data (or use ugly wrapper objects). Audit log messages do not include column family / qualifier information consistently Key: HBASE-6386 URL: https://issues.apache.org/jira/browse/HBASE-6386 Project: HBase Issue Type: Improvement Components: security Reporter: Marcelo Vanzin The code related to this issue is in AccessController.java:permissionGranted(). When creating audit logs, that method will do one of the following: * grant access, create audit log with table name only * deny access because of table permission, create audit log with table name only * deny access because of column family / qualifier permission, create audit log with specific family / qualifier So, in the case where more than one column family and/or qualifier are in the same request, there will be a loss of information. Even in the case where only one column family and/or qualifier is involved, information may be lost. It would be better if this behavior consistently included all the information in the request; regardless of access being granted or denied, and regardless of which permission caused the denial, the column family and qualifier info should be part of the audit log message. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
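One way to make the audit message consistent, as suggested above, is to render every requested family and qualifier into the log line regardless of the grant/deny outcome. A hypothetical sketch, with String stand-ins for the byte[] family map used by the real AccessController:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AuditMessage {
    // Build one audit string covering ALL families/qualifiers in the request,
    // so no information is lost whichever permission check decides the outcome.
    static String describe(String table, Map<String, List<String>> familyMap) {
        StringBuilder sb = new StringBuilder("table=").append(table);
        for (Map.Entry<String, List<String>> e : familyMap.entrySet()) {
            sb.append(", family=").append(e.getKey())
              .append(", qualifiers=").append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> fams = new LinkedHashMap<String, List<String>>();
        fams.put("cf1", Arrays.asList("q1", "q2"));
        fams.put("cf2", Arrays.asList("q3"));
        System.out.println(describe("t1", fams));
        // table=t1, family=cf1, qualifiers=[q1, q2], family=cf2, qualifiers=[q3]
    }
}
```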
[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover
[ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418863#comment-13418863 ] Gregory Chanan commented on HBASE-5843: --- Looks great so far, nkeywal. Some questions: {quote} 2) Kill -9 of a RS; wait for all regions to become online again: 0.92: 980s 0.96: ~13s = The 180s gap comes from HBASE-5844. For master, HBASE-5926 is not tested but should bring similar results. {quote} I'm confused as to what the 180s gap refers to. I see 980 (test 2) - 800 (test1) = 180, but that is against 0.92, which doesn't have HBASE-5970, right? Could you clarify? {quote} 3) Start of the cluster after a clean stop; wait for all regions to become online. 0.92: ~1020s 0.94: ~1023s (tested once only) 0.96: ~31s = The benefit is visible at startup = This does not come from something implemented for 0.94 {quote} Awesome.. We think this is also due to HBASE-5970 and HBASE-6109? (since I assume HBASE-5844 and HBASE-5926 do not apply in this case). {quote} 7) With 2 RS, Insert 20M simple puts; then kill -9 the second one. See how long it takes to have all the regions available. 0.92) 180s detection time+ then hangs twice out of 2 tests. 0.96) 14s (hangs once out of 3) = There's a bug {quote} Has a JIRA been filed? {quote} Test to be changed to get a real difference when we need to replay the wal. {quote} Could you clarify what you mean here? Improve HBase MTTR - Mean Time To Recover - Key: HBASE-5843 URL: https://issues.apache.org/jira/browse/HBASE-5843 Project: HBase Issue Type: Umbrella Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal A part of the approach is described here: https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit The ideal target is: - failure impact client applications only by an added delay to execute a query, whatever the failure. - this delay is always inferior to 1 second. We're not going to achieve that immediately... 
Priority will be given to the most frequent issues. Short term: - software crashes - standard administrative tasks such as stop/start of a cluster. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6389) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
[ https://issues.apache.org/jira/browse/HBASE-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418866#comment-13418866 ] Zhihong Ted Yu commented on HBASE-6389: --- I ran test suite with latest patch on trunk and got: {code} Running org.apache.hadoop.hbase.client.TestHCM Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 37.265 sec FAILURE! -- Running org.apache.hadoop.hbase.client.TestAdmin Tests run: 40, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 322.872 sec FAILURE! -- Running org.apache.hadoop.hbase.catalog.TestMetaReaderEditor Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 134.193 sec FAILURE! -- Running org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine Tests run: 20, Failures: 5, Errors: 2, Skipped: 0, Time elapsed: 669.588 sec FAILURE! {code} There was one hanging test: {code} at org.apache.hadoop.hbase.replication.TestReplication.setUp(TestReplication.java:183) {code} BTW what do R sub i, C and F sub i represent in the formula above ? Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments Key: HBASE-6389 URL: https://issues.apache.org/jira/browse/HBASE-6389 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.0, 0.96.0 Reporter: Aditya Kishore Assignee: Aditya Kishore Priority: Critical Fix For: 0.96.0, 0.94.2 Attachments: HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, org.apache.hadoop.hbase.TestZooKeeper-output.txt Continuing from HBASE-6375. It seems I was mistaken in my assumption that changing the value of hbase.master.wait.on.regionservers.mintostart to a sufficient number (from default of 1) can help prevent assignment of all regions to one (or a small number of) region server(s). While this was the case in 0.90.x and 0.92.x, the behavior has changed in 0.94.0 onwards to address HBASE-4993. 
From 0.94.0 onwards, Master will proceed immediately after the timeout has lapsed, even if hbase.master.wait.on.regionservers.mintostart has not been reached. Reading the current conditions of waitForRegionServers() clarifies it {code:title=ServerManager.java (trunk rev:1360470)}
581   /**
582    * Wait for the region servers to report in.
583    * We will wait until one of this condition is met:
584    *  - the master is stopped
585    *  - the 'hbase.master.wait.on.regionservers.timeout' is reached
586    *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
587    *    region servers is reached
588    *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
589    *    there have been no new region server in for
590    *    'hbase.master.wait.on.regionservers.interval' time
591    *
592    * @throws InterruptedException
593    */
594   public void waitForRegionServers(MonitoredTask status)
595       throws InterruptedException {
...
612     while (
613       !this.master.isStopped() &&
614       slept < timeout &&
615       count < maxToStart &&
616       (lastCountChange + interval > now || count < minToStart)
617     ) {
{code}
So with the current conditions, the wait will end as soon as the timeout is reached, even if fewer RSes have checked in with the Master, and the master will proceed with region assignment among these RSes alone. As mentioned in -[HBASE-4993|https://issues.apache.org/jira/browse/HBASE-4993?focusedCommentId=13237196#comment-13237196]-, and I concur, this could have a disastrous effect in a large cluster, especially now that MSLAB is turned on. To enforce the required quorum specified by hbase.master.wait.on.regionservers.mintostart irrespective of the timeout, these conditions need to be modified as follows {code:title=ServerManager.java}
..
/**
 * Wait for the region servers to report in.
 * We will wait until one of this condition is met:
 *  - the master is stopped
 *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
 *    region servers is reached
 *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
 *    there have been no new region server in for
 *    'hbase.master.wait.on.regionservers.interval' time AND
 *    the 'hbase.master.wait.on.regionservers.timeout' is reached
 *
 * @throws InterruptedException
 */
public void waitForRegionServers(MonitoredTask status)
..
..
int minToStart = this.master.getConfiguration().
    getInt("hbase.master.wait.on.regionservers.mintostart", 1);
int maxToStart =
[jira] [Updated] (HBASE-6389) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
[ https://issues.apache.org/jira/browse/HBASE-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-6389: -- Attachment: testReplication.jstack jstack for the hanging TestReplication Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments Key: HBASE-6389 URL: https://issues.apache.org/jira/browse/HBASE-6389 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.0, 0.96.0 Reporter: Aditya Kishore Assignee: Aditya Kishore Priority: Critical Fix For: 0.96.0, 0.94.2 Attachments: HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, org.apache.hadoop.hbase.TestZooKeeper-output.txt, testReplication.jstack Continuing from HBASE-6375. It seems I was mistaken in my assumption that changing the value of hbase.master.wait.on.regionservers.mintostart to a sufficient number (from default of 1) can help prevent assignment of all regions to one (or a small number of) region server(s). While this was the case in 0.90.x and 0.92.x, the behavior has changed in 0.94.0 onwards to address HBASE-4993. From 0.94.0 onwards, Master will proceed immediately after the timeout has lapsed, even if hbase.master.wait.on.regionservers.mintostart has not reached. Reading the current conditions of waitForRegionServers() clarifies it {code:title=ServerManager.java (trunk rev:1360470)} 581 /** 582 * Wait for the region servers to report in. 
583    * We will wait until one of this condition is met:
584    *  - the master is stopped
585    *  - the 'hbase.master.wait.on.regionservers.timeout' is reached
586    *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
587    *    region servers is reached
588    *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
589    *    there have been no new region server in for
590    *    'hbase.master.wait.on.regionservers.interval' time
591    *
592    * @throws InterruptedException
593    */
594   public void waitForRegionServers(MonitoredTask status)
595       throws InterruptedException {
...
612     while (
613       !this.master.isStopped() &&
614       slept < timeout &&
615       count < maxToStart &&
616       (lastCountChange + interval > now || count < minToStart)
617     ) {
{code}
So with the current conditions, the wait will end as soon as the timeout is reached, even if fewer RSes have checked in with the Master, and the master will proceed with region assignment among these RSes alone. As mentioned in -[HBASE-4993|https://issues.apache.org/jira/browse/HBASE-4993?focusedCommentId=13237196#comment-13237196]-, and I concur, this could have a disastrous effect in a large cluster, especially now that MSLAB is turned on. To enforce the required quorum specified by hbase.master.wait.on.regionservers.mintostart irrespective of the timeout, these conditions need to be modified as follows {code:title=ServerManager.java}
..
/**
 * Wait for the region servers to report in.
 * We will wait until one of this condition is met:
 *  - the master is stopped
 *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
 *    region servers is reached
 *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
 *    there have been no new region server in for
 *    'hbase.master.wait.on.regionservers.interval' time AND
 *    the 'hbase.master.wait.on.regionservers.timeout' is reached
 *
 * @throws InterruptedException
 */
public void waitForRegionServers(MonitoredTask status)
..
..
int minToStart = this.master.getConfiguration().
    getInt("hbase.master.wait.on.regionservers.mintostart", 1);
int maxToStart = this.master.getConfiguration().
    getInt("hbase.master.wait.on.regionservers.maxtostart", Integer.MAX_VALUE);
if (maxToStart < minToStart) {
  maxToStart = minToStart;
}
..
..
while (
  !this.master.isStopped() &&
  count < maxToStart &&
  (lastCountChange + interval > now || timeout > slept || count < minToStart)
) {
..
{code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Comment Edited] (HBASE-6389) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
[ https://issues.apache.org/jira/browse/HBASE-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418866#comment-13418866 ] Zhihong Ted Yu edited comment on HBASE-6389 at 7/20/12 1:37 AM: I ran test suite with latest patch on trunk and got:
{code}
Running org.apache.hadoop.hbase.client.TestHCM
Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 37.265 sec  FAILURE!
--
Running org.apache.hadoop.hbase.client.TestAdmin
Tests run: 40, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 322.872 sec  FAILURE!
--
Running org.apache.hadoop.hbase.catalog.TestMetaReaderEditor
Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 134.193 sec  FAILURE!
--
Running org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine
Tests run: 20, Failures: 5, Errors: 2, Skipped: 0, Time elapsed: 669.588 sec  FAILURE!
{code}
There was one hanging test:
{code}
at org.apache.hadoop.hbase.replication.TestReplication.setUp(TestReplication.java:183)
{code}
BTW what do R~i~, C and F~i~ represent in the formula above ?

was (Author: zhi...@ebaysf.com):
I ran test suite with latest patch on trunk and got:
{code}
Running org.apache.hadoop.hbase.client.TestHCM
Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 37.265 sec  FAILURE!
--
Running org.apache.hadoop.hbase.client.TestAdmin
Tests run: 40, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 322.872 sec  FAILURE!
--
Running org.apache.hadoop.hbase.catalog.TestMetaReaderEditor
Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 134.193 sec  FAILURE!
--
Running org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine
Tests run: 20, Failures: 5, Errors: 2, Skipped: 0, Time elapsed: 669.588 sec  FAILURE!
{code}
There was one hanging test:
{code}
at org.apache.hadoop.hbase.replication.TestReplication.setUp(TestReplication.java:183)
{code}
BTW what do R sub i, C and F sub i represent in the formula above ?
Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments Key: HBASE-6389 URL: https://issues.apache.org/jira/browse/HBASE-6389 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.0, 0.96.0 Reporter: Aditya Kishore Assignee: Aditya Kishore Priority: Critical Fix For: 0.96.0, 0.94.2 Attachments: HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, org.apache.hadoop.hbase.TestZooKeeper-output.txt, testReplication.jstack
Continuing from HBASE-6375. It seems I was mistaken in my assumption that changing the value of hbase.master.wait.on.regionservers.mintostart to a sufficient number (from the default of 1) can help prevent assignment of all regions to one (or a small number of) region server(s). While this was the case in 0.90.x and 0.92.x, the behavior has changed from 0.94.0 onwards to address HBASE-4993. From 0.94.0 onwards, the Master will proceed immediately after the timeout has lapsed, even if hbase.master.wait.on.regionservers.mintostart has not been reached. Reading the current conditions of waitForRegionServers() clarifies it:
{code:title=ServerManager.java (trunk rev:1360470)}
581 /**
582  * Wait for the region servers to report in.
583  * We will wait until one of this condition is met:
584  *  - the master is stopped
585  *  - the 'hbase.master.wait.on.regionservers.timeout' is reached
586  *  - the 'hbase.master.wait.on.regionservers.maxtostart' number of
587  *    region servers is reached
588  *  - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
589  *    there have been no new region server in for
590  *    'hbase.master.wait.on.regionservers.interval' time
591  *
592  * @throws InterruptedException
593  */
594 public void waitForRegionServers(MonitoredTask status)
595     throws InterruptedException {
612   while (
613     !this.master.isStopped() &&
614     slept < timeout &&
615     count < maxToStart &&
616     (lastCountChange + interval > now || count < minToStart)
617     ){
{code}
So with the current conditions, the wait will end as soon as the timeout is reached, even if a smaller number of RS have checked in with the Master, and the Master will proceed with the region assignment among these RSes alone. As mentioned in -[HBASE-4993|https://issues.apache.org/jira/browse/HBASE-4993?focusedCommentId=13237196#comment-13237196]-, and I concur, this could have a disastrous effect in a large cluster, especially now that MSLAB is turned on. To
[jira] [Updated] (HBASE-6363) HBaseConfiguration can carry a main method that dumps XML output for debug purposes
[ https://issues.apache.org/jira/browse/HBASE-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shengsheng Huang updated HBASE-6363: Attachment: HBASE-6363.2.patch Updated the patch according to @Harsh's comments. Actually, we did the patch for automation purposes; the master's HTTP /dump page contains much more information than we need. HBaseConfiguration can carry a main method that dumps XML output for debug purposes --- Key: HBASE-6363 URL: https://issues.apache.org/jira/browse/HBASE-6363 Project: HBase Issue Type: Improvement Components: util Affects Versions: 0.94.0 Reporter: Harsh J Priority: Trivial Labels: conf, newbie, noob Attachments: HBASE-6363.2.patch, HBASE-6363.patch Just like the Configuration class carries a main() method in it, that simply loads itself and writes XML out to System.out, HBaseConfiguration can use the same kind of method. That way we can do hbase org.apache.hadoop.….HBaseConfiguration to get an XML dump of things HBaseConfiguration has properly loaded. Nifty in checking app classpaths sometimes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
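The idea in this ticket, a main() that loads the configuration and prints it as XML, can be sketched with plain JDK classes. In the real patch the class would call HBaseConfiguration.create() and Configuration.writeXml(System.out); java.util.Properties stands in here only so the sketch runs without HBase on the classpath:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

// Pure-JDK stand-in for the proposed HBaseConfiguration.main(): load the
// effective key/value settings and dump them as XML to stdout.
public final class ConfDumpSketch {

    static String dumpAsXml(Properties props) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            props.storeToXML(out, "effective configuration", "UTF-8");
            return out.toString("UTF-8");
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Example property; in HBase this would come from hbase-site.xml.
        props.setProperty("hbase.rootdir", "hdfs://localhost:9000/hbase");
        System.out.print(dumpAsXml(props));
    }
}
```

The point of the exercise is the same as in the ticket: whatever the process actually loaded from its classpath is what gets printed, which makes classpath problems visible.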
[jira] [Commented] (HBASE-6325) [replication] Race in ReplicationSourceManager.init can initiate a failover even if the node is alive
[ https://issues.apache.org/jira/browse/HBASE-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418880#comment-13418880 ] Hudson commented on HBASE-6325: --- Integrated in HBase-0.92 #480 (See [https://builds.apache.org/job/HBase-0.92/480/])
HBASE-6319 ReplicationSource can call terminate on itself and deadlock
HBASE-6325 [replication] Race in ReplicationSourceManager.init can initiate a failover even if the node is alive (Revision 1363571)

Result = FAILURE
jdcryans :
Files :
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java

[replication] Race in ReplicationSourceManager.init can initiate a failover even if the node is alive - Key: HBASE-6325 URL: https://issues.apache.org/jira/browse/HBASE-6325 Project: HBase Issue Type: Bug Affects Versions: 0.90.6, 0.92.1, 0.94.0 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Fix For: 0.92.2, 0.96.0, 0.94.1 Attachments: HBASE-6325-0.92-v2.patch, HBASE-6325-0.92.patch Yet another bug found during the leap second madness, it's possible to miss the registration of new region servers so that in ReplicationSourceManager.init we start the failover of a live and replicating region server.
I don't think there's data loss, but the RS that's being failed over will die on:
{noformat}
2012-07-01 06:25:15,604 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server sv4r23s48,10304,1341112194623: Writing replication status
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/sv4r23s48,10304,1341112194623/4/sv4r23s48%2C10304%2C1341112194623.1341112195369
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1246)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:655)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:697)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:470)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:607)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:368)
{noformat}
It seems to me that just refreshing {{otherRegionServers}} after getting the list of {{currentReplicators}} would be enough to fix this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
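The fix described above, re-reading the live-server list after fetching the replicator list, boils down to computing the failover candidates as a set difference. A minimal model with illustrative names (the real logic lives in ReplicationSourceManager.init):

```java
import java.util.HashSet;
import java.util.Set;

// Minimal model of the failover decision: a replicator znode whose owner is
// not in the list of live region servers is considered dead and failed over.
public final class FailoverSketch {

    static Set<String> failoverCandidates(Set<String> currentReplicators,
                                          Set<String> liveRegionServers) {
        Set<String> dead = new HashSet<>(currentReplicators);
        dead.removeAll(liveRegionServers);
        return dead;
    }
}
```

If the live list is read before a newly started server registers, that server appears among the replicators but not in the stale live list and is wrongly treated as dead; refreshing the live list after fetching the replicators makes the difference empty.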
[jira] [Commented] (HBASE-6319) ReplicationSource can call terminate on itself and deadlock
[ https://issues.apache.org/jira/browse/HBASE-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418879#comment-13418879 ] Hudson commented on HBASE-6319: --- Integrated in HBase-0.92 #480 (See [https://builds.apache.org/job/HBase-0.92/480/])
HBASE-6319 ReplicationSource can call terminate on itself and deadlock
HBASE-6325 [replication] Race in ReplicationSourceManager.init can initiate a failover even if the node is alive (Revision 1363571)

Result = FAILURE
jdcryans :
Files :
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java

ReplicationSource can call terminate on itself and deadlock --- Key: HBASE-6319 URL: https://issues.apache.org/jira/browse/HBASE-6319 Project: HBase Issue Type: Bug Affects Versions: 0.90.6, 0.92.1, 0.94.0 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Fix For: 0.92.2, 0.94.1 Attachments: HBASE-6319-0.92.patch In a few places the ReplicationSource code calls terminate() on itself, which is a problem since terminate() waits on that thread to die. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
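The hazard in this ticket is a thread joining itself: terminate() waits for the worker thread to die, so a worker calling its own terminate() waits forever. A common guard, shown here as an illustrative sketch and not the actual HBASE-6319 patch, is to skip the join when terminate() runs on the thread being terminated:

```java
// Minimal model of the deadlock guard: only join the worker thread when the
// caller is a different thread. Joining yourself never returns.
public final class TerminateSketch {

    static boolean shouldJoin(Thread worker) {
        return Thread.currentThread() != worker;
    }
}
```

Under this rule, an external stop request still joins and observes the worker's death, while a self-initiated stop simply sets its stop flag and falls out of its run loop.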
[jira] [Commented] (HBASE-6363) HBaseConfiguration can carry a main method that dumps XML output for debug purposes
[ https://issues.apache.org/jira/browse/HBASE-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418882#comment-13418882 ] Harsh J commented on HBASE-6363: Thanks again Shengsheng. The /dump servlet is more verbose than the simple XML given by the /conf servlet. If it's just config you need, /conf is where you need to go, not /dump. But for the sake of debuggability, suggesting /dump in the javadoc does seem fine to do for HBase. I think the patch looks good. If needed, we can switch /dump with /conf (since we're discussing just configs, not env. info as well), but otherwise I think it accomplishes the goal of this report. Thanks again! -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6417) hbck merges .META. regions if there's an old leftover
[ https://issues.apache.org/jira/browse/HBASE-6417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418887#comment-13418887 ] Jonathan Hsieh commented on HBASE-6417: --- Feels like we could add an option to not do repairs on META unless forced to. hbck merges .META. regions if there's an old leftover - Key: HBASE-6417 URL: https://issues.apache.org/jira/browse/HBASE-6417 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Fix For: 0.96.0, 0.94.2 Attachments: hbck.log Trying to see what caused HBASE-6310, one of the things I figured is that the bad .META. row is actually one from the time that we were permitting meta splitting and that folder had just been staying there for a while. So I tried to recreate the issue with -repair and it merged my good .META. region with the one that's 3 years old that also has the same start key. I ended up with a brand new .META. region! I'll be attaching the full log in a separate file. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6411) Move Master Metrics to metrics 2
[ https://issues.apache.org/jira/browse/HBASE-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418891#comment-13418891 ] Alex Baranau commented on HBASE-6411: - Glanced over your patch. I like this way better (over the initial patch at 4050): exposing the real interface of the MetricsSource (in this case master metrics), i.e. with methods defined, not empty + hashmap.
1. What do you think about having a MasterMetricsFactory available through the compat module (created by CompatibilitySingletonFactory?) which creates the MetricsSource, like this: interface MasterMetricsFactory { MasterMetricsSource create(final String name, final String sessionId); } This way we could pass parameters and control the creation of the metrics source.
2. Independent of the above: how about removing the BaseMetricsSource interface from compat, as we don't really need it with explicit definition of metrics in sources? This way the current BaseMetricsSourceImpl could be renamed to MetricsRegistry and used via composition (as a field) in metrics sources instead of realization. Thus, creation and initialization of the sources, which might be different for each, could stay in the metrics source implementation itself, including deciding on using JvmMetricsSource (I assume not every source should create it), etc. This way they would look like normal metrics sources from the hadoop codebase, just that they would use hbase's MetricsRegistry, which allows metric removals. Thoughts?
Move Master Metrics to metrics 2 Key: HBASE-6411 URL: https://issues.apache.org/jira/browse/HBASE-6411 Project: HBase Issue Type: Sub-task Reporter: Elliott Clark Assignee: Elliott Clark Attachments: HBASE-6411-0.patch, HBASE-6411_concept.patch Move Master Metrics to metrics 2 -- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
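The factory indirection floated in point 1 above can be sketched as follows. None of these types exist in HBase under these names; in the real patch the factory would be looked up through the compat module's CompatibilitySingletonFactory rather than hard-wired:

```java
// Illustrative factory-based wiring for a metrics source.
public final class MetricsFactorySketch {

    interface MasterMetricsSource {
        void incRequests(long n);   // a stand-in metric method
        long getRequests();
    }

    interface MasterMetricsFactory {
        MasterMetricsSource create(String name, String sessionId);
    }

    // Trivial implementation standing in for whatever implementation the
    // compat module would load at runtime.
    static final class SimpleSource implements MasterMetricsSource {
        private long requests;
        @Override public void incRequests(long n) { requests += n; }
        @Override public long getRequests() { return requests; }
    }

    static final MasterMetricsFactory FACTORY =
        (name, sessionId) -> new SimpleSource();
}
```

The point of the factory is that callers pass construction parameters (name, sessionId) without knowing which metrics-system-specific implementation they get back.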
[jira] [Commented] (HBASE-3725) HBase increments from old value after delete and write to disk
[ https://issues.apache.org/jira/browse/HBASE-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418892#comment-13418892 ] ShiXing commented on HBASE-3725: @Ted bq. I generate a region with 3 store files. The increment slows from 1810 tps to 1020 tps, it slows 43.6%. The tps figure is for incrementing the same rowkey. The performance depends on how frequently the memstore is flushed to the store file. If I run the same test case, the latest patch's performance is the same as the original's, because the incremented rowkey is always in the memstore, and we do not need to read the store file. The difference only shows for a rowkey whose value cannot be found in the memstore: it then needs one more read from the memstore, compared to the 0.92 trunk, which reads only from the store file. Note that the original's high performance comes precisely from reading only from the memstore. HBase increments from old value after delete and write to disk -- Key: HBASE-3725 URL: https://issues.apache.org/jira/browse/HBASE-3725 Project: HBase Issue Type: Bug Components: io, regionserver Affects Versions: 0.90.1 Reporter: Nathaniel Cook Assignee: Jonathan Gray Attachments: HBASE-3725-0.92-V1.patch, HBASE-3725-0.92-V2.patch, HBASE-3725-0.92-V3.patch, HBASE-3725-0.92-V4.patch, HBASE-3725-0.92-V5.patch, HBASE-3725-Test-v1.patch, HBASE-3725-v3.patch, HBASE-3725.patch Deleted row values are sometimes used as starting points for new increments. To reproduce:
# Create a row r.
# Set column x to some default value.
# Force hbase to write that value to the file system (such as restarting the cluster).
# Delete the row.
# Call table.incrementColumnValue with some_value
# Get the row.
The returned value in the column was incremented from the old value before the row was deleted instead of being initialized to some_value.
Code to reproduce:
{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseTestIncrement {

  static String tableName = "testIncrement";
  static byte[] infoCF = Bytes.toBytes("info");
  static byte[] rowKey = Bytes.toBytes("test-rowKey");
  static byte[] newInc = Bytes.toBytes("new");
  static byte[] oldInc = Bytes.toBytes("old");

  /**
   * This code reproduces a bug with increment column values in hbase
   * Usage: First run part one by passing '1' as the first arg
   *        Then restart the hbase cluster so it writes everything to disk
   *        Run part two by passing '2' as the first arg
   *
   * This will result in the old deleted data being found and used for the increment calls
   *
   * @param args
   * @throws IOException
   */
  public static void main(String[] args) throws IOException {
    if ("1".equals(args[0]))
      partOne();
    if ("2".equals(args[0]))
      partTwo();
    if ("both".equals(args[0])) {
      partOne();
      partTwo();
    }
  }

  /**
   * Creates a table and increments a column value 10 times by 10 each time.
   * Results in a value of 100 for the column
   *
   * @throws IOException
   */
  static void partOne() throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor tableDesc = new HTableDescriptor(tableName);
    tableDesc.addFamily(new HColumnDescriptor(infoCF));
    if (admin.tableExists(tableName)) {
      admin.disableTable(tableName);
      admin.deleteTable(tableName);
    }
    admin.createTable(tableDesc);
    HTablePool pool = new HTablePool(conf, Integer.MAX_VALUE);
    HTableInterface table = pool.getTable(Bytes.toBytes(tableName));
    // Increment uninitialized column
    for (int j = 0; j < 10; j++) {
      table.incrementColumnValue(rowKey, infoCF, oldInc, (long) 10);
      Increment inc = new
[jira] [Comment Edited] (HBASE-6389) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
[ https://issues.apache.org/jira/browse/HBASE-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418866#comment-13418866 ] Zhihong Ted Yu edited comment on HBASE-6389 at 7/20/12 2:53 AM: I ran test suite with latest patch on trunk and got:
{code}
Failed tests:
  testRunThriftServer[12](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:1 but was:0
  testRunThriftServer[14](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:1 but was:0
  testRunThriftServer[15](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:1 but was:0
  testRunThriftServer[16](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:1 but was:0
  testRunThriftServer[17](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): expected:1 but was:0

Tests in error:
  testRegionCaching(org.apache.hadoop.hbase.client.TestHCM): org.apache.hadoop.hbase.UnknownRegionException: bd992463917ba68fe5389c5bf9e94a3a
  testCloseRegionThatFetchesTheHRIFromMeta(org.apache.hadoop.hbase.client.TestAdmin): -1
  testTableExists(org.apache.hadoop.hbase.catalog.TestMetaReaderEditor): org.apache.hadoop.hbase.TableNotEnabledException: testTableExists
  testRunThriftServer[11](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): test timed out after 6 milliseconds
  testRunThriftServer[13](org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine): test timed out after 6 milliseconds
{code}
There was one hanging test:
{code}
at org.apache.hadoop.hbase.replication.TestReplication.setUp(TestReplication.java:183)
{code}
BTW what do *R*~i~, C and *F*~i~ represent in the formula above ?

was (Author: zhi...@ebaysf.com):
I ran test suite with latest patch on trunk and got:
{code}
Running org.apache.hadoop.hbase.client.TestHCM
Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 37.265 sec  FAILURE!
--
Running org.apache.hadoop.hbase.client.TestAdmin
Tests run: 40, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 322.872 sec  FAILURE!
--
Running org.apache.hadoop.hbase.catalog.TestMetaReaderEditor
Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 134.193 sec  FAILURE!
--
Running org.apache.hadoop.hbase.thrift.TestThriftServerCmdLine
Tests run: 20, Failures: 5, Errors: 2, Skipped: 0, Time elapsed: 669.588 sec  FAILURE!
{code}
There was one hanging test:
{code}
at org.apache.hadoop.hbase.replication.TestReplication.setUp(TestReplication.java:183)
{code}
BTW what do *R*~i~, C and *F*~i~ represent in the formula above ?
[jira] [Commented] (HBASE-6389) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
[ https://issues.apache.org/jira/browse/HBASE-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418895#comment-13418895 ] Hadoop QA commented on HBASE-6389: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12537286/testReplication.jstack against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 12 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2416//console This message is automatically generated. Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments Key: HBASE-6389 URL: https://issues.apache.org/jira/browse/HBASE-6389 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.0, 0.96.0 Reporter: Aditya Kishore Assignee: Aditya Kishore Priority: Critical Fix For: 0.96.0, 0.94.2 Attachments: HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, org.apache.hadoop.hbase.TestZooKeeper-output.txt, testReplication.jstack Continuing from HBASE-6375. It seems I was mistaken in my assumption that changing the value of hbase.master.wait.on.regionservers.mintostart to a sufficient number (from the default of 1) can help prevent assignment of all regions to one (or a small number of) region server(s). While this was the case in 0.90.x and 0.92.x, the behavior has changed from 0.94.0 onwards to address HBASE-4993. From 0.94.0 onwards, Master will proceed as soon as the timeout has lapsed, even if hbase.master.wait.on.regionservers.mintostart has not been reached. Reading the current conditions of waitForRegionServers() clarifies it:
{code:title=ServerManager.java (trunk rev:1360470)}
581   /**
582    * Wait for the region servers to report in.
583    * We will wait until one of this condition is met:
584    *   - the master is stopped
585    *   - the 'hbase.master.wait.on.regionservers.timeout' is reached
586    *   - the 'hbase.master.wait.on.regionservers.maxtostart' number of
587    *     region servers is reached
588    *   - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
589    *     there have been no new region server in for
590    *     'hbase.master.wait.on.regionservers.interval' time
591    *
592    * @throws InterruptedException
593    */
594   public void waitForRegionServers(MonitoredTask status)
595   throws InterruptedException {
612     while (
613       !this.master.isStopped() &&
614       slept < timeout &&
615       count < maxToStart &&
616       (lastCountChange + interval > now || count < minToStart)
617     ){
{code}
So with the current conditions, the wait will end as soon as the timeout is reached even if a smaller number of RSes have checked in with the Master, and the master will proceed with the region assignment among these RSes alone. As mentioned in -[HBASE-4993|https://issues.apache.org/jira/browse/HBASE-4993?focusedCommentId=13237196#comment-13237196]-, and I concur, this could have a disastrous effect in a large cluster, especially now that MSLAB is turned on. To enforce the required quorum as specified by hbase.master.wait.on.regionservers.mintostart irrespective of the timeout, these conditions need to be modified as follows:
{code:title=ServerManager.java}
..
  /**
   * Wait for the region servers to report in.
   * We will wait until one of this condition is met:
   *   - the master is stopped
   *   - the 'hbase.master.wait.on.regionservers.maxtostart' number of
   *     region servers is reached
   *   - the 'hbase.master.wait.on.regionservers.mintostart' is reached AND
   *     there have been no new region server in for
   *     'hbase.master.wait.on.regionservers.interval' time AND
   *     the 'hbase.master.wait.on.regionservers.timeout' is reached
   *
   * @throws InterruptedException
   */
  public void waitForRegionServers(MonitoredTask status)
..
..
    int minToStart = this.master.getConfiguration().
        getInt("hbase.master.wait.on.regionservers.mintostart", 1);
    int maxToStart = this.master.getConfiguration().
        getInt("hbase.master.wait.on.regionservers.maxtostart", Integer.MAX_VALUE);
    if (maxToStart < minToStart) {
      maxToStart = minToStart;
    }
..
..
    while (
      !this.master.isStopped() &&
      count < maxToStart &&
      (lastCountChange + interval > now || timeout > slept || count < minToStart)
    ){
..
{code}
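To make the behavioral difference concrete, the two wait conditions can be compared as pure predicates. This is a sketch with illustrative names, not code from the patch:

```java
// Sketch of the two wait predicates discussed above; a "true" result means
// the master keeps waiting for more region servers. Names are illustrative.
class WaitPredicates {
    // Current (0.94+) behavior: the wait ends as soon as the timeout lapses,
    // regardless of whether minToStart region servers have checked in.
    public static boolean keepWaitingCurrent(boolean stopped, long slept, long timeout,
            int count, int minToStart, int maxToStart,
            long lastCountChange, long interval, long now) {
        return !stopped
                && slept < timeout
                && count < maxToStart
                && (lastCountChange + interval > now || count < minToStart);
    }

    // Proposed behavior: keep waiting until minToStart is reached, even past
    // the timeout; the timeout only gates the "no new RS recently" exit.
    public static boolean keepWaitingProposed(boolean stopped, long slept, long timeout,
            int count, int minToStart, int maxToStart,
            long lastCountChange, long interval, long now) {
        return !stopped
                && count < maxToStart
                && (lastCountChange + interval > now || timeout > slept || count < minToStart);
    }
}
```

With the current predicate, a lapsed timeout ends the wait even when fewer than minToStart servers have reported in; with the proposed one, the loop keeps waiting until the minToStart quorum is satisfied.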
[jira] [Commented] (HBASE-6411) Move Master Metrics to metrics 2
[ https://issues.apache.org/jira/browse/HBASE-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418898#comment-13418898 ] Alex Baranau commented on HBASE-6411: - Looks like you reassigned the task, so I should probably not touch the patch to avoid intersection, right? Was going to add actual metrics tests (which test metrics values changes in addition to testing factories/classes loading) and perhaps apply the 2nd point above, if it makes sense to you. Move Master Metrics to metrics 2 Key: HBASE-6411 URL: https://issues.apache.org/jira/browse/HBASE-6411 Project: HBase Issue Type: Sub-task Reporter: Elliott Clark Assignee: Elliott Clark Attachments: HBASE-6411-0.patch, HBASE-6411_concept.patch Move Master Metrics to metrics 2 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6389) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments
[ https://issues.apache.org/jira/browse/HBASE-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-6389: -- Status: Open (was: Patch Available) Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments Key: HBASE-6389 URL: https://issues.apache.org/jira/browse/HBASE-6389 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.0, 0.96.0 Reporter: Aditya Kishore Assignee: Aditya Kishore Priority: Critical Fix For: 0.96.0, 0.94.2 Attachments: HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, HBASE-6389_trunk.patch, org.apache.hadoop.hbase.TestZooKeeper-output.txt, testReplication.jstack
[jira] [Commented] (HBASE-6363) HBaseConfiguration can carry a main method that dumps XML output for debug purposes
[ https://issues.apache.org/jira/browse/HBASE-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418900#comment-13418900 ] Shengsheng Huang commented on HBASE-6363: - Thanks very much for the clarification, Harsh. It seems /conf was only added to Hadoop in release 0.21 (HADOOP-6408). As we're using Hadoop v1, it didn't work on our local cluster. We would consider adding the HADOOP-6408 patch to our local Hadoop branch. After all, the servlet config dump would contain all the configuration changes made in code. Anyway, do you think it is worth having a separate servlet to dump the configuration as XML only? Or reorganizing the dump output into a more consistent format to make it easier for automatic parsing? HBaseConfiguration can carry a main method that dumps XML output for debug purposes --- Key: HBASE-6363 URL: https://issues.apache.org/jira/browse/HBASE-6363 Project: HBase Issue Type: Improvement Components: util Affects Versions: 0.94.0 Reporter: Harsh J Priority: Trivial Labels: conf, newbie, noob Attachments: HBASE-6363.2.patch, HBASE-6363.patch Just like the Configuration class carries a main() method in it, that simply loads itself and writes XML out to System.out, HBaseConfiguration can use the same kinda method. That way we can do hbase org.apache.hadoop.….HBaseConfiguration to get an XML dump of things HBaseConfiguration has properly loaded. Nifty in checking app classpaths sometimes.
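A minimal sketch of the kind of config-dumping main() being proposed, in the spirit of Hadoop's Configuration.main(). The class and helper names here are invented for illustration; this is not the HBASE-6363 patch:

```java
// Illustrative sketch of a config-dumping main(): render key/value pairs as
// a Hadoop-style <configuration> XML document. Names are invented, not HBase API.
import java.util.Map;
import java.util.TreeMap;

class ConfDump {
    // Sort keys for deterministic output, then emit one <property> per entry.
    public static String toXml(Map<String, String> props) {
        StringBuilder sb = new StringBuilder("<configuration>\n");
        for (Map.Entry<String, String> e : new TreeMap<>(props).entrySet()) {
            sb.append("  <property><name>").append(e.getKey())
              .append("</name><value>").append(e.getValue())
              .append("</value></property>\n");
        }
        return sb.append("</configuration>\n").toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new TreeMap<>();
        props.put("hbase.rootdir", "hdfs://localhost:9000/hbase");
        System.out.print(toXml(props));
    }
}
```

A real implementation would load the properties from HBaseConfiguration.create() instead of a hand-built map, which is exactly what makes it useful for checking what landed on the classpath.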
[jira] [Commented] (HBASE-3725) HBase increments from old value after delete and write to disk
[ https://issues.apache.org/jira/browse/HBASE-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418904#comment-13418904 ] Zhihong Ted Yu commented on HBASE-3725: --- Looking at existing code:
{code}
private List<KeyValue> getLastIncrement(final Get get) throws IOException {
  InternalScan iscan = new InternalScan(get);
{code}
iscan was assigned at the beginning. Looks like the assignment in the else block is redundant. TestHRegion#testIncrementWithFlushAndDelete passed without that assignment. HBase increments from old value after delete and write to disk -- Key: HBASE-3725 URL: https://issues.apache.org/jira/browse/HBASE-3725 Project: HBase Issue Type: Bug Components: io, regionserver Affects Versions: 0.90.1 Reporter: Nathaniel Cook Assignee: Jonathan Gray Attachments: HBASE-3725-0.92-V1.patch, HBASE-3725-0.92-V2.patch, HBASE-3725-0.92-V3.patch, HBASE-3725-0.92-V4.patch, HBASE-3725-0.92-V5.patch, HBASE-3725-Test-v1.patch, HBASE-3725-v3.patch, HBASE-3725.patch Deleted row values are sometimes used as starting points for new increments. To reproduce: Create a row r. Set column x to some default value. Force HBase to write that value to the file system (such as by restarting the cluster). Delete the row. Call table.incrementColumnValue with some_value. Get the row. The returned value in the column was incremented from the old value before the row was deleted instead of being initialized to some_value.
Code to reproduce:
{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseTestIncrement {
  static String tableName = "testIncrement";
  static byte[] infoCF = Bytes.toBytes("info");
  static byte[] rowKey = Bytes.toBytes("test-rowKey");
  static byte[] newInc = Bytes.toBytes("new");
  static byte[] oldInc = Bytes.toBytes("old");

  /**
   * This code reproduces a bug with increment column values in hbase
   * Usage: First run part one by passing '1' as the first arg
   *        Then restart the hbase cluster so it writes everything to disk
   *        Run part two by passing '2' as the first arg
   *
   * This will result in the old deleted data being found and used for the increment calls
   *
   * @param args
   * @throws IOException
   */
  public static void main(String[] args) throws IOException {
    if ("1".equals(args[0]))
      partOne();
    if ("2".equals(args[0]))
      partTwo();
    if ("both".equals(args[0])) {
      partOne();
      partTwo();
    }
  }

  /**
   * Creates a table and increments a column value 10 times by 10 each time.
   * Results in a value of 100 for the column
   *
   * @throws IOException
   */
  static void partOne() throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor tableDesc = new HTableDescriptor(tableName);
    tableDesc.addFamily(new HColumnDescriptor(infoCF));
    if (admin.tableExists(tableName)) {
      admin.disableTable(tableName);
      admin.deleteTable(tableName);
    }
    admin.createTable(tableDesc);
    HTablePool pool = new HTablePool(conf, Integer.MAX_VALUE);
    HTableInterface table = pool.getTable(Bytes.toBytes(tableName));
    // Increment uninitialized column
    for (int j = 0; j < 10; j++) {
      table.incrementColumnValue(rowKey, infoCF, oldInc, (long)10);
      Increment inc = new Increment(rowKey);
      inc.addColumn(infoCF, newInc, (long)10);
      table.increment(inc);
    }
    Get get = new Get(rowKey);
    Result r = table.get(get);
    System.out.println("initial values: new " + Bytes.toLong(r.getValue(infoCF, newInc)) + " old " +
[jira] [Resolved] (HBASE-6345) Utilize fault injection in testing using AspectJ
[ https://issues.apache.org/jira/browse/HBASE-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu resolved HBASE-6345. --- Resolution: Won't Fix There was not enough incentive to pursue fault injection using AspectJ. Utilize fault injection in testing using AspectJ Key: HBASE-6345 URL: https://issues.apache.org/jira/browse/HBASE-6345 Project: HBase Issue Type: Bug Reporter: Zhihong Ted Yu HDFS uses fault injection to test pipeline failure in addition to mock, spy. HBase uses mock, spy. But there are cases where mock, spy aren't convenient. Some example from DFSClientAspects.aj:
{code}
pointcut pipelineInitNonAppend(DataStreamer datastreamer):
  callCreateBlockOutputStream(datastreamer)
  && cflow(execution(* nextBlockOutputStream(..)))
  && within(DataStreamer);

after(DataStreamer datastreamer) returning : pipelineInitNonAppend(datastreamer) {
  LOG.info("FI: after pipelineInitNonAppend: hasError=" + datastreamer.hasError
      + " errorIndex=" + datastreamer.errorIndex);
  if (datastreamer.hasError) {
    DataTransferTest dtTest = DataTransferTestUtil.getDataTransferTest();
    if (dtTest != null)
      dtTest.fiPipelineInitErrorNonAppend.run(datastreamer.errorIndex);
  }
}
{code}
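For contrast, the kind of fault injection AspectJ weaves in at a join point can be approximated in plain Java with a wrapper around the call site. This is only an illustrative sketch; all names are invented and none of this is HDFS or HBase code:

```java
// Illustrative contrast to the AspectJ approach: injecting a pipeline fault
// at a call site via a wrapper interface, in plain Java. Names are invented.
import java.io.IOException;

interface BlockOutputStreamFactory {
    void createBlockOutputStream(int blockIndex) throws IOException;
}

class FaultInjectingFactory implements BlockOutputStreamFactory {
    private final int failAtIndex;

    FaultInjectingFactory(int failAtIndex) {
        this.failAtIndex = failAtIndex;
    }

    @Override
    public void createBlockOutputStream(int blockIndex) throws IOException {
        if (blockIndex == failAtIndex) {
            // Simulated pipeline failure, analogous to the aspect's FI hook.
            throw new IOException("FI: injected pipeline failure at block " + blockIndex);
        }
        // A real implementation would open the stream here.
    }
}
```

The trade-off the issue alludes to: the wrapper approach requires the production code to accept the injectable interface, whereas AspectJ can intercept calls the code never exposed for testing.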
[jira] [Commented] (HBASE-6363) HBaseConfiguration can carry a main method that dumps XML output for debug purposes
[ https://issues.apache.org/jira/browse/HBASE-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418917#comment-13418917 ] Harsh J commented on HBASE-6363: Sorry, I didn't notice 1.x didn't have it! (I checked only against my 2.x installation, and CDH3 here seems to have had it backported at some point too). Instead of working around it, I think we can rather backport it to a future v1 release, via: HADOOP-8567. HBaseConfiguration can carry a main method that dumps XML output for debug purposes --- Key: HBASE-6363 URL: https://issues.apache.org/jira/browse/HBASE-6363 Project: HBase Issue Type: Improvement Components: util Affects Versions: 0.94.0 Reporter: Harsh J Priority: Trivial Labels: conf, newbie, noob Attachments: HBASE-6363.2.patch, HBASE-6363.patch
[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover
[ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418919#comment-13418919 ] nkeywal commented on HBASE-5843: bq. I'm confused as to what the 180s gap refers to. I see 980 (test 2) - 800 (test1) = 180, but that is against 0.92, which doesn't have HBASE-5970, right? Could you clarify? Yes, it's because with a clean stop, the RS unregisters itself in ZK, so the recovery starts immediately. With a kill -9, the RS remains registered in ZK. So if you don't have HBASE-5844 or HBASE-5926, you wait for the ZK timeout. bq. Awesome.. We think this is also due to HBASE-5970 and HBASE-6109? Yes. bq. Has a JIRA been filed? Not yet. I'm writing specific unit tests for this, I found issues that I have not yet fully analyzed, and I need to create the jiras. Also, maybe my test was not good for this part: as I was doing the test without a datanode, it could be that the recovery was not working for this reason (I wonder if the sync works with the local file system, for example). bq. Test to be changed to get a real difference when we need to replay the wal. bq. Could you clarify what you mean here? It does not last long enough, so I won't be able to see much difference even if there is one. So I need to redo the work with a real datanode, check that it recovers, then check that I measure something meaningful. I will also redo the first tests with a DN to see if there is still a gap. Improve HBase MTTR - Mean Time To Recover - Key: HBASE-5843 URL: https://issues.apache.org/jira/browse/HBASE-5843 Project: HBase Issue Type: Umbrella Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal A part of the approach is described here: https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit The ideal target is: - failures impact client applications only by an added delay to execute a query, whatever the failure. - this delay is always under 1 second. 
We're not going to achieve that immediately... Priority will be given to the most frequent issues. Short term: - software crash - standard administrative tasks such as stop/start of a cluster.
[jira] [Created] (HBASE-6432) HRegionServer doesn't properly set clusterId in conf
Francis Liu created HBASE-6432: -- Summary: HRegionServer doesn't properly set clusterId in conf Key: HBASE-6432 URL: https://issues.apache.org/jira/browse/HBASE-6432 Project: HBase Issue Type: Bug Affects Versions: 0.94.0 Reporter: Francis Liu Assignee: Francis Liu Fix For: 0.96.0 ClusterId is normally set into the passed conf during instantiation of an HTable class. In the case of an HRegionServer this is bypassed and set to default, since getMaster() uses HBaseRPC to create the proxy directly and bypasses the class which retrieves and sets the correct clusterId. This becomes a problem for clients (ie within a coprocessor) using delegation tokens for authentication, since the token's service will be the correct clusterId while the TokenSelector is looking for one with service default.
[jira] [Updated] (HBASE-6432) HRegionServer doesn't properly set clusterId in conf
[ https://issues.apache.org/jira/browse/HBASE-6432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francis Liu updated HBASE-6432: --- Attachment: HBASE-6432_94.patch A patch for 0.94 to get feedback on the approach. Things changed significantly enough in trunk to need a separate patch. I'm hoping to get this backported to 0.94 since it is needed for security. HRegionServer doesn't properly set clusterId in conf Key: HBASE-6432 URL: https://issues.apache.org/jira/browse/HBASE-6432 Project: HBase Issue Type: Bug Affects Versions: 0.94.0 Reporter: Francis Liu Assignee: Francis Liu Fix For: 0.96.0 Attachments: HBASE-6432_94.patch
[jira] [Commented] (HBASE-6427) Pluggable policy for smallestReadPoint in HRegion
[ https://issues.apache.org/jira/browse/HBASE-6427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418922#comment-13418922 ] Lars Hofhansl commented on HBASE-6427: -- Let me clarify what I mean by this: If I wanted to implement an MVCC-based optimistic transaction engine on top of HBase, I would naturally want to use HBase's built-in versioning (where possible). In that case it is not clear a priori how many versions to keep or for how long (i.e. specifying VERSIONS/TTL is too static). The outside engine would need to determine that. The simplest of all approaches would be to do that via the smallestReadPoint in each region, by making its determination pluggable. Pluggable policy for smallestReadPoint in HRegion - Key: HBASE-6427 URL: https://issues.apache.org/jira/browse/HBASE-6427 Project: HBase Issue Type: New Feature Reporter: Lars Hofhansl Priority: Minor When implementing higher-level stores on top of HBase it is necessary to allow dynamic control over how long KVs must be kept around. Semi-static config options for ColumnFamilies (# of versions or TTL) are not sufficient. The simplest way to achieve this is to have a pluggable class to determine the smallestReadPoint for a Region. That way outside code can control which KVs to retain.
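One possible shape for such a plug-in point, as a sketch only: the interface and class names below are invented for illustration and are not HBase API:

```java
// Hypothetical sketch of a pluggable smallest-read-point policy.
// Interface and names are invented for illustration, not HBase API.
interface SmallestReadPointPolicy {
    // Return the smallest MVCC read point that must be preserved;
    // KVs older than this point become eligible for collection.
    long smallestReadPoint(long regionDefault);
}

class TransactionAwarePolicy implements SmallestReadPointPolicy {
    private final long oldestActiveTransaction;

    TransactionAwarePolicy(long oldestActiveTransaction) {
        this.oldestActiveTransaction = oldestActiveTransaction;
    }

    @Override
    public long smallestReadPoint(long regionDefault) {
        // An external transaction engine pins the read point to the oldest
        // transaction still running, so versions it may still read survive.
        return Math.min(regionDefault, oldestActiveTransaction);
    }
}
```

The design point is that the region consults the policy instead of computing the read point solely from its own scanners, letting the outside engine stretch retention dynamically rather than through static VERSIONS/TTL settings.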
[jira] [Commented] (HBASE-6411) Move Master Metrics to metrics 2
[ https://issues.apache.org/jira/browse/HBASE-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418923#comment-13418923 ] Elliott Clark commented on HBASE-6411: -- Sorry, I didn't mean to re-assign. I must have done that when submitting to Hadoop QA. Sorry, I didn't mean to step on any toes. I agree that a metrics factory or something like it could be very useful. However, like I said above, I was hoping to take a crack at using Guice to do most of the factory stuff. Still, maybe until I get that up it would be useful. On #2, I don't think removing the interface completely is really the way to go, since both the replication metrics and the region server metrics are mostly dynamic metrics; ie they aren't pre-created like the master metrics. I think it still makes sense to have a source that's mostly focused on those map-based metrics. Move Master Metrics to metrics 2 Key: HBASE-6411 URL: https://issues.apache.org/jira/browse/HBASE-6411 Project: HBase Issue Type: Sub-task Reporter: Elliott Clark Assignee: Elliott Clark Attachments: HBASE-6411-0.patch, HBASE-6411_concept.patch Move Master Metrics to metrics 2
[jira] [Commented] (HBASE-6428) Pluggable Compaction policies
[ https://issues.apache.org/jira/browse/HBASE-6428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418924#comment-13418924 ] Lars Hofhansl commented on HBASE-6428: -- Another way of looking at this is a possible policy that considers all HFiles in terms of a baseline plus changes on top of that baseline. (For the record: I am not saying that I will do this any time soon, just recording this as an idea.) Pluggable Compaction policies - Key: HBASE-6428 URL: https://issues.apache.org/jira/browse/HBASE-6428 Project: HBase Issue Type: New Feature Reporter: Lars Hofhansl For some use cases it is useful to allow more control over how KVs get compacted. For example, one could envision storing old versions of a KV in separate HFiles, which then rarely have to be touched/cached by queries querying for new data. In addition, these date-ranged HFiles can easily be used for backups while maintaining historical data. This would be a major change, allowing compactions to provide multiple targets (not just a filter).
[jira] [Updated] (HBASE-6406) TestReplicationPeer.testResetZooKeeperSession and TestZooKeeper.testClientSessionExpired fail frequently
[ https://issues.apache.org/jira/browse/HBASE-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars Hofhansl updated HBASE-6406: - Fix Version/s: (was: 0.94.2) 0.94.1 0.96.0 TestReplicationPeer.testResetZooKeeperSession and TestZooKeeper.testClientSessionExpired fail frequently Key: HBASE-6406 URL: https://issues.apache.org/jira/browse/HBASE-6406 Project: HBase Issue Type: Bug Affects Versions: 0.94.1 Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 0.96.0, 0.94.1 Attachments: 6406.txt, testReplication.jstack, testZooKeeper.jstack Looking back through the 0.94 test runs, these two tests accounted for 11 of the 34 failed tests. They should be fixed or (temporarily) disabled.
[jira] [Updated] (HBASE-5498) Secure Bulk Load
[ https://issues.apache.org/jira/browse/HBASE-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francis Liu updated HBASE-5498: --- Attachment: HBASE-5498_draft_94.patch Laxman, here's a working patch. It incorporates HBASE-6432, which took some time to debug. I still have to address the other comments, some cleanup and TODOs. Let me know if this works for you. Secure Bulk Load Key: HBASE-5498 URL: https://issues.apache.org/jira/browse/HBASE-5498 Project: HBase Issue Type: Improvement Components: mapred, security Reporter: Francis Liu Assignee: Francis Liu Fix For: 0.96.0 Attachments: HBASE-5498_draft.patch, HBASE-5498_draft_94.patch Design doc: https://cwiki.apache.org/confluence/display/HCATALOG/HBase+Secure+Bulk+Load Short summary: Security as it stands does not cover the bulkLoadHFiles() feature. Users calling this method will bypass ACLs. Also, loading is made more cumbersome in a secure setting because of hdfs privileges. bulkLoadHFiles() moves the data from the user's directory to the hbase directory, which would require certain write access privileges to be set. Our solution is to create a coprocessor which makes use of AuthManager to verify whether a user has write access to the table. If so, it launches an MR job as the hbase user to do the importing (ie rewrite from text to HFiles). One tricky part is that this job will have to impersonate the calling user when reading the input files. We can do this by expecting the user to pass an hdfs delegation token as part of the secureBulkLoad() coprocessor call and extending an inputformat to make use of that token. The output is written to a temporary directory accessible only by hbase and then bulkloadHFiles() is called.
[jira] [Commented] (HBASE-6431) Some FilterList Constructors break addFilter
[ https://issues.apache.org/jira/browse/HBASE-6431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13418931#comment-13418931 ] Hadoop QA commented on HBASE-6431: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12537269/0001-HBASE-6431.-Some-FilterList-Constructors-break-addFi.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings). -1 findbugs. The patch appears to introduce 12 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2417//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2417//console This message is automatically generated. 
Some FilterList Constructors break addFilter Key: HBASE-6431 URL: https://issues.apache.org/jira/browse/HBASE-6431 Project: HBase Issue Type: Bug Components: filters Affects Versions: 0.92.1, 0.94.0 Reporter: Alex Newman Assignee: Alex Newman Priority: Minor Attachments: 0001-HBASE-6431.-Some-FilterList-Constructors-break-addFi.patch Some of the constructors for FilterList set the internal list of filters to list types which don't support the add operation. As a result, FilterList(final List<Filter> rowFilters), FilterList(final Filter... rowFilters), FilterList(final Operator operator, final List<Filter> rowFilters) and FilterList(final Operator operator, final Filter... rowFilters) may init private List<Filter> filters = new ArrayList<Filter>(); incorrectly.
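The failure mode can be reproduced in miniature: a fixed-size list view such as the one returned by Arrays.asList rejects add(), while a defensive ArrayList copy accepts it. This standalone demo illustrates the bug class; it is not the FilterList code itself:

```java
// Illustrates the failure mode described above: keeping a fixed-size list
// view breaks a later add(); copying into an ArrayList does not.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FixedSizeListDemo {
    // Returns true if an element can be appended to the given list.
    public static boolean canAdd(List<String> filters) {
        try {
            filters.add("extra-filter");
            return true;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> fixed = Arrays.asList("a", "b");   // fixed-size view over an array
        List<String> copy = new ArrayList<>(fixed);     // defensive, growable copy
        System.out.println(canAdd(fixed));  // false
        System.out.println(canAdd(copy));   // true
    }
}
```

The defensive-copy pattern is the usual fix: a constructor that stores caller-supplied lists should copy them into a mutable implementation before later mutators like addFilter rely on them.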
[jira] [Updated] (HBASE-6432) HRegionServer doesn't properly set clusterId in conf
[ https://issues.apache.org/jira/browse/HBASE-6432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francis Liu updated HBASE-6432: --- Description: ClusterId is normally set into the passed conf during instantiation of an HTable class. In the case of an HRegionServer this is bypassed and the clusterId is left at "default", because getMaster() uses HBaseRPC to create the proxy directly and bypasses the class which retrieves and sets the correct clusterId. This becomes a problem for clients (i.e. within a coprocessor) using delegation tokens for authentication: the token's service will be the correct clusterId, while the TokenSelector is looking for one with service "default". was: ClusterId is normally set into the passed conf during instantiation of an HTable class. In the case of a HRegionServer this is bypassed and set to default since getMaster() bypasses the class which sets clusterId since it uses HBaseRPC to create the proxy directly. This becomes a problem with clients (ie within a coprocessor) using delegation tokens for authentication. Since the token's service will be the correct clusterId and while the TokenSelector is looking for one with service default. HRegionServer doesn't properly set clusterId in conf Key: HBASE-6432 URL: https://issues.apache.org/jira/browse/HBASE-6432 Project: HBase Issue Type: Bug Affects Versions: 0.94.0 Reporter: Francis Liu Assignee: Francis Liu Fix For: 0.96.0 Attachments: HBASE-6432_94.patch ClusterId is normally set into the passed conf during instantiation of an HTable class. In the case of an HRegionServer this is bypassed and the clusterId is left at "default", because getMaster() uses HBaseRPC to create the proxy directly and bypasses the class which retrieves and sets the correct clusterId. This becomes a problem for clients (i.e. within a coprocessor) using delegation tokens for authentication: the token's service will be the correct clusterId, while the TokenSelector is looking for one with service "default".
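The mismatch can be modeled with a trivial lookup: a token is selected by its service string, so a token issued against the real clusterId is invisible to a selector that asks for "default". This is a hypothetical stand-in model, not the actual Hadoop TokenSelector API; the cluster id string is made up.

```java
import java.util.Map;

public class TokenSelectorSketch {
    // Hypothetical model of token selection: tokens are keyed by their
    // "service" string, and the selector looks up exactly one service name.
    static String selectToken(String wantedService, Map<String, String> tokensByService) {
        return tokensByService.get(wantedService);
    }

    public static void main(String[] args) {
        // The delegation token was issued against the real clusterId...
        Map<String, String> tokens = Map.of("real-cluster-id", "delegation-token");
        System.out.println(selectToken("real-cluster-id", tokens));
        // ...but a conf that never had clusterId set asks for "default",
        // so no token is found and authentication fails.
        System.out.println(selectToken("default", tokens));
    }
}
```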
[jira] [Commented] (HBASE-3725) HBase increments from old value after delete and write to disk
[ https://issues.apache.org/jira/browse/HBASE-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418933#comment-13418933 ] ShiXing commented on HBASE-3725: @Ted, the reassignment is there because no interface exists to set the iscan back to scanning both the memstore and the filestore; at the beginning, the iscan is restricted to the memstore only:
{code}
// memstore scan
iscan.checkOnlyMemStore();
{code}
HBase increments from old value after delete and write to disk -- Key: HBASE-3725 URL: https://issues.apache.org/jira/browse/HBASE-3725 Project: HBase Issue Type: Bug Components: io, regionserver Affects Versions: 0.90.1 Reporter: Nathaniel Cook Assignee: Jonathan Gray Attachments: HBASE-3725-0.92-V1.patch, HBASE-3725-0.92-V2.patch, HBASE-3725-0.92-V3.patch, HBASE-3725-0.92-V4.patch, HBASE-3725-0.92-V5.patch, HBASE-3725-Test-v1.patch, HBASE-3725-v3.patch, HBASE-3725.patch Deleted row values are sometimes used as starting points for new increments. To reproduce: Create a row r. Set column x to some default value. Force HBase to write that value to the file system (e.g. by restarting the cluster). Delete the row. Call table.incrementColumnValue with some_value. Get the row. The returned value in the column was incremented from the old value before the row was deleted, instead of being initialized to some_value.
Code to reproduce:
{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseTestIncrement {

    static String tableName = "testIncrement";
    static byte[] infoCF = Bytes.toBytes("info");
    static byte[] rowKey = Bytes.toBytes("test-rowKey");
    static byte[] newInc = Bytes.toBytes("new");
    static byte[] oldInc = Bytes.toBytes("old");

    /**
     * This code reproduces a bug with increment column values in hbase
     * Usage: First run part one by passing '1' as the first arg
     *        Then restart the hbase cluster so it writes everything to disk
     *        Run part two by passing '2' as the first arg
     *
     * This will result in the old deleted data being found and used for the increment calls
     *
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        if ("1".equals(args[0]))
            partOne();
        if ("2".equals(args[0]))
            partTwo();
        if ("both".equals(args[0])) {
            partOne();
            partTwo();
        }
    }

    /**
     * Creates a table and increments a column value 10 times by 10 each time.
     * Results in a value of 100 for the column
     *
     * @throws IOException
     */
    static void partOne() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor tableDesc = new HTableDescriptor(tableName);
        tableDesc.addFamily(new HColumnDescriptor(infoCF));
        if (admin.tableExists(tableName)) {
            admin.disableTable(tableName);
            admin.deleteTable(tableName);
        }
        admin.createTable(tableDesc);

        HTablePool pool = new HTablePool(conf, Integer.MAX_VALUE);
        HTableInterface table = pool.getTable(Bytes.toBytes(tableName));

        // Increment uninitialized column
        for (int j = 0; j < 10; j++) {
            table.incrementColumnValue(rowKey, infoCF, oldInc, (long) 10);
            Increment inc = new Increment(rowKey);
            inc.addColumn(infoCF, newInc, (long) 10);
            table.increment(inc);
        }

        Get get = new Get(rowKey);
        Result r = table.get(get);
        System.out.println("initial values: new " + Bytes.toLong(r.getValue(infoCF, newInc))
                + " old " + Bytes.toLong(r.getValue(infoCF, oldInc)));
    }

    /**
     * First deletes the data then increments the column 10 times by 1
{code}
[jira] [Commented] (HBASE-6406) TestReplicationPeer.testResetZooKeeperSession and TestZooKeeper.testClientSessionExpired fail frequently
[ https://issues.apache.org/jira/browse/HBASE-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418936#comment-13418936 ] Lars Hofhansl commented on HBASE-6406: -- TestZooKeeper.testClientSessionExpired failed again in the latest 0.94 build. Although this is not obvious from the logs, the pattern in the code is the same as in TestReplicationPeer. My initial suspicion was RecoverableZooKeeper, and that it somehow retries the operation and thereby reconnects the expired session. According to the code it does not do that, though. Somehow HBaseTestingUtil.expireSession is subject to a race. In the case of TestReplicationPeer that happened when expireSession was called before the connection was actually established. Is there a way to check whether the connection was established first and wait if it wasn't? Otherwise, I'd say we disable this test for now. TestReplicationPeer.testResetZooKeeperSession and TestZooKeeper.testClientSessionExpired fail frequently Key: HBASE-6406 URL: https://issues.apache.org/jira/browse/HBASE-6406 Project: HBase Issue Type: Bug Affects Versions: 0.94.1 Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 0.96.0, 0.94.1 Attachments: 6406.txt, testReplication.jstack, testZooKeeper.jstack Looking back through the 0.94 test runs, these two tests accounted for 11 of 34 failed tests. They should be fixed or (temporarily) disabled.
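One way to implement the wait the comment asks about is a bounded poll on a connection predicate before calling expireSession; with a real ZooKeeper handle the predicate would be something like () -> zk.getState() == States.CONNECTED. The helper below is a self-contained sketch of that idea, not HBase test code, so the simulated predicate in main is made up.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class AwaitConnection {
    // Polls a connection predicate until it holds or the timeout elapses.
    // Returns true if the connection was observed within the timeout.
    static boolean awaitConnected(BooleanSupplier isConnected, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (isConnected.getAsBoolean()) {
                return true;
            }
            TimeUnit.MILLISECONDS.sleep(50); // back off between checks
        }
        return isConnected.getAsBoolean(); // one final check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Simulated connection that becomes live after roughly 200 ms.
        boolean ok = awaitConnected(() -> System.currentTimeMillis() - start > 200, 2000);
        System.out.println("connected=" + ok);
    }
}
```

Calling such a helper before expireSession would close the window where the session is expired before it ever connects.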
[jira] [Updated] (HBASE-6429) Filter with filterRow() returning true is also incompatible with scan with limit
[ https://issues.apache.org/jira/browse/HBASE-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Huang updated HBASE-6429: - Attachment: hbase-6429-trunk.patch 1. Prepare a patch against trunk 2. Add one more unit test case (TestFilterWithScanLimits) 3. Fix 2 unit test failures in the previous version. Filter with filterRow() returning true is also incompatible with scan with limit Key: HBASE-6429 URL: https://issues.apache.org/jira/browse/HBASE-6429 Project: HBase Issue Type: Bug Components: filters Affects Versions: 0.96.0 Reporter: Jason Dai Attachments: hbase-6429-trunk.patch, hbase-6429_0_94_0.patch Currently, if we scan with both a limit and a Filter with filterRow(List<KeyValue>) implemented, an IncompatibleFilterException will be thrown. The same exception should also be thrown if the filter has its filterRow() implemented.
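The validation the issue argues for can be sketched as a single check: reject a batched scan whenever the filter does row-level filtering, whichever filterRow overload it implements. The interfaces below are hypothetical stand-ins, not HBase classes; in HBase the corresponding signal is Filter.hasFilterRow().

```java
public class ScanLimitCheck {
    // Hypothetical stand-in for HBase's Filter: reports whether the filter
    // overrides either filterRow() or filterRow(List<KeyValue>).
    interface RowFilter {
        boolean hasFilterRow();
    }

    static class IncompatibleFilterException extends RuntimeException {
        IncompatibleFilterException(String msg) { super(msg); }
    }

    // The consistency check: a scan batch limit cannot be combined with a
    // filter that filters whole rows, because partial rows would be handed
    // to the filter.
    static void validate(int batchLimit, RowFilter filter) {
        if (batchLimit > 0 && filter != null && filter.hasFilterRow()) {
            throw new IncompatibleFilterException(
                "Cannot set batch on a scan using a filter that filters entire rows");
        }
    }

    public static void main(String[] args) {
        try {
            validate(10, () -> true); // row-filtering filter plus limit: rejected
        } catch (IncompatibleFilterException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        validate(10, () -> false); // fine: filter does no row-level filtering
        validate(0, () -> true);   // fine: no batch limit set
        System.out.println("ok");
    }
}
```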
[jira] [Commented] (HBASE-6363) HBaseConfiguration can carry a main method that dumps XML output for debug purposes
[ https://issues.apache.org/jira/browse/HBASE-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418940#comment-13418940 ] Shengsheng Huang commented on HBASE-6363: - Seems reasonable. My only slight concern is package dependency, because some of our customers are very reluctant to upgrade their stable hadoop deployment. A standalone patch is good to have. HBaseConfiguration can carry a main method that dumps XML output for debug purposes --- Key: HBASE-6363 URL: https://issues.apache.org/jira/browse/HBASE-6363 Project: HBase Issue Type: Improvement Components: util Affects Versions: 0.94.0 Reporter: Harsh J Priority: Trivial Labels: conf, newbie, noob Attachments: HBASE-6363.2.patch, HBASE-6363.patch Just like the Configuration class carries a main() method in it that simply loads itself and writes XML out to System.out, HBaseConfiguration can use the same kind of method. That way we can do hbase org.apache.hadoop.….HBaseConfiguration to get an XML dump of things HBaseConfiguration has properly loaded. Nifty in checking app classpaths sometimes.
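The proposed main() amounts to "load the effective configuration, serialize it as XML to stdout"; Hadoop's Configuration offers writeXml(OutputStream) for exactly this. The sketch below uses java.util.Properties.storeToXML as a stdlib stand-in so it runs without Hadoop on the classpath; the two property values are made-up examples, not defaults.

```java
import java.io.IOException;
import java.util.Properties;

public class ConfDump {
    // Analogue of the proposed HBaseConfiguration.main(): gather the
    // effective settings and dump them as XML to stdout. With Hadoop
    // available this would be HBaseConfiguration.create().writeXml(System.out).
    public static void main(String[] args) throws IOException {
        Properties conf = new Properties();
        conf.setProperty("hbase.rootdir", "hdfs://localhost:9000/hbase");
        conf.setProperty("hbase.cluster.distributed", "true");
        conf.storeToXML(System.out, "effective configuration dump");
    }
}
```

Run from the command line, this makes it easy to confirm which values actually won after all classpath resources were merged.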