[jira] [Commented] (HBASE-6134) Improvement for split-worker to speed up distributed log splitting
[ https://issues.apache.org/jira/browse/HBASE-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406299#comment-13406299 ] stack commented on HBASE-6134: -- Ok. Then it was a figment of my imagination that you had. You fellas fix so much, it seemed possible. Improvement for split-worker to speed up distributed log splitting -- Key: HBASE-6134 URL: https://issues.apache.org/jira/browse/HBASE-6134 Project: HBase Issue Type: Improvement Components: wal Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.96.0 Attachments: 6134v4.patch, HBASE-6134.patch, HBASE-6134v2.patch, HBASE-6134v3-92.patch, HBASE-6134v3.patch, HBASE-6134v4-94.patch, HBASE-6134v4.patch First, we ran a comparison between local-master-splitting and distributed-log-splitting. Environment: 34 hlog files, 5 regionservers (after killing one, only 4 RSs do the splitting work), 400 regions in one hlog file. local-master-split: 60s+ distributed-log-splitting: 165s+ In fact, in our production environment, distributed-log-splitting also took 60s with 30 regionservers for 34 hlog files (the regionservers may have been under high load). We found the split-worker took about 20s to split one log file (30ms~50ms per writer.close(); 10ms per writer creation). I think we could improve this by parallelizing the creation and closing of writers across threads. The patch changes the distributed-log-splitting logic to match local-master-splitting and parallelizes the close calls in threads. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
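The parallel-close idea can be sketched with a plain java.util.concurrent pool. This is a minimal illustration only; the class `ParallelClose` and its signature are hypothetical, not the actual patch:

```java
import java.io.Closeable;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: close many WAL writers concurrently instead of serially.
public class ParallelClose {
    // Returns the number of writers closed successfully.
    public static int closeAll(List<? extends Closeable> writers, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CompletionService<Boolean> cs = new ExecutorCompletionService<>(pool);
        for (Closeable w : writers) {
            // Each close() (30ms~50ms in the tests above) now runs in parallel.
            cs.submit(() -> { w.close(); return true; });
        }
        int closed = 0;
        for (int i = 0; i < writers.size(); i++) {
            try {
                if (cs.take().get()) closed++;
            } catch (ExecutionException e) {
                // One failed close should not block the rest; log and continue.
            }
        }
        pool.shutdown();
        return closed;
    }
}
```

With N writers at 30ms~50ms per close, the serial cost of N closes drops to roughly ceil(N/threads) close latencies.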
[jira] [Commented] (HBASE-6283) [region_mover.rb] Add option to exclude list of hosts on unload instead of just assuming the source node.
[ https://issues.apache.org/jira/browse/HBASE-6283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406301#comment-13406301 ] stack commented on HBASE-6283: -- bq. Thanks for the pointer to Aravind's work – this is the first I've seen the blog. Have we encouraged Aravind to contribute his work? He has contrib'd the non-SU stuff: i.e. the bit where you can register in ZK which regionservers are being rolled. [region_mover.rb] Add option to exclude list of hosts on unload instead of just assuming the source node. - Key: HBASE-6283 URL: https://issues.apache.org/jira/browse/HBASE-6283 Project: HBase Issue Type: Improvement Reporter: Jonathan Hsieh Assignee: Jonathan Hsieh Labels: jruby Attachments: hbase-6283.patch Currently, the region_mover.rb script excludes a single host, the host offloading data, as a region move target. This essentially limits the number of machines that can be shut down at a time to one. For larger clusters, it is manageable to have several nodes down at a time, and it is desirable to get this process done more quickly. The proposed patch adds an exclude-file option that allows multiple hosts to be excluded as targets.
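The real script is JRuby, but the exclude-file idea reduces to a small filter over the candidate move targets. This Java sketch (hypothetical `MoveTargets` helper, assuming the usual `host,port,startcode` server-name convention) shows the intent:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: drop every candidate server whose hostname appears in
// the exclude set, rather than excluding only the single source host.
public class MoveTargets {
    public static List<String> filterTargets(List<String> servers, Set<String> excludedHosts) {
        List<String> targets = new ArrayList<>();
        for (String s : servers) {
            // Server names follow the "host,port,startcode" convention.
            String host = s.split(",", 2)[0];
            if (!excludedHosts.contains(host)) {
                targets.add(s);
            }
        }
        return targets;
    }
}
```

The exclude set would be populated from the file named by the new option, one hostname per line, letting several nodes be taken down in parallel.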
[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406305#comment-13406305 ] stack commented on HBASE-6299: -- It looks to me like we have the same issue in trunk. Your suggested fix looks right, Maryann. Put up a patch and I'll have a go at making a unit test for it. RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems. - Key: HBASE-6299 URL: https://issues.apache.org/jira/browse/HBASE-6299 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.6, 0.94.0 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Critical Attachments: HBASE-6299.patch 1. HMaster tries to assign a region to an RS. 2. HMaster creates a RegionState for this region and puts it into regionsInTransition. 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts to proceed, eventually with success. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out. 4. HMaster attempts to assign a second time, choosing another RS. 5. But since HMaster's OpenedRegionHandler has already been triggered by the region open on the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster considers the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt invalid and ignores it. 6. The unassigned ZK node stays, and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created.
{code} 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.; plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-004,60020,1340890123243, dest=swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:28,882 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,291 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. 
from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007 Deleting existing unassigned node for b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007 Successfully deleted unassigned node for region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has opened the region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. that was online on serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301) 2012-06-29 07:07:41,140 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, regions=575, usedHeap=0, maxHeap=0), trying
[jira] [Commented] (HBASE-6326) Nested retry loops in HConnectionManager
[ https://issues.apache.org/jira/browse/HBASE-6326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406308#comment-13406308 ] stack commented on HBASE-6326: -- +1 Simple but ugly. Good enough for a 0.94.1. Nested retry loops in HConnectionManager Key: HBASE-6326 URL: https://issues.apache.org/jira/browse/HBASE-6326 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Priority: Critical Fix For: 0.94.1 Attachments: 6326.txt While testing client timeouts when HBase is not available, we found that even with aggressive settings it takes the client 10 minutes or more to finally receive an exception. Part of this is due to nested retry loops in locateRegion. locateRegion will first try to locate the table in META (which is retried), then it will try to locate the META table in ROOT (which is also retried). So for each retry of the META lookup we retry the ROOT lookup as well. I have a patch that avoids locateRegion retrying if it is called from code that already has a retry loop.
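The fix idea can be sketched with hypothetical names (this is not the actual HConnectionManager API): the inner lookup takes a flag saying whether the caller is already inside a retry loop, collapsing the multiplicative cost:

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: an operation that honors a "retrying" flag so a caller
// that already retries can disable the inner retry loop.
public class RetryOnce {
    static <T> T withRetries(Callable<T> op, int attempts, boolean retrying)
            throws Exception {
        // If the caller already retries, attempt the operation only once here.
        int tries = retrying ? attempts : 1;
        Exception last = null;
        for (int i = 0; i < tries; i++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;  // a real client would also pause between attempts
            }
        }
        throw last;
    }
}
```

With nested loops, a dead cluster costs roughly metaRetries × rootRetries attempts before the client sees an exception; passing retrying=false to the inner lookup makes it metaRetries × 1.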
[jira] [Commented] (HBASE-6309) [MTTR] Do NN operations outside of the ZK EventThread in SplitLogManager
[ https://issues.apache.org/jira/browse/HBASE-6309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406310#comment-13406310 ] stack commented on HBASE-6309: -- @Chunhui What about the case where we fail a log splitting... how would the cleanup go? If it goes into a tmp dir, it's easy to remove the tmp dir. (Otherwise, sounds like a fine idea.) [MTTR] Do NN operations outside of the ZK EventThread in SplitLogManager Key: HBASE-6309 URL: https://issues.apache.org/jira/browse/HBASE-6309 Project: HBase Issue Type: Improvement Affects Versions: 0.92.1, 0.94.0, 0.96.0 Reporter: Jean-Daniel Cryans Priority: Critical Fix For: 0.96.0 We found this issue during the leap second cataclysm which prompted a distributed splitting of all our logs. I saw that none of the RS were splitting after some time while the master was showing that it wasn't even 30% done. jstack'ing I saw this: {noformat} main-EventThread daemon prio=10 tid=0x7f6ce46d8800 nid=0x5376 in Object.wait() [0x7f6ce2ecb000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:485) at org.apache.hadoop.ipc.Client.call(Client.java:1093) - locked 0x0005fdd661a0 (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226) at $Proxy9.rename(Unknown Source) at sun.reflect.GeneratedMethodAccessor29.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy9.rename(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.rename(DFSClient.java:759) at org.apache.hadoop.hdfs.DistributedFileSystem.rename(DistributedFileSystem.java:253) at
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.moveRecoveredEditsFromTemp(HLogSplitter.java:553) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.moveRecoveredEditsFromTemp(HLogSplitter.java:519) at org.apache.hadoop.hbase.master.SplitLogManager$1.finish(SplitLogManager.java:138) at org.apache.hadoop.hbase.master.SplitLogManager.getDataSetWatchSuccess(SplitLogManager.java:431) at org.apache.hadoop.hbase.master.SplitLogManager.access$1200(SplitLogManager.java:95) at org.apache.hadoop.hbase.master.SplitLogManager$GetDataAsyncCallback.processResult(SplitLogManager.java:1011) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:571) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497) {noformat} We are effectively bottlenecking on doing NN operations and whatever else is happening in GetDataAsyncCallback. It was so bad that on our 100 offline cluster it took a few hours for the master to process all the incoming ZK events while the actual splitting took a fraction of that time. I'm marking this as critical and against 0.96 but depending on how involved the fix is we might want to backport. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
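The direction discussed here can be sketched as handing the heavy work off the ZooKeeper EventThread. The names below (`SplitTaskFinisher`, etc.) are hypothetical, not the eventual patch: the ZK callback only enqueues, and a dedicated pool does the NameNode renames:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: keep the single ZooKeeper EventThread free by pushing
// slow filesystem work (e.g. HDFS renames) onto a dedicated executor.
public class SplitTaskFinisher {
    private final ExecutorService finisherPool;

    public SplitTaskFinisher(int threads) {
        this.finisherPool = Executors.newFixedThreadPool(threads);
    }

    // Called from the ZK EventThread: must return immediately, so it only
    // enqueues the heavy finish work instead of running it inline.
    public void processResult(Runnable heavyFinishWork) {
        finisherPool.submit(heavyFinishWork);
    }

    // Drain the pool; returns true if all queued work completed in time.
    public boolean shutdownAndWait(long seconds) throws InterruptedException {
        finisherPool.shutdown();
        return finisherPool.awaitTermination(seconds, TimeUnit.SECONDS);
    }
}
```

Because ZooKeeper delivers all watch callbacks on one event thread, any blocking call inside the callback serializes every pending event; queueing to a pool removes that bottleneck.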
[jira] [Commented] (HBASE-6309) [MTTR] Do NN operations outside of the ZK EventThread in SplitLogManager
[ https://issues.apache.org/jira/browse/HBASE-6309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406316#comment-13406316 ] chunhui shen commented on HBASE-6309: - bq. how would the cleanup go? In HLogSplitter#createWAP:
{code}
if (tmpname == null && fs.exists(regionedits)) {
  LOG.warn("Found existing old edits file. It could be the "
      + "result of a previous failed split attempt. Deleting " + regionedits
      + ", length=" + fs.getFileStatus(regionedits).getLen());
  if (!fs.delete(regionedits, false)) {
    LOG.warn("Failed delete of old " + regionedits);
  }
}
{code}
A log splitting could also fail when using master-local-splitting; the cleanup happens on the next splitting attempt, as per the above code. [MTTR] Do NN operations outside of the ZK EventThread in SplitLogManager Key: HBASE-6309 URL: https://issues.apache.org/jira/browse/HBASE-6309 Project: HBase Issue Type: Improvement Affects Versions: 0.92.1, 0.94.0, 0.96.0 Reporter: Jean-Daniel Cryans Priority: Critical Fix For: 0.96.0 We found this issue during the leap second cataclysm which prompted a distributed splitting of all our logs. I saw that none of the RS were splitting after some time while the master was showing that it wasn't even 30% done.
It was so bad that on our 100 offline cluster it took a few hours for the master to process all the incoming ZK events while the actual splitting took a fraction of that time. I'm marking this as critical and against 0.96 but depending on how involved the fix is we might want to backport.
[jira] [Commented] (HBASE-5450) Support for wire-compatibility in inter-cluster replication (ZK, etc)
[ https://issues.apache.org/jira/browse/HBASE-5450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406315#comment-13406315 ] stack commented on HBASE-5450: -- @Chris See HBASE-5965. Looks like I abandoned it on the last length. Looks like it needs some polish to get it over the finish line. If you are up for it, be my guest. Thanks. Support for wire-compatibility in inter-cluster replication (ZK, etc) - Key: HBASE-5450 URL: https://issues.apache.org/jira/browse/HBASE-5450 Project: HBase Issue Type: Sub-task Components: ipc, master, migration, regionserver Reporter: Todd Lipcon Assignee: Chris Trezzo
[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-6299: --- Attachment: HBASE-6299-v2.patch Make handling of RegionAlreadyInTransitionException work. RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems. - Key: HBASE-6299 URL: https://issues.apache.org/jira/browse/HBASE-6299 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.6, 0.94.0 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Critical Attachments: HBASE-6299-v2.patch, HBASE-6299.patch 1. HMaster tries to assign a region to an RS. 2. HMaster creates a RegionState for this region and puts it into regionsInTransition. 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts to proceed, with success eventually. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out. 4. HMaster attemps to assign for a second time, choosing another RS. 5. But since the HMaster's OpenedRegionHandler has been triggered by the region open of the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster finds invalid and ignores the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt. 6. The unassigned ZK node stays and a later unassign fails coz RS_ZK_REGION_CLOSING cannot be created. 
[jira] [Commented] (HBASE-5705) Introduce Protocol Buffer RPC engine
[ https://issues.apache.org/jira/browse/HBASE-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406330#comment-13406330 ] Devaraj Das commented on HBASE-5705: Thanks for looking at the patch, Ted. I'll update it soon. Introduce Protocol Buffer RPC engine Key: HBASE-5705 URL: https://issues.apache.org/jira/browse/HBASE-5705 Project: HBase Issue Type: Sub-task Components: ipc, master, migration, regionserver Reporter: Devaraj Das Assignee: Devaraj Das Attachments: 5705-1.patch Introduce a Protocol Buffer RPC engine in the RPC core. Protocols that are PB-aware can be made to go through this RPC engine. The approach, in my current thinking, would be similar to HADOOP-7773.
[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-6299: --- Status: Patch Available (was: Open) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems. - Key: HBASE-6299 URL: https://issues.apache.org/jira/browse/HBASE-6299 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.0, 0.90.6 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Critical Attachments: HBASE-6299-v2.patch, HBASE-6299.patch 1. HMaster tries to assign a region to an RS. 2. HMaster creates a RegionState for this region and puts it into regionsInTransition. 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts to proceed, with success eventually. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out. 4. HMaster attemps to assign for a second time, choosing another RS. 5. But since the HMaster's OpenedRegionHandler has been triggered by the region open of the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster finds invalid and ignores the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt. 6. The unassigned ZK node stays and a later unassign fails coz RS_ZK_REGION_CLOSING cannot be created. 
[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-6299: --- Status: Open (was: Patch Available) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems. - Key: HBASE-6299 URL: https://issues.apache.org/jira/browse/HBASE-6299 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.0, 0.90.6 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Critical Attachments: HBASE-6299-v2.patch, HBASE-6299.patch 1. HMaster tries to assign a region to an RS. 2. HMaster creates a RegionState for this region and puts it into regionsInTransition. 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts to proceed, with success eventually. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out. 4. HMaster attemps to assign for a second time, choosing another RS. 5. But since the HMaster's OpenedRegionHandler has been triggered by the region open of the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster finds invalid and ignores the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt. 6. The unassigned ZK node stays and a later unassign fails coz RS_ZK_REGION_CLOSING cannot be created. 
[jira] [Commented] (HBASE-6309) [MTTR] Do NN operations outside of the ZK EventThread in SplitLogManager
[ https://issues.apache.org/jira/browse/HBASE-6309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406345#comment-13406345 ] stack commented on HBASE-6309: -- What about the logic in moveRecoveredEditsFromTemp? It flags corrupted logs and does some other cleanup. It also seems to find recovered.edits files with a .corrupt ending: see ZKSplitLog.isCorruptFlagFile. That'd need refactoring and a rename from moveRecoveredEditsFromTemp to 'completeLogSplit' or 'finish'? Otherwise, looking through HLogSplitter and trying to recall issues we've run into w/ recovered.edits, I think doing it in place can work. I would suggest you look at the region open and replay of recovered.edits stuff too, to see if you see any possible issues there (I only went through HLogSplitter). (That renaming stuff is pretty heavy duty stuff, but I'd have done the same to cordon off a distributed operation.) Good stuff Chunhui. [MTTR] Do NN operations outside of the ZK EventThread in SplitLogManager Key: HBASE-6309 URL: https://issues.apache.org/jira/browse/HBASE-6309 Project: HBase Issue Type: Improvement Affects Versions: 0.92.1, 0.94.0, 0.96.0 Reporter: Jean-Daniel Cryans Priority: Critical Fix For: 0.96.0 We found this issue during the leap second cataclysm, which prompted a distributed splitting of all our logs. I saw that none of the RS were splitting after some time while the master was showing that it wasn't even 30% done. 
jstack'ing I saw this: {noformat} main-EventThread daemon prio=10 tid=0x7f6ce46d8800 nid=0x5376 in Object.wait() [0x7f6ce2ecb000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:485) at org.apache.hadoop.ipc.Client.call(Client.java:1093) - locked 0x0005fdd661a0 (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226) at $Proxy9.rename(Unknown Source) at sun.reflect.GeneratedMethodAccessor29.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy9.rename(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.rename(DFSClient.java:759) at org.apache.hadoop.hdfs.DistributedFileSystem.rename(DistributedFileSystem.java:253) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.moveRecoveredEditsFromTemp(HLogSplitter.java:553) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.moveRecoveredEditsFromTemp(HLogSplitter.java:519) at org.apache.hadoop.hbase.master.SplitLogManager$1.finish(SplitLogManager.java:138) at org.apache.hadoop.hbase.master.SplitLogManager.getDataSetWatchSuccess(SplitLogManager.java:431) at org.apache.hadoop.hbase.master.SplitLogManager.access$1200(SplitLogManager.java:95) at org.apache.hadoop.hbase.master.SplitLogManager$GetDataAsyncCallback.processResult(SplitLogManager.java:1011) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:571) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497) {noformat} We are effectively bottlenecking on doing NN operations and whatever else is happening in GetDataAsyncCallback. 
It was so bad that on our 100-node offline cluster it took a few hours for the master to process all the incoming ZK events, while the actual splitting took a fraction of that time. I'm marking this as critical and against 0.96, but depending on how involved the fix is we might want to backport. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
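The fix direction the summary suggests (do the NN work outside the ZK EventThread) boils down to handing the rename/delete-heavy finish() step to a worker pool so the event thread returns immediately. A minimal pure-JDK sketch of that hand-off pattern; the class and method names (SplitFinishOffload, finishSplit, the task path) are hypothetical stand-ins, not the actual SplitLogManager API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SplitFinishOffload {
    // Small pool dedicated to NameNode-heavy cleanup work.
    private final ExecutorService nnWorkers = Executors.newFixedThreadPool(4);

    // Called from the ZK EventThread; must not block on NameNode RPCs,
    // so it only enqueues the work and returns a Future.
    public Future<String> finishAsync(String taskPath) {
        return nnWorkers.submit(() -> finishSplit(taskPath));
    }

    // Placeholder for the rename/delete work done against the NameNode.
    static String finishSplit(String taskPath) {
        return "finished:" + taskPath;
    }

    public static void main(String[] args) throws Exception {
        SplitFinishOffload o = new SplitFinishOffload();
        System.out.println(o.finishAsync("/hbase/splitlog/wal-1").get());
        o.nnWorkers.shutdown();
    }
}
```

The point of the pattern is only that the event thread's callback does an O(1) enqueue; the slow filesystem calls then bottleneck the worker pool, not ZK event delivery.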
[jira] [Commented] (HBASE-6309) [MTTR] Do NN operations outside of the ZK EventThread in SplitLogManager
[ https://issues.apache.org/jira/browse/HBASE-6309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406353#comment-13406353 ] ramkrishna.s.vasudevan commented on HBASE-6309: --- Currently there are 3 renames in this path. The first renames the temp dir to the recovered.edits path; the next are in archiving logs, where there are 2 more, one for corrupted files and the other for the archived path. In between there are a lot of delete and exists calls. I think we can reduce the number of NN operations. How costly are the delete and exists checks? I will check on this more.
[jira] [Commented] (HBASE-6306) TestFSUtils fails against hadoop 2.0
[ https://issues.apache.org/jira/browse/HBASE-6306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406384#comment-13406384 ] ramkrishna.s.vasudevan commented on HBASE-6306: --- In 0.94 this is not there, right, Jon? Because for us this testcase in 0.94 passes on hadoop 2.0. TestFSUtils fails against hadoop 2.0 Key: HBASE-6306 URL: https://issues.apache.org/jira/browse/HBASE-6306 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.96.0 Reporter: Jonathan Hsieh Assignee: Jonathan Hsieh Fix For: 0.96.0 Attachments: hbase-6306-trunk.patch trunk: mvn clean test -Dhadoop.profile=2.0 -Dtest=TestFSUtils {code} java.io.FileNotFoundException: File /home/jon/proj/hbase-trunk/hbase-server/target/test-data/02beb8c8-06c1-47ea-829b-6e7ce0570cf8/hbase.version does not exist at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:315) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1279) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1319) at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:557) at org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:213) at org.apache.hadoop.hbase.util.FSUtils.getVersion(FSUtils.java:270) at org.apache.hadoop.hbase.util.TestFSUtils.testVersion(TestFSUtils.java:58) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ... {code}
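The trace above shows listStatus throwing FileNotFoundException for a directory with no hbase.version file, where older Hadoop returned a value the caller could test. The defensive pattern for a version-reading helper is to treat "not found" as "no version file" instead of letting the exception escape. A sketch of that pattern using java.nio as a stand-in for the Hadoop FileSystem API (readVersion and the return strings are invented for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.stream.Stream;

public class VersionFileCheck {
    // Returns "found" if dir contains hbase.version, "absent" if the file or
    // the directory itself does not exist, mirroring the tolerant behavior a
    // getVersion-style helper needs under hadoop 2.0 semantics.
    static String readVersion(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.anyMatch(p -> p.getFileName().toString().equals("hbase.version"))
                    ? "found" : "absent";
        } catch (NoSuchFileException e) {
            // Directory missing entirely: same answer as "no version file".
            return "absent";
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readVersion(Path.of("definitely-missing-dir")));
    }
}
```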
[jira] [Updated] (HBASE-6313) Client hangs because the client is not notified
[ https://issues.apache.org/jira/browse/HBASE-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] binlijin updated HBASE-6313: Attachment: HBASE-6313-0.94-2.patch Client hangs because the client is not notified Key: HBASE-6313 URL: https://issues.apache.org/jira/browse/HBASE-6313 Project: HBase Issue Type: Bug Affects Versions: 0.92.1, 0.94.0 Reporter: binlijin Fix For: 0.94.1 Attachments: HBASE-6313-0.92-2.patch, HBASE-6313-0.92.patch, HBASE-6313-0.94-2.patch, HBASE-6313-0.94.patch, HBASE-6313-trunk.patch, clienthangthread.out If the call is first removed from the calls map and then some exception happens while reading from the DataInputStream, the call is never notified, causing the client to hang.
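The hang described above happens because connection-level cleanup notifies only the calls still present in the calls map; a call removed before the read failure never wakes its waiter. The usual fix pattern is to set the error and notify the call object directly wherever the read fails. A minimal pure-JDK sketch of that pattern; the Call class and method names here are illustrative, not the HBase client's actual ones:

```java
public class CallNotify {
    static class Call {
        boolean done;
        Throwable error;

        // Must be invoked on any failure path, even after the call has
        // already been removed from the calls map.
        synchronized void setError(Throwable t) {
            error = t;
            done = true;
            notifyAll();
        }

        synchronized void waitDone() throws InterruptedException {
            while (!done) wait();
        }
    }

    public static void main(String[] args) throws Exception {
        Call call = new Call();
        Thread waiter = new Thread(() -> {
            try { call.waitDone(); } catch (InterruptedException ignored) { }
        });
        waiter.start();
        // On a read failure, notify the call even though it left the calls map.
        call.setError(new java.io.IOException("connection reset"));
        waiter.join(5000);
        System.out.println("waiter alive: " + waiter.isAlive());
    }
}
```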
[jira] [Updated] (HBASE-6313) Client hangs because the client is not notified
[ https://issues.apache.org/jira/browse/HBASE-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] binlijin updated HBASE-6313: Attachment: HBASE-6313-0.92-3.patch
[jira] [Updated] (HBASE-6313) Client hangs because the client is not notified
[ https://issues.apache.org/jira/browse/HBASE-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] binlijin updated HBASE-6313: Attachment: HBASE-6313-trunk-2.patch
[jira] [Commented] (HBASE-4955) Use the official versions of surefire junit
[ https://issues.apache.org/jira/browse/HBASE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406429#comment-13406429 ] nkeywal commented on HBASE-4955: Update: Still waiting. There is some life on Surefire; for JUnit there won't be anything before Q4, I guess. Use the official versions of surefire junit - Key: HBASE-4955 URL: https://issues.apache.org/jira/browse/HBASE-4955 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor We currently use private versions for Surefire JUnit since HBASE-4763. This JIRA tracks what we need to move to official versions. Surefire 2.11 is just out, but, after some tests, it does not contain all that we need. JUnit: could be for JUnit 4.11. Issue to monitor: https://github.com/KentBeck/junit/issues/359: fixed in our version, no feedback for an integration on trunk. Surefire: could be for Surefire 2.12. Issues to monitor are: 329 (category support): fixed, we use the official implementation from the trunk 786 (@Category with forkMode=always): fixed, we use the official implementation from the trunk 791 (incorrect elapsed time on test failure): fixed, we use the official implementation from the trunk 793 (incorrect time in the XML report): not fixed (reopened) on trunk, fixed in our version 760 (does not take into account the test method): fixed in trunk, not fixed in our version 798 (print immediately the test class name): not fixed in trunk, not fixed in our version 799 (allow test parallelization when forkMode=always): not fixed in trunk, not fixed in our version 800 (redirectTestOutputToFile not taken into account): not yet fixed on trunk, fixed in our version 800 and 793 are the most important to monitor; they are the only ones that are fixed in our version but not on trunk. -- This message is automatically generated by JIRA. 
[jira] [Commented] (HBASE-6309) [MTTR] Do NN operations outside of the ZK EventThread in SplitLogManager
[ https://issues.apache.org/jira/browse/HBASE-6309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406434#comment-13406434 ] nkeywal commented on HBASE-6309: bq. How costly is delete and exists check? A remote call to the NN, but no socket creation (the connection is persistent). There is no cache on the client side, so all exists calls do the network round trip. Exists is pretty fast (not much more cost than the network round trip), but it adds a little something to the NN and network workload, which can already be high when there is a major failure...
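Since each exists() probe is a full round trip to the NN, one easy reduction is to drop the exists-then-delete pattern: delete already reports whether it removed anything, so one call suffices where two were used. A sketch of the one-trip version, using java.nio's deleteIfExists as a stand-in for the boolean returned by the Hadoop FileSystem delete call:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class OneTripDelete {
    // One call instead of exists(p) followed by delete(p): the return value
    // tells us whether anything was actually removed.
    static boolean deleteDirect(Path p) throws IOException {
        return Files.deleteIfExists(p);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(deleteDirect(Path.of("no-such-file")));
    }
}
```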
[jira] [Created] (HBASE-6327) HLog can be null when create table
ShiXing created HBASE-6327: -- Summary: HLog can be null when create table Key: HBASE-6327 URL: https://issues.apache.org/jira/browse/HBASE-6327 Project: HBase Issue Type: Bug Reporter: ShiXing Assignee: ShiXing Attachments: createTableFailedMaster.log As HBASE-4010 discussed, the HLog can be null. We have seen createTable fail because of the unused hlog. When createHRegion runs, the HLog.LogSyncer is running sync(), which underneath calls DFSClient.DFSOutputStream.sync(). Then hlog.closeAndDelete() is called; first HLog.close() interrupts the LogSyncer, which interrupts DFSClient.DFSOutputStream.sync(). The DFSClient.DFSOutputStream stores the exception and throws it when we call DFSClient.close(). HLog.close() calls writer.close()/DFSClient.close() after interrupting the LogSyncer, and there is no catch around the close(), so the master throws the exception to the client. There is no need to throw this exception; furthermore, the hlog is unused. Our cluster is 0.90; the logs are attached. After closing the hlog writer, there is no log for the createTable(). On trunk and 0.92/0.94 we use just one hlog, and if the exception happens the client sees createTable as failed, but in fact all the regions for the table can still be assigned. I will give the patch for this later. -- This message is automatically generated by JIRA.
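Since the hlog is about to be discarded anyway, the fix the report points toward is to catch the close() failure instead of letting it fail createTable. A minimal sketch of that swallow-and-log pattern, with a plain Closeable standing in for the real HLog writer (closeQuietly is an invented helper name, not the actual patch):

```java
import java.io.Closeable;
import java.io.IOException;

public class QuietClose {
    // Close a writer whose contents are being thrown away; a failure is
    // logged but deliberately not propagated to the caller.
    static boolean closeQuietly(Closeable writer) {
        try {
            writer.close();
            return true;
        } catch (IOException e) {
            System.err.println("Ignoring close failure on discarded hlog: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        Closeable failing = () -> { throw new IOException("interrupted sync"); };
        System.out.println(closeQuietly(failing));  // failure swallowed, prints false
    }
}
```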
[jira] [Updated] (HBASE-6327) HLog can be null when create table
[ https://issues.apache.org/jira/browse/HBASE-6327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ShiXing updated HBASE-6327: --- Attachment: createTableFailedMaster.log
[jira] [Commented] (HBASE-6272) In-memory region state is inconsistent
[ https://issues.apache.org/jira/browse/HBASE-6272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406445#comment-13406445 ] ramkrishna.s.vasudevan commented on HBASE-6272: --- Some comments (I should say questions) added in RB. Thanks. In-memory region state is inconsistent -- Key: HBASE-6272 URL: https://issues.apache.org/jira/browse/HBASE-6272 Project: HBase Issue Type: Bug Reporter: Jimmy Xiang Assignee: Jimmy Xiang AssignmentManager stores region state related information in several places: regionsInTransition, regions (region info to server name map), and servers (server name to region info set map). However, the access to these places is not coordinated properly. It leads to inconsistent in-memory region state information. Sometimes a region could even be offline and not in transition.
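The inconsistency comes from mutating the three views (regionsInTransition, region-to-server, server-to-regions) without coordination. The simplest way to keep them agreeing is to guard every multi-map mutation with one lock, so each transition is atomic across all three. A hedged pure-JDK sketch of that idea; the class, field, and method names are illustrative, not AssignmentManager's actual code:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RegionStateBook {
    private final Set<String> regionsInTransition = new HashSet<>();
    private final Map<String, String> regionToServer = new HashMap<>();
    private final Map<String, Set<String>> serverToRegions = new HashMap<>();

    // All three views change under one lock, so readers never observe a
    // region that is assigned in one map but missing from another.
    public synchronized void regionOpened(String region, String server) {
        regionsInTransition.remove(region);
        regionToServer.put(region, server);
        serverToRegions.computeIfAbsent(server, s -> new HashSet<>()).add(region);
    }

    public synchronized boolean consistent(String region) {
        String server = regionToServer.get(region);
        return server != null
                && serverToRegions.getOrDefault(server, Set.of()).contains(region)
                && !regionsInTransition.contains(region);
    }

    public static void main(String[] args) {
        RegionStateBook book = new RegionStateBook();
        book.regionOpened("b713fd655fa02395496c5a6e39ddf568", "swbss-hadoop-006,60020");
        System.out.println(book.consistent("b713fd655fa02395496c5a6e39ddf568"));
    }
}
```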
[jira] [Commented] (HBASE-5876) TestImportExport has been failing against hadoop 0.23 profile
[ https://issues.apache.org/jira/browse/HBASE-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406455#comment-13406455 ] ramkrishna.s.vasudevan commented on HBASE-5876: --- I tried running the patch for 0.94 on hadoop 2.0. It passed (but I'm not much aware of the changes). :) TestImportExport has been failing against hadoop 0.23 profile - Key: HBASE-5876 URL: https://issues.apache.org/jira/browse/HBASE-5876 Project: HBase Issue Type: Bug Affects Versions: 0.94.0, 0.96.0 Reporter: Zhihong Ted Yu Assignee: Jonathan Hsieh Fix For: 0.96.0, 0.94.1 Attachments: hbase-5876-94-v3.patch, hbase-5876-94.patch, hbase-5876-trunk-v3.patch, hbase-5876-v2.patch, hbase-5876.patch TestImportExport has been failing against hadoop 0.23 profile
[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406476#comment-13406476 ] ramkrishna.s.vasudevan commented on HBASE-6299: --- @Maryann We just checked over here in 0.94. {code} if (t instanceof RegionAlreadyInTransitionException) { String errorMsg = "Failed assignment in: " + plan.getDestination() + " due to " + t.getMessage(); LOG.error(errorMsg, t); return; } {code} This piece of code is correct. If we directly check instanceof, it doesn't match. Thanks.. RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems. - Key: HBASE-6299 URL: https://issues.apache.org/jira/browse/HBASE-6299 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.6, 0.94.0 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Critical Attachments: HBASE-6299-v2.patch, HBASE-6299.patch 1. HMaster tries to assign a region to an RS. 2. HMaster creates a RegionState for this region and puts it into regionsInTransition. 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts to proceed, with success eventually. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out. 4. HMaster attempts to assign for a second time, choosing another RS. 5. But since the HMaster's OpenedRegionHandler has been triggered by the region open of the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster finds it invalid and ignores the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt. 6. The unassigned ZK node stays and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created. 
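On the "directly check instanceof it doesn't match" point above: a common reason a direct instanceof misses is that the exception arrives wrapped (for example inside a remote-exception wrapper), so the check has to walk the cause chain first. A generic sketch of that unwrapping, not the actual AssignmentManager code:

```java
public class CauseCheck {
    // Walks t and its causes looking for any throwable of the given type.
    // A bare `t instanceof type` would miss a wrapped exception.
    static boolean isInstanceOf(Throwable t, Class<? extends Throwable> type) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (type.isInstance(cur)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable wrapped =
                new RuntimeException(new IllegalStateException("already in transition"));
        // Direct instanceof on `wrapped` would say false; the chain walk says true.
        System.out.println(isInstanceOf(wrapped, IllegalStateException.class));
    }
}
```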
[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406533#comment-13406533 ] stack commented on HBASE-6299: -- bq. This piece of code is correct. If we directly check instanceof it doesn't match. Thanks.. Is it correct or incorrect, Ram? I'm not sure going by the above.
[jira] [Commented] (HBASE-6319) ReplicationSource can call terminate on itself and deadlock
[ https://issues.apache.org/jira/browse/HBASE-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406539#comment-13406539 ] stack commented on HBASE-6319: -- How does 'this' get shutdown then? ReplicationSource can call terminate on itself and deadlock --- Key: HBASE-6319 URL: https://issues.apache.org/jira/browse/HBASE-6319 Project: HBase Issue Type: Bug Affects Versions: 0.90.6, 0.92.1, 0.94.0 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Fix For: 0.90.7, 0.92.2, 0.94.2 Attachments: HBASE-6319-0.92.patch In a few places the ReplicationSource code calls terminate() on itself, which is a problem since in terminate() we wait on that thread to die. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
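The deadlock J-D describes is mechanical: terminate() joins the source thread, so a source calling terminate() on itself would wait forever for its own death. A minimal sketch of the hazard and the guard (hypothetical class and method names, not the actual ReplicationSource code):

```java
public class SourceThread extends Thread {
    private volatile boolean running = true;

    @Override
    public void run() {
        while (running) {
            // ... ship replication edits ...
            running = false; // keep the demo finite
        }
    }

    // Joining this thread from itself would block forever; guard against it.
    public void terminate() throws InterruptedException {
        running = false;
        this.interrupt();
        if (Thread.currentThread() != this) {
            this.join(); // safe: the caller is a different thread
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SourceThread t = new SourceThread();
        t.start();
        t.terminate(); // called from main, so the join is safe
        if (t.isAlive()) throw new AssertionError("thread should have exited");
        System.out.println("terminated cleanly");
    }
}
```

The `Thread.currentThread() != this` check is one possible guard; the actual fix committed to HBase may differ.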
[jira] [Commented] (HBASE-6312) Make BlockCache eviction thresholds configurable
[ https://issues.apache.org/jira/browse/HBASE-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406553#comment-13406553 ] Jason Dai commented on HBASE-6312: -- If we do not expose the acceptable factor and minimum factor, the hfile.block.cache.size parameter can be very confusing (for the user to properly configure the cache size behavior). Just as J-D mentioned, if the user wants a 2GB cache, he needs to set the parameter to ~2.35GB, and he needs to understand the HBase implementation details to do that. This looks a lot like hacking, not a user-friendly interface. Maybe we should evict only after the cache size is larger than hfile.block.cache.size, and allow ~15% burstiness before blocking. Make BlockCache eviction thresholds configurable Key: HBASE-6312 URL: https://issues.apache.org/jira/browse/HBASE-6312 Project: HBase Issue Type: Improvement Components: io Affects Versions: 0.94.0 Reporter: Jie Huang Priority: Minor Attachments: hbase-6312.patch Some of our customers found that tuning the BlockCache eviction thresholds made test results different in their test environment. However, those thresholds are not configurable in the current implementation. The only way to change those values is to re-compile the HBase source code. We wonder if it is possible to make them configurable. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
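The ~2.35GB figure in the comment is just the desired size divided by the eviction trigger. A quick check, assuming the 0.85 acceptable factor and 0.75 minimum factor that the discussion says are hard-coded (values taken from this thread, not verified against every branch):

```java
public class CacheSizing {
    static final double ACCEPTABLE_FACTOR = 0.85; // eviction starts at this fill level
    static final double MIN_FACTOR = 0.75;        // eviction drains down to this level

    public static void main(String[] args) {
        double desiredGB = 2.0;
        // To actually keep ~2 GB resident before eviction kicks in,
        // hfile.block.cache.size must be oversized so that
        // desired = configured * ACCEPTABLE_FACTOR.
        double configuredGB = desiredGB / ACCEPTABLE_FACTOR;
        System.out.printf("configure ~%.2f GB to hold %.1f GB%n", configuredGB, desiredGB);
        // Each eviction pass then frees roughly (0.85 - 0.75) of the configured size.
    }
}
```

This is exactly the implementation detail Jason argues users should not have to know; exposing the factors (or redefining the parameter as the effective size) removes the mental division.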
[jira] [Commented] (HBASE-5705) Introduce Protocol Buffer RPC engine
[ https://issues.apache.org/jira/browse/HBASE-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406557#comment-13406557 ] stack commented on HBASE-5705: -- I added some comments up in RB. Seems like pb stuff goes via Writables still? Would be nice if I did not have to read hadoop-7773 patch to figure out what this change is doing. Any chance of a sentence or two on intent? Good stuff DD. Introduce Protocol Buffer RPC engine Key: HBASE-5705 URL: https://issues.apache.org/jira/browse/HBASE-5705 Project: HBase Issue Type: Sub-task Components: ipc, master, migration, regionserver Reporter: Devaraj Das Assignee: Devaraj Das Attachments: 5705-1.patch Introduce Protocol Buffer RPC engine in the RPC core. Protocols that are PB aware can be made to go through this RPC engine. The approach, in my current thinking, would be similar to HADOOP-7773. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6325) [replication] Race in ReplicationSourceManager.init can initiate a failover even if the node is alive
[ https://issues.apache.org/jira/browse/HBASE-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406562#comment-13406562 ] stack commented on HBASE-6325: -- Below returns true if we added regionservers. What if we are adding regionservers we already had in otherRegionServers (how do I know only a newRsList is returned? Because called on construction and nodeCreated?)
{code}
/**
+ * Reads the list of region servers from ZK and updates the
+ * local view of it
+ * @return true if the update was successful, else false
+ */
+ private boolean refreshOtherRegionServersList() {
+   List<String> newRsList = zkHelper.getRegisteredRegionServers();
+   if (newRsList == null) {
+     return false;
+   } else {
+     synchronized (otherRegionServers) {
+       otherRegionServers.clear();
+       otherRegionServers.addAll(newRsList);
+     }
+   }
+   return true;
+ }
{code}
This synchronize is not needed anymore since it's done inside refreshOtherRegionServersList?
{code}
synchronized (otherRegionServers) {
+ refreshOtherRegionServersList();
  LOG.info("Current list of replicators: " + currentReplicators
      + ", other RSs: " + otherRegionServers);
}
{code}
[replication] Race in ReplicationSourceManager.init can initiate a failover even if the node is alive - Key: HBASE-6325 URL: https://issues.apache.org/jira/browse/HBASE-6325 Project: HBase Issue Type: Bug Affects Versions: 0.90.6, 0.92.1, 0.94.0 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Fix For: 0.90.7, 0.92.2, 0.96.0, 0.94.2 Attachments: HBASE-6325-0.92.patch Yet another bug found during the leap second madness, it's possible to miss the registration of new region servers so that in ReplicationSourceManager.init we start the failover of a live and replicating region server. 
I don't think there's data loss but the RS that's being failed over will die on: {noformat} 2012-07-01 06:25:15,604 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server sv4r23s48,10304,1341112194623: Writing replication status org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/sv4r23s48,10304,1341112194623/4/sv4r23s48%2C10304%2C1341112194623.1341112195369 at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1246) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:372) at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:655) at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:697) at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:470) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:607) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:368) {noformat} It seems to me that just refreshing {{otherRegionServers}} after getting the list of {{currentReplicators}} would be enough to fix this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6322) Unnecessary creation of finalizers in HTablePool
[ https://issues.apache.org/jira/browse/HBASE-6322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-6322: - Attachment: HBASE-6322-trunk.1.patch What I applied to trunk. Unnecessary creation of finalizers in HTablePool Key: HBASE-6322 URL: https://issues.apache.org/jira/browse/HBASE-6322 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.92.0, 0.92.1, 0.94.0 Reporter: Ryan Brush Attachments: HBASE-6322-0.92.1.patch, HBASE-6322-trunk.1.patch From a mailing list question: While generating some load against a library that makes extensive use of HTablePool in 0.92, I noticed that the largest heap consumer was java.lang.ref.Finalizer. Digging in, I discovered that HTablePool's internal PooledHTable extends HTable, which instantiates a ThreadPoolExecutor and supporting objects every time a pooled HTable is retrieved. Since ThreadPoolExecutor has a finalizer, it and its dependencies can't get garbage collected until the finalizer runs. The result is by using HTablePool, we're creating a ton of objects to be finalized that are stuck on the heap longer than they should be, creating our largest source of pressure on the garbage collector. It looks like this will also be a problem in 0.94 and trunk. The easy fix is just to have PooledHTable implement HTableInterface (rather than subclass HTable), but this does break a unit test that explicitly checks that PooledHTable implements HTable -- I can only assume this test is there for some historical passivity reason. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
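Ryan's proposed fix, delegation through the interface instead of subclassing, avoids the per-checkout ThreadPoolExecutor entirely. A generic sketch of the two shapes (all class names here are hypothetical stand-ins for HTable/PooledHTable, not the actual HBase client classes):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

interface TableOps {
    void put(String row);
}

// Heavyweight table: owns an ExecutorService whose ThreadPoolExecutor
// carries a finalizer, delaying GC of everything it references.
class HeavyTable implements TableOps {
    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    public void put(String row) { /* submit work to the pool */ }
    public void close() { pool.shutdown(); }
}

// Pooled wrapper: implements the interface and delegates, so checking a
// table out of the pool allocates no executor and no finalizable object.
class PooledTable implements TableOps {
    private final TableOps delegate;
    PooledTable(TableOps delegate) { this.delegate = delegate; }
    public void put(String row) { delegate.put(row); }
}

public class PoolSketch {
    public static void main(String[] args) {
        HeavyTable real = new HeavyTable();      // created once, reused
        TableOps handle = new PooledTable(real); // cheap per-checkout object
        handle.put("row1");
        real.close();
        System.out.println("no per-checkout executor created");
    }
}
```

The trade-off mentioned in the report applies here too: a wrapper is no longer an instance of the concrete class, which is why the old unit test asserting `PooledHTable instanceof HTable` breaks.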
[jira] [Created] (HBASE-6328) FSHDFSUtils#recoverFileLease tries to rethrow InterruptedException but actually swallows it
nkeywal created HBASE-6328: -- Summary: FSHDFSUtils#recoverFileLease tries to rethrow InterruptedException but actually swallows it Key: HBASE-6328 URL: https://issues.apache.org/jira/browse/HBASE-6328 Project: HBase Issue Type: Bug Components: master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Priority: Minor Coding error is:
{noformat}
try {
  Thread.sleep(1000);
} catch (InterruptedException ex) {
  new InterruptedIOException().initCause(ex);
}
{noformat}
The exception is created but not thrown... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
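The one-line fix is to actually throw what is constructed (and, as good practice, restore the interrupt flag first). A sketch of the corrected handler, with a hypothetical helper name rather than the real FSHDFSUtils signature:

```java
import java.io.InterruptedIOException;

public class RethrowSketch {
    // Hypothetical helper mirroring the corrected logic: the
    // InterruptedIOException must be thrown, not just constructed.
    static void sleepOrRethrow(long millis) throws InterruptedIOException {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt(); // restore interrupt status
            InterruptedIOException iioe = new InterruptedIOException();
            iioe.initCause(ex);
            throw iioe; // the original code omitted this throw
        }
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt(); // make sleep() throw immediately
        boolean rethrown = false;
        try {
            sleepOrRethrow(1000);
        } catch (InterruptedIOException e) {
            rethrown = (e.getCause() instanceof InterruptedException);
        }
        if (!rethrown) throw new AssertionError("InterruptedIOException not rethrown");
        System.out.println("ok");
    }
}
```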
[jira] [Resolved] (HBASE-6322) Unnecessary creation of finalizers in HTablePool
[ https://issues.apache.org/jira/browse/HBASE-6322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack resolved HBASE-6322. -- Resolution: Fixed Fix Version/s: 0.92.2 Hadoop Flags: Reviewed Applied to 0.92. Thanks for the patch Ryan. Do we need something like this on 0.94 and trunk too? Unnecessary creation of finalizers in HTablePool Key: HBASE-6322 URL: https://issues.apache.org/jira/browse/HBASE-6322 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.92.0, 0.92.1, 0.94.0 Reporter: Ryan Brush Fix For: 0.92.2 Attachments: HBASE-6322-0.92.1.patch, HBASE-6322-trunk.1.patch From a mailing list question: While generating some load against a library that makes extensive use of HTablePool in 0.92, I noticed that the largest heap consumer was java.lang.ref.Finalizer. Digging in, I discovered that HTablePool's internal PooledHTable extends HTable, which instantiates a ThreadPoolExecutor and supporting objects every time a pooled HTable is retrieved. Since ThreadPoolExecutor has a finalizer, it and its dependencies can't get garbage collected until the finalizer runs. The result is by using HTablePool, we're creating a ton of objects to be finalized that are stuck on the heap longer than they should be, creating our largest source of pressure on the garbage collector. It looks like this will also be a problem in 0.94 and trunk. The easy fix is just to have PooledHTable implement HTableInterface (rather than subclass HTable), but this does break a unit test that explicitly checks that PooledHTable implements HTable -- I can only assume this test is there for some historical passivity reason. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6281) Assignment need not be called for disabling table regions during clean cluster start up.
[ https://issues.apache.org/jira/browse/HBASE-6281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-6281: - Resolution: Fixed Status: Resolved (was: Patch Available) Committed to 0.92. Thanks for patch Rajesh. Assignment need not be called for disabling table regions during clean cluster start up. Key: HBASE-6281 URL: https://issues.apache.org/jira/browse/HBASE-6281 Project: HBase Issue Type: Bug Affects Versions: 0.92.1, 0.94.0 Reporter: rajeshbabu Assignee: rajeshbabu Fix For: 0.92.2, 0.96.0, 0.94.1 Attachments: 6281-trunk-v2.txt, HBASE-6281_92.patch, HBASE-6281_94.patch, HBASE-6281_94_2.patch, HBASE-6281_trunk.patch Currently during clean cluster start up if there are tables in DISABLING state, we do bulk assignment through assignAllUserRegions() and after region is OPENED in RS, master checks if the table is in DISABLING/DISABLED state (in Am.regionOnline) and again calls unassign. This roundtrip can be avoided even before calling assignment. This JIRA is to address the above scenario. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406594#comment-13406594 ] ramkrishna.s.vasudevan commented on HBASE-6299: --- I am very sorry for not making it clear.
{code}
if (t instanceof RegionAlreadyInTransitionException) {
  String errorMsg = "Failed assignment in: " + plan.getDestination()
      + " due to " + t.getMessage();
  LOG.error(errorMsg, t);
  return;
}
{code}
The above piece of code is correct. The RegionAlreadyInTransition is of type RemoteException. So we need to unwrap it. In the current patch
{code}
if (t instanceof RegionAlreadyInTransitionException) {
+  String errorMsg = "Failed assignment in: " + plan.getDestination()
+      + " due to " + t.getMessage();
+  LOG.error(errorMsg, t);
+  return;
+}
{code}
This is done. It will not work. We just did a small verification of this. RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems. - Key: HBASE-6299 URL: https://issues.apache.org/jira/browse/HBASE-6299 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.6, 0.94.0 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Critical Attachments: HBASE-6299-v2.patch, HBASE-6299.patch 1. HMaster tries to assign a region to an RS. 2. HMaster creates a RegionState for this region and puts it into regionsInTransition. 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts to proceed, with success eventually. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out. 4. HMaster attempts to assign for a second time, choosing another RS. 5. 
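Ram's point is that the server-side exception arrives wrapped, so a bare instanceof test sees the wrapper's type, not the cause's. A toy illustration with stand-in classes (not the real Hadoop RemoteException API):

```java
import java.io.IOException;

// Stand-in for Hadoop's RemoteException: the real exception travels as a cause.
class RemoteWrapper extends IOException {
    RemoteWrapper(Throwable cause) { super(cause); }
    Throwable unwrap() { return getCause(); } // stand-in for unwrapRemoteException()
}

class RegionAlreadyInTransitionException extends IOException {}

public class UnwrapDemo {
    public static void main(String[] args) {
        Throwable t = new RemoteWrapper(new RegionAlreadyInTransitionException());
        // Direct check fails: the wrapper hides the real type.
        if (t instanceof RegionAlreadyInTransitionException)
            throw new AssertionError("wrapper should not match directly");
        // After unwrapping, the check matches.
        Throwable real = ((RemoteWrapper) t).unwrap();
        if (!(real instanceof RegionAlreadyInTransitionException))
            throw new AssertionError("expected match after unwrapping");
        System.out.println("matched after unwrap");
    }
}
```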
But since the HMaster's OpenedRegionHandler has been triggered by the region open of the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster finds invalid and ignores the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt. 6. The unassigned ZK node stays and a later unassign fails coz RS_ZK_REGION_CLOSING cannot be created. {code} 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.; plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-004,60020,1340890123243, dest=swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:28,882 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,291 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for 
CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007 Deleting existing unassigned node for b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007 Successfully deleted unassigned node for region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED 2012-06-29
[jira] [Commented] (HBASE-6306) TestFSUtils fails against hadoop 2.0
[ https://issues.apache.org/jira/browse/HBASE-6306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406593#comment-13406593 ] Jonathan Hsieh commented on HBASE-6306: --- Ram, correct. The test is new for 0.96, and the check function is different in 0.94. (previously used fs.exists, now uses fs.filestatus). TestFSUtils fails against hadoop 2.0 Key: HBASE-6306 URL: https://issues.apache.org/jira/browse/HBASE-6306 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.96.0 Reporter: Jonathan Hsieh Assignee: Jonathan Hsieh Fix For: 0.96.0 Attachments: hbase-6306-trunk.patch trunk: mvn clean test -Dhadoop.profile=2.0 -Dtest=TestFSUtils {code} java.io.FileNotFoundException: File /home/jon/proj/hbase-trunk/hbase-server/target/test-data/02beb8c8-06c1-47ea-829b-6e7ce0570cf8/hbase.version does not exist at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:315) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1279) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1319) at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:557) at org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:213) at org.apache.hadoop.hbase.util.FSUtils.getVersion(FSUtils.java:270) at org.apache.hadoop.hbase.util.TestFSUtils.testVersion(TestFSUtils.java:58) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ... {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6327) HLog can be null when create table
[ https://issues.apache.org/jira/browse/HBASE-6327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ShiXing updated HBASE-6327: --- Description: As HBASE-4010 discussed, the HLog can be null. We have seen createTable fail because of the unused hlog. When createHRegion runs, the HLog.LogSyncer is running sync(), which in the underlying layer calls DFSClient.DFSOutputStream.sync(). Then hlog.closeAndDelete() is called; first, HLog.close() interrupts the LogSyncer, which interrupts DFSClient.DFSOutputStream.sync(). The DFSClient.DFSOutputStream stores the exception and throws it when we call DFSClient.close(). HLog.close() calls writer.close()/DFSClient.close() after interrupting the LogSyncer, and there is no catch for the exception from close(), so the Master throws the exception to the client. There is no need to throw this exception; furthermore, the hlog is of no use. Our cluster is 0.90; the logs are attached. After closing the hlog writer, there is no log for the createTable(). In trunk and 0.92, 0.94 we use just one hlog, and if the exception happens the client will get createTable failed, but indeed we expect all the regions for the table can still be assigned. I will give the patch for this later. was: As HBASE-4010 discussed, the HLog can be null. We have seen createTable fail because of the unused hlog. When createHRegion runs, the HLog.LogSyncer is running sync(), which in the underlying layer calls DFSClient.DFSOutputStream.sync(). Then hlog.closeAndDelete() is called; first, HLog.close() interrupts the LogSyncer, which interrupts DFSClient.DFSOutputStream.sync(). The DFSClient.DFSOutputStream stores the exception and throws it when we call DFSClient.close(). HLog.close() calls writer.close()/DFSClient.close() after interrupting the LogSyncer, and there is no catch for the exception from close(), so the Master throws the exception to the client. There is no need to throw this exception; furthermore, the hlog is of no use. 
Our cluster is 0.90; the logs are attached. After closing the hlog writer, there is no log for the createTable(). In trunk and 0.92, 0.94 we use just one hlog, and if the exception happens the client will get createTable failed, but indeed all the regions for the table can still be assigned. I will give the patch for this later. HLog can be null when create table -- Key: HBASE-6327 URL: https://issues.apache.org/jira/browse/HBASE-6327 Project: HBase Issue Type: Bug Reporter: ShiXing Assignee: ShiXing Attachments: createTableFailedMaster.log As HBASE-4010 discussed, the HLog can be null. We have seen createTable fail because of the unused hlog. When createHRegion runs, the HLog.LogSyncer is running sync(), which in the underlying layer calls DFSClient.DFSOutputStream.sync(). Then hlog.closeAndDelete() is called; first, HLog.close() interrupts the LogSyncer, which interrupts DFSClient.DFSOutputStream.sync(). The DFSClient.DFSOutputStream stores the exception and throws it when we call DFSClient.close(). HLog.close() calls writer.close()/DFSClient.close() after interrupting the LogSyncer, and there is no catch for the exception from close(), so the Master throws the exception to the client. There is no need to throw this exception; furthermore, the hlog is of no use. Our cluster is 0.90; the logs are attached. After closing the hlog writer, there is no log for the createTable(). In trunk and 0.92, 0.94 we use just one hlog, and if the exception happens the client will get createTable failed, but indeed we expect all the regions for the table can still be assigned. I will give the patch for this later. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6283) [region_mover.rb] Add option to exclude list of hosts on unload instead of just assuming the source node.
[ https://issues.apache.org/jira/browse/HBASE-6283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406614#comment-13406614 ] Jonathan Hsieh commented on HBASE-6283: --- bq. He has contrib'd the non-SU stuff: i.e. the bit where can register in zk what regionservers are being rolled. I diffed his region_mover.rb script from trunk's and there are still some significant differences between the two related to the zk bits in the ruby script side. In my case, I'm trying to help a customer in a particular situation who is still on 0.90 (didn't get included as part of HBASE-4298) so the draining zk bit isn't going to be helpful. For this patch, I think I'll tweak to address your comments, commit to trunk (should I do the other versions too?), and then we should encourage aravind to contribute/port the jruby bits as well. Sound good? [region_mover.rb] Add option to exclude list of hosts on unload instead of just assuming the source node. - Key: HBASE-6283 URL: https://issues.apache.org/jira/browse/HBASE-6283 Project: HBase Issue Type: Improvement Reporter: Jonathan Hsieh Assignee: Jonathan Hsieh Labels: jruby Attachments: hbase-6283.patch Currently, the region_mover.rb script excludes a single host, the host offloading data, as a region move target. This essentially limits the number of machines that can be shut down at a time to one. For larger clusters, it is manageable to have several nodes down at a time and desirable to get this process done more quickly. The proposed patch adds an exclude file option that allows multiple hosts to be excluded as targets. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6327) HLog can be null when create table
[ https://issues.apache.org/jira/browse/HBASE-6327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ShiXing updated HBASE-6327: --- Attachment: HBASE-6327-trunk-V1.patch The trunk code of HLog.sync() uses group sync, so the interruption as described will not affect createTable(). But I think we can save a little time and simplify the createTable() logic. There is no unit test. HLog can be null when create table -- Key: HBASE-6327 URL: https://issues.apache.org/jira/browse/HBASE-6327 Project: HBase Issue Type: Bug Reporter: ShiXing Assignee: ShiXing Attachments: HBASE-6327-trunk-V1.patch, createTableFailedMaster.log As HBASE-4010 discussed, the HLog can be null. We have seen createTable fail because of the unused hlog. When createHRegion runs, the HLog.LogSyncer is running sync(), which in the underlying layer calls DFSClient.DFSOutputStream.sync(). Then hlog.closeAndDelete() is called; first, HLog.close() interrupts the LogSyncer, which interrupts DFSClient.DFSOutputStream.sync(). The DFSClient.DFSOutputStream stores the exception and throws it when we call DFSClient.close(). HLog.close() calls writer.close()/DFSClient.close() after interrupting the LogSyncer, and there is no catch for the exception from close(), so the Master throws the exception to the client. There is no need to throw this exception; furthermore, the hlog is of no use. Our cluster is 0.90; the logs are attached. After closing the hlog writer, there is no log for the createTable(). In trunk and 0.92, 0.94 we use just one hlog, and if the exception happens the client will get createTable failed, but indeed we expect all the regions for the table can still be assigned. I will give the patch for this later. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6117) Revisit default condition added to Switch cases in Trunk
[ https://issues.apache.org/jira/browse/HBASE-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HBASE-6117: -- Attachment: HBASE-6117_1.patch This is what I committed. Thanks for the review Stack. Revisit default condition added to Switch cases in Trunk Key: HBASE-6117 URL: https://issues.apache.org/jira/browse/HBASE-6117 Project: HBase Issue Type: Bug Affects Versions: 0.96.0 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 0.96.0 Attachments: HBASE-6117.patch, HBASE-6117_1.patch We found that in some cases the default case in the switch block was just throwing IllegalArgumentException. There are cases where we may get some other state for which we should not throw IllegalArgumentException. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6117) Revisit default condition added to Switch cases in Trunk
[ https://issues.apache.org/jira/browse/HBASE-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HBASE-6117: -- Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Revisit default condition added to Switch cases in Trunk Key: HBASE-6117 URL: https://issues.apache.org/jira/browse/HBASE-6117 Project: HBase Issue Type: Bug Affects Versions: 0.96.0 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 0.96.0 Attachments: HBASE-6117.patch, HBASE-6117_1.patch We found that in some cases the default case in the switch block was just throwing IllegalArgumentException. There are cases where we may get some other state for which we should not throw IllegalArgumentException. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6281) Assignment need not be called for disabling table regions during clean cluster start up.
[ https://issues.apache.org/jira/browse/HBASE-6281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] rajeshbabu updated HBASE-6281: -- Attachment: 6281.addendum Sorry for the mistake. Added addendum addressing Ted's comment. Assignment need not be called for disabling table regions during clean cluster start up. Key: HBASE-6281 URL: https://issues.apache.org/jira/browse/HBASE-6281 Project: HBase Issue Type: Bug Affects Versions: 0.92.1, 0.94.0 Reporter: rajeshbabu Assignee: rajeshbabu Fix For: 0.92.2, 0.96.0, 0.94.1 Attachments: 6281-trunk-v2.txt, 6281.addendum, HBASE-6281_92.patch, HBASE-6281_94.patch, HBASE-6281_94_2.patch, HBASE-6281_trunk.patch Currently during clean cluster start up if there are tables in DISABLING state, we do bulk assignment through assignAllUserRegions() and after region is OPENED in RS, master checks if the table is in DISABLING/DISABLED state (in Am.regionOnline) and again calls unassign. This roundtrip can be avoided even before calling assignment. This JIRA is to address the above scenario. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6328) FSHDFSUtils#recoverFileLease tries to rethrow InterruptedException but actually swallows it
[ https://issues.apache.org/jira/browse/HBASE-6328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-6328: --- Attachment: 6328.v1.patch FSHDFSUtils#recoverFileLease tries to rethrow InterruptedException but actually swallows it --- Key: HBASE-6328 URL: https://issues.apache.org/jira/browse/HBASE-6328 Project: HBase Issue Type: Bug Components: master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Priority: Minor Attachments: 6328.v1.patch Coding error is: {noformat} try { Thread.sleep(1000); } catch (InterruptedException ex) { new InterruptedIOException().initCause(ex); } {noformat} The exception is created but not thrown... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6328) FSHDFSUtils#recoverFileLease tries to rethrow InterruptedException but actually swallows it
[ https://issues.apache.org/jira/browse/HBASE-6328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406660#comment-13406660 ] nkeywal commented on HBASE-6328: Trivial patch. Unit tests ok. Will commit in ~3 days if I don't get a no-go. FSHDFSUtils#recoverFileLease tries to rethrow InterruptedException but actually swallows it --- Key: HBASE-6328 URL: https://issues.apache.org/jira/browse/HBASE-6328 Project: HBase Issue Type: Bug Components: master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Priority: Minor Attachments: 6328.v1.patch Coding error is: {noformat} try { Thread.sleep(1000); } catch (InterruptedException ex) { new InterruptedIOException().initCause(ex); } {noformat} The exception is created but not thrown... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
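For reference, the straightforward fix for the bug quoted above would look like the following sketch (not necessarily the exact patch): actually throw the wrapper exception instead of discarding it, and restore the thread's interrupt flag, since Thread.sleep() clears it when it throws.

```java
import java.io.InterruptedIOException;

// Sketch of the fix: the original code built the InterruptedIOException
// but never threw it, silently swallowing the interrupt.
final class LeaseRecoverySketch {
    static void sleepOneSecond() throws InterruptedIOException {
        try {
            Thread.sleep(1000);
        } catch (InterruptedException ex) {
            // Thread.sleep() cleared the interrupt status; restore it so
            // callers further up can still observe the interruption.
            Thread.currentThread().interrupt();
            InterruptedIOException iioe =
                new InterruptedIOException("interrupted while sleeping");
            iioe.initCause(ex);
            throw iioe; // the missing throw
        }
    }
}
```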
[jira] [Assigned] (HBASE-6328) FSHDFSUtils#recoverFileLease tries to rethrow InterruptedException but actually swallows it
[ https://issues.apache.org/jira/browse/HBASE-6328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal reassigned HBASE-6328: -- Assignee: nkeywal FSHDFSUtils#recoverFileLease tries to rethrow InterruptedException but actually swallows it --- Key: HBASE-6328 URL: https://issues.apache.org/jira/browse/HBASE-6328 Project: HBase Issue Type: Bug Components: master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 6328.v1.patch Coding error is: {noformat} try { Thread.sleep(1000); } catch (InterruptedException ex) { new InterruptedIOException().initCause(ex); } {noformat} The exception is created but not thrown... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4791) Allow Secure Zookeeper JAAS configuration to be programmatically set (rather than only by reading JAAS configuration file)
[ https://issues.apache.org/jira/browse/HBASE-4791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matteo Bertozzi updated HBASE-4791: --- Attachment: (was: HBASE-4791-v1.patch) Allow Secure Zookeeper JAAS configuration to be programmatically set (rather than only by reading JAAS configuration file) -- Key: HBASE-4791 URL: https://issues.apache.org/jira/browse/HBASE-4791 Project: HBase Issue Type: Improvement Components: security, zookeeper Reporter: Eugene Koontz Assignee: Eugene Koontz Labels: security, zookeeper Attachments: DemoConfig.java In the currently proposed fix for HBASE-2418, there must be a JAAS file specified in System.setProperty(java.security.auth.login.config). However, it might be preferable to construct a JAAS configuration programmatically, as is done with secure Hadoop (see https://github.com/apache/hadoop-common/blob/a48eceb62c9b5c1a5d71ee2945d9eea2ed62527b/src/java/org/apache/hadoop/security/UserGroupInformation.java#L175). This would have the benefit of avoiding the use of a system property, allowing an HBase-local configuration setting instead. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4791) Allow Secure Zookeeper JAAS configuration to be programmatically set (rather than only by reading JAAS configuration file)
[ https://issues.apache.org/jira/browse/HBASE-4791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matteo Bertozzi updated HBASE-4791: --- Attachment: (was: HBASE-4791-v0.patch) Allow Secure Zookeeper JAAS configuration to be programmatically set (rather than only by reading JAAS configuration file) -- Key: HBASE-4791 URL: https://issues.apache.org/jira/browse/HBASE-4791 Project: HBase Issue Type: Improvement Components: security, zookeeper Reporter: Eugene Koontz Assignee: Eugene Koontz Labels: security, zookeeper Attachments: DemoConfig.java In the currently proposed fix for HBASE-2418, there must be a JAAS file specified in System.setProperty(java.security.auth.login.config). However, it might be preferable to construct a JAAS configuration programmatically, as is done with secure Hadoop (see https://github.com/apache/hadoop-common/blob/a48eceb62c9b5c1a5d71ee2945d9eea2ed62527b/src/java/org/apache/hadoop/security/UserGroupInformation.java#L175). This would have the benefit of avoiding the use of a system property, allowing an HBase-local configuration setting instead. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6322) Unnecessary creation of finalizers in HTablePool
[ https://issues.apache.org/jira/browse/HBASE-6322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406665#comment-13406665 ] Hudson commented on HBASE-6322: --- Integrated in HBase-0.92 #466 (See [https://builds.apache.org/job/HBase-0.92/466/]) HBASE-6322 Unnecessary creation of finalizers in HTablePool (Revision 1357291) HBASE-6322 Unnecessary creation of finalizers in HTablePool (Revision 1357285) Result = FAILURE stack : Files : * /hbase/branches/0.92/CHANGES.txt stack : Files : * /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/client/HTablePool.java * /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/client/TestHTablePool.java Unnecessary creation of finalizers in HTablePool Key: HBASE-6322 URL: https://issues.apache.org/jira/browse/HBASE-6322 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.92.0, 0.92.1, 0.94.0 Reporter: Ryan Brush Fix For: 0.92.2 Attachments: HBASE-6322-0.92.1.patch, HBASE-6322-trunk.1.patch From a mailing list question: While generating some load against a library that makes extensive use of HTablePool in 0.92, I noticed that the largest heap consumer was java.lang.ref.Finalizer. Digging in, I discovered that HTablePool's internal PooledHTable extends HTable, which instantiates a ThreadPoolExecutor and supporting objects every time a pooled HTable is retrieved. Since ThreadPoolExecutor has a finalizer, it and its dependencies can't get garbage collected until the finalizer runs. The result is by using HTablePool, we're creating a ton of objects to be finalized that are stuck on the heap longer than they should be, creating our largest source of pressure on the garbage collector. It looks like this will also be a problem in 0.94 and trunk. 
The easy fix is just to have PooledHTable implement HTableInterface (rather than subclass HTable), but this does break a unit test that explicitly checks that PooledHTable implements HTable -- I can only assume this test is there for some historical compatibility reason. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
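The delegation fix proposed above can be sketched generically. The interface and class names below are hypothetical stand-ins, not HBase's API: by implementing the interface and forwarding to a shared delegate, no heavy per-checkout object (and thus no finalizer) is created when a pooled handle is handed out.

```java
// Hypothetical sketch of delegation over inheritance: the pooled wrapper
// implements the interface and forwards to a shared delegate, so checking
// out a handle allocates only the thin wrapper.
interface TableHandle {
    byte[] get(byte[] row);
    void close();
}

final class PooledHandle implements TableHandle {
    private final TableHandle delegate;   // shared, created once by the pool
    private final Runnable returnToPool;  // run instead of really closing

    PooledHandle(TableHandle delegate, Runnable returnToPool) {
        this.delegate = delegate;
        this.returnToPool = returnToPool;
    }

    @Override public byte[] get(byte[] row) { return delegate.get(row); }

    // close() returns the handle to the pool; the delegate stays open.
    @Override public void close() { returnToPool.run(); }
}
```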
[jira] [Commented] (HBASE-6281) Assignment need not be called for disabling table regions during clean cluster start up.
[ https://issues.apache.org/jira/browse/HBASE-6281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406664#comment-13406664 ] Hudson commented on HBASE-6281: --- Integrated in HBase-0.92 #466 (See [https://builds.apache.org/job/HBase-0.92/466/]) HBASE-6281 Assignment need not be called for disabling table regions during clean cluster start up (Revision 1357302) Result = FAILURE stack : Files : * /hbase/branches/0.92/CHANGES.txt * /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java * /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java * /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java Assignment need not be called for disabling table regions during clean cluster start up. Key: HBASE-6281 URL: https://issues.apache.org/jira/browse/HBASE-6281 Project: HBase Issue Type: Bug Affects Versions: 0.92.1, 0.94.0 Reporter: rajeshbabu Assignee: rajeshbabu Fix For: 0.92.2, 0.96.0, 0.94.1 Attachments: 6281-trunk-v2.txt, 6281.addendum, HBASE-6281_92.patch, HBASE-6281_94.patch, HBASE-6281_94_2.patch, HBASE-6281_trunk.patch Currently during clean cluster start up if there are tables in DISABLING state, we do bulk assignment through assignAllUserRegions() and after region is OPENED in RS, master checks if the table is in DISABLING/DISABLED state (in Am.regionOnline) and again calls unassign. This roundtrip can be avoided even before calling assignment. This JIRA is to address the above scenario. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4791) Allow Secure Zookeeper JAAS configuration to be programmatically set (rather than only by reading JAAS configuration file)
[ https://issues.apache.org/jira/browse/HBASE-4791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matteo Bertozzi updated HBASE-4791: --- Attachment: HBASE-4791-v1.patch Attached a patch that depends on ZOOKEEPER-1497, just to be able to start secure ZooKeeper from HBase (non-distributed mode), using instead the following hbase-site.xml configuration properties: * hbase.zookeeper.client.keytab.file * hbase.zookeeper.client.kerberos.principal The client properties are used by the HBase Master and Region Servers. * hbase.zookeeper.server.keytab.file * hbase.zookeeper.server.kerberos.principal The server properties are used by the Quorum Peer when ZooKeeper is not external. Allow Secure Zookeeper JAAS configuration to be programmatically set (rather than only by reading JAAS configuration file) -- Key: HBASE-4791 URL: https://issues.apache.org/jira/browse/HBASE-4791 Project: HBase Issue Type: Improvement Components: security, zookeeper Reporter: Eugene Koontz Assignee: Eugene Koontz Labels: security, zookeeper Attachments: DemoConfig.java, HBASE-4791-v1.patch In the currently proposed fix for HBASE-2418, there must be a JAAS file specified in System.setProperty(java.security.auth.login.config). However, it might be preferable to construct a JAAS configuration programmatically, as is done with secure Hadoop (see https://github.com/apache/hadoop-common/blob/a48eceb62c9b5c1a5d71ee2945d9eea2ed62527b/src/java/org/apache/hadoop/security/UserGroupInformation.java#L175). This would have the benefit of avoiding the use of a system property, allowing an HBase-local configuration setting instead. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
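The programmatic JAAS approach discussed in HBASE-4791 - building the login entries in code rather than reading the file named by java.security.auth.login.config - can be sketched as below. The class name and option values are illustrative; this mirrors the general shape of what Hadoop's UserGroupInformation does, not HBase's eventual patch.

```java
import java.util.HashMap;
import java.util.Map;
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.Configuration;

// Hypothetical sketch: a JAAS Configuration built in code from keytab and
// principal settings, instead of a JAAS file pointed to by a system property.
class ProgrammaticJaasConfig extends Configuration {
    private final String keytab;
    private final String principal;

    ProgrammaticJaasConfig(String keytab, String principal) {
        this.keytab = keytab;
        this.principal = principal;
    }

    @Override
    public AppConfigurationEntry[] getAppConfigurationEntry(String appName) {
        Map<String, String> options = new HashMap<>();
        options.put("useKeyTab", "true");
        options.put("storeKey", "true");
        options.put("keyTab", keytab);
        options.put("principal", principal);
        return new AppConfigurationEntry[] {
            new AppConfigurationEntry(
                "com.sun.security.auth.module.Krb5LoginModule",
                AppConfigurationEntry.LoginModuleControlFlag.REQUIRED,
                options)
        };
    }
}
```

Installed via Configuration.setConfiguration(new ProgrammaticJaasConfig(keytab, principal)), this would let the ZooKeeper client pick up its login entry without any JAAS file on disk.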
[jira] [Commented] (HBASE-5955) Guava 11 drops MapEvictionListener and Hadoop 2.0.0-alpha requires it
[ https://issues.apache.org/jira/browse/HBASE-5955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406674#comment-13406674 ] Hudson commented on HBASE-5955: --- Integrated in HBase-0.94 #290 (See [https://builds.apache.org/job/HBase-0.94/290/]) HBASE-5955 Guava 11 drops MapEvictionListener and Hadoop 2.0.0-alpha requires it (Revision 1356379) Result = SUCCESS larsh : Files : * /hbase/branches/0.94/pom.xml * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/slab/SingleSizeCache.java Guava 11 drops MapEvictionListener and Hadoop 2.0.0-alpha requires it - Key: HBASE-5955 URL: https://issues.apache.org/jira/browse/HBASE-5955 Project: HBase Issue Type: Bug Affects Versions: 0.94.0 Reporter: Andrew Purtell Assignee: Lars Hofhansl Fix For: 0.94.1 Attachments: 5955.txt Hadoop 2.0.0-alpha depends on Guava 11.0.2. Updating HBase dependencies to match produces the following compilation errors: {code} [ERROR] SingleSizeCache.java:[41,32] cannot find symbol [ERROR] symbol : class MapEvictionListener [ERROR] location: package com.google.common.collect [ERROR] [ERROR] SingleSizeCache.java:[94,4] cannot find symbol [ERROR] symbol : class MapEvictionListener [ERROR] location: class org.apache.hadoop.hbase.io.hfile.slab.SingleSizeCache [ERROR] [ERROR] SingleSizeCache.java:[94,69] cannot find symbol [ERROR] symbol : class MapEvictionListener [ERROR] location: class org.apache.hadoop.hbase.io.hfile.slab.SingleSizeCache {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
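Behind the compile errors above: Guava 11 dropped com.google.common.collect.MapEvictionListener, and the replacement is a RemovalListener registered on CacheBuilder. To keep this self-contained without a Guava dependency, the underlying pattern - run a callback when a capacity-bounded map evicts an entry - can be sketched with the JDK alone (the class below is a hypothetical illustration, not HBase's SingleSizeCache):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Illustrative stand-in for the eviction-callback pattern that
// MapEvictionListener (and now CacheBuilder's RemovalListener) provides.
class EvictingMap<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;
    private final BiConsumer<K, V> onEvict;

    EvictingMap(int maxEntries, BiConsumer<K, V> onEvict) {
        super(16, 0.75f, true); // access-order iteration, i.e. LRU-style
        this.maxEntries = maxEntries;
        this.onEvict = onEvict;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        if (size() > maxEntries) {
            onEvict.accept(eldest.getKey(), eldest.getValue()); // eviction callback
            return true;
        }
        return false;
    }
}
```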
[jira] [Commented] (HBASE-6281) Assignment need not be called for disabling table regions during clean cluster start up.
[ https://issues.apache.org/jira/browse/HBASE-6281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406675#comment-13406675 ] Hudson commented on HBASE-6281: --- Integrated in HBase-0.94 #290 (See [https://builds.apache.org/job/HBase-0.94/290/]) HBASE-6281 Assignment need not be called for disabling table regions during clean cluster start up. (Rajesh) Submitted by:Rajesh Reviewed by:Ram, Stack, Ted (Revision 1356396) Result = SUCCESS ramkrishna : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java Assignment need not be called for disabling table regions during clean cluster start up. Key: HBASE-6281 URL: https://issues.apache.org/jira/browse/HBASE-6281 Project: HBase Issue Type: Bug Affects Versions: 0.92.1, 0.94.0 Reporter: rajeshbabu Assignee: rajeshbabu Fix For: 0.92.2, 0.96.0, 0.94.1 Attachments: 6281-trunk-v2.txt, 6281.addendum, HBASE-6281_92.patch, HBASE-6281_94.patch, HBASE-6281_94_2.patch, HBASE-6281_trunk.patch Currently during clean cluster start up if there are tables in DISABLING state, we do bulk assignment through assignAllUserRegions() and after region is OPENED in RS, master checks if the table is in DISABLING/DISABLED state (in Am.regionOnline) and again calls unassign. This roundtrip can be avoided even before calling assignment. This JIRA is to address the above scenario. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6303) HCD.setCompressionType should use Enum support for storing compression types as strings
[ https://issues.apache.org/jira/browse/HBASE-6303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406676#comment-13406676 ] Hudson commented on HBASE-6303: --- Integrated in HBase-0.94 #290 (See [https://builds.apache.org/job/HBase-0.94/290/]) Amend HBASE-6303. Likewise for HCD.setCompactionCompressionType (Revision 1356569) HBASE-6303. HCD.setCompressionType should use Enum support for storing compression types as strings (Revision 1356518) Result = SUCCESS apurtell : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java apurtell : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java HCD.setCompressionType should use Enum support for storing compression types as strings --- Key: HBASE-6303 URL: https://issues.apache.org/jira/browse/HBASE-6303 Project: HBase Issue Type: Bug Components: io Affects Versions: 0.94.0, 0.96.0 Reporter: Gopinathan A Assignee: Andrew Purtell Priority: Minor Fix For: 0.96.0, 0.94.1 Attachments: HBASE-6303-0.94.patch, HBASE-6303-addendum-0.94.patch, HBASE-6303-addendum-trunk.patch, HBASE-6303-trunk.patch Let's not require an update to HCD every time the HFile compression enum is changed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
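The idea in HBASE-6303 above - derive the stored string from the enum itself so that adding a compression type requires no HCD change - can be sketched as follows. Algorithm and CompressionSetting here are simplified stand-ins, not HBase's actual classes:

```java
// Hypothetical stand-in for HBase's compression enum.
enum Algorithm { NONE, GZ, LZO, SNAPPY }

// Enum.name()/valueOf() track the enum automatically: adding a new constant
// needs no change here, unlike a hand-maintained switch over each case.
final class CompressionSetting {
    private String value = Algorithm.NONE.name();

    void set(Algorithm algo) {
        this.value = algo.name(); // store the enum's own string form
    }

    Algorithm get() {
        return Algorithm.valueOf(value); // round-trips for every constant
    }
}
```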
[jira] [Commented] (HBASE-5876) TestImportExport has been failing against hadoop 0.23 profile
[ https://issues.apache.org/jira/browse/HBASE-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406709#comment-13406709 ] Jonathan Hsieh commented on HBASE-5876: --- It's been a few days; I'm going to commit later today unless I hear anything suggesting not to. TestImportExport has been failing against hadoop 0.23 profile - Key: HBASE-5876 URL: https://issues.apache.org/jira/browse/HBASE-5876 Project: HBase Issue Type: Bug Affects Versions: 0.94.0, 0.96.0 Reporter: Zhihong Ted Yu Assignee: Jonathan Hsieh Fix For: 0.96.0, 0.94.1 Attachments: hbase-5876-94-v3.patch, hbase-5876-94.patch, hbase-5876-trunk-v3.patch, hbase-5876-v2.patch, hbase-5876.patch TestImportExport has been failing against hadoop 0.23 profile -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6281) Assignment need not be called for disabling table regions during clean cluster start up.
[ https://issues.apache.org/jira/browse/HBASE-6281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406711#comment-13406711 ] stack commented on HBASE-6281: -- Added the addendum. Thanks Rajesh. Assignment need not be called for disabling table regions during clean cluster start up. Key: HBASE-6281 URL: https://issues.apache.org/jira/browse/HBASE-6281 Project: HBase Issue Type: Bug Affects Versions: 0.92.1, 0.94.0 Reporter: rajeshbabu Assignee: rajeshbabu Fix For: 0.92.2, 0.96.0, 0.94.1 Attachments: 6281-trunk-v2.txt, 6281.addendum, HBASE-6281_92.patch, HBASE-6281_94.patch, HBASE-6281_94_2.patch, HBASE-6281_trunk.patch Currently during clean cluster start up if there are tables in DISABLING state, we do bulk assignment through assignAllUserRegions() and after region is OPENED in RS, master checks if the table is in DISABLING/DISABLED state (in Am.regionOnline) and again calls unassign. This roundtrip can be avoided even before calling assignment. This JIRA is to address the above scenario. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6283) [region_mover.rb] Add option to exclude list of hosts on unload instead of just assuming the source node.
[ https://issues.apache.org/jira/browse/HBASE-6283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406712#comment-13406712 ] stack commented on HBASE-6283: -- Other versions, yes. Aravind doesn't work on this stuff any more. If you open new issue, one of the two of us can take in the diff. Good on you J. [region_mover.rb] Add option to exclude list of hosts on unload instead of just assuming the source node. - Key: HBASE-6283 URL: https://issues.apache.org/jira/browse/HBASE-6283 Project: HBase Issue Type: Improvement Reporter: Jonathan Hsieh Assignee: Jonathan Hsieh Labels: jruby Attachments: hbase-6283.patch Currently, the region_mover.rb script excludes a single host, the host offloading data, as a region move target. This essentially limits the number of machines that can be shutdown at a time to one. For larger clusters, it is manageable to have several nodes down at a time and desirable to get this process done more quickly. The proposed patch adds an exclude file option, that allows multiple hosts to be excluded as targets. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6228) Fixup daughters twice cause daughter region assigned twice
[ https://issues.apache.org/jira/browse/HBASE-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406715#comment-13406715 ] stack commented on HBASE-6228: -- I've been looking at hbase-6060 as a background task (sorry, it's taking me a while Ram and Rajesh to get back to you lot). When I put together multiple threads (SSH, HMaster joining cluster, single vs bulk assigning/unassign, timeout monitor, zk callbacks etc.) and then try to trace state changes not only across multiple state keepers (RegionState, RegionInTransition, AM#this.regions and AM#this.servers) in the master process but then also x-process master - regionserver - via zk, I want to throw out what we have and start over (smile). That ain't going to happen though. Meantime I think we need to identify patterns or practices and broadcast them so all can sign on. For example, I appreciate stuff like Jimmy's small win simplifying AM by breaking out RegionStates into a standalone class apart from AM. This at least collects a bunch of in-memory state in the one place. We also need to have more tests I'd say so we can have some confidence stuff still works when we shift things around. Fixup daughters twice cause daughter region assigned twice --- Key: HBASE-6228 URL: https://issues.apache.org/jira/browse/HBASE-6228 Project: HBase Issue Type: Bug Components: master Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.96.0 Attachments: HBASE-6228.patch, HBASE-6228v2.patch, HBASE-6228v2.patch, HBASE-6228v3.patch, HBASE-6228v4.patch First, how does fixing up daughters twice happen? 1. We call fixupDaughters at the end of HMaster#finishInitialization. 2. ServerShutdownHandler will fixupDaughters when reassigning a region through ServerShutdownHandler#processDeadRegion. When fixing up daughters, we add the daughters to .META., but that couldn't prevent the above case, because of FindDaughterVisitor. 
The details are as follows: Suppose region A is a split parent region, and its daughter region B is missing. 1. First, the ServerShutdownHandler thread fixes up the daughter: it adds daughter region B to .META. with serverName=null and assigns the daughter. 2. Then, the Master's initialization thread will also find that daughter region B is missing and assign it. This is because FindDaughterVisitor considers a daughter missing if its serverName=null. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Reopened] (HBASE-6322) Unnecessary creation of finalizers in HTablePool
[ https://issues.apache.org/jira/browse/HBASE-6322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu reopened HBASE-6322: --- From https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/466/testReport/org.apache.hadoop.hbase.rest/TestTableResource/testTableInfoText/: {code} java.lang.AssertionError: expected:500 but was:200 at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) at org.junit.Assert.assertEquals(Assert.java:472) at org.junit.Assert.assertEquals(Assert.java:456) at org.apache.hadoop.hbase.rest.TestTableResource.testTableInfoText(TestTableResource.java:215) {code} The failure is reproducible locally. In the same test output you can see: {code} 2012-07-04 18:53:29,338 ERROR [2535725@qtp-29012646-0] log.Slf4jLog(87): /TestTableResource/regions java.lang.ClassCastException: org.apache.hadoop.hbase.client.HTablePool$PooledHTable cannot be cast to org.apache.hadoop.hbase.client.HTable {code} Unnecessary creation of finalizers in HTablePool Key: HBASE-6322 URL: https://issues.apache.org/jira/browse/HBASE-6322 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.92.0, 0.92.1, 0.94.0 Reporter: Ryan Brush Fix For: 0.92.2 Attachments: HBASE-6322-0.92.1.patch, HBASE-6322-trunk.1.patch From a mailing list question: While generating some load against a library that makes extensive use of HTablePool in 0.92, I noticed that the largest heap consumer was java.lang.ref.Finalizer. Digging in, I discovered that HTablePool's internal PooledHTable extends HTable, which instantiates a ThreadPoolExecutor and supporting objects every time a pooled HTable is retrieved. Since ThreadPoolExecutor has a finalizer, it and its dependencies can't get garbage collected until the finalizer runs. 
The result is by using HTablePool, we're creating a ton of objects to be finalized that are stuck on the heap longer than they should be, creating our largest source of pressure on the garbage collector. It looks like this will also be a problem in 0.94 and trunk. The easy fix is just to have PooledHTable implement HTableInterface (rather than subclass HTable), but this does break a unit test that explicitly checks that PooledHTable implements HTable -- I can only assume this test is there for some historical compatibility reason. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-6328) FSHDFSUtils#recoverFileLease tries to rethrow InterruptedException but actually swallows it
[ https://issues.apache.org/jira/browse/HBASE-6328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-6328: -- Hadoop Flags: Reviewed Status: Patch Available (was: Open) Looks good to me. FSHDFSUtils#recoverFileLease tries to rethrow InterruptedException but actually swallows it --- Key: HBASE-6328 URL: https://issues.apache.org/jira/browse/HBASE-6328 Project: HBase Issue Type: Bug Components: master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 6328.v1.patch Coding error is: {noformat} try { Thread.sleep(1000); } catch (InterruptedException ex) { new InterruptedIOException().initCause(ex); } {noformat} The exception is created but not thrown... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406735#comment-13406735 ] Hadoop QA commented on HBASE-6299: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12535045/HBASE-6299-v2.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings). -1 findbugs. The patch appears to introduce 7 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.io.hfile.TestForceCacheImportantBlocks Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2318//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2318//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2318//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2318//console This message is automatically generated. RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems. 
- Key: HBASE-6299 URL: https://issues.apache.org/jira/browse/HBASE-6299 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.6, 0.94.0 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Critical Attachments: HBASE-6299-v2.patch, HBASE-6299.patch 1. HMaster tries to assign a region to an RS. 2. HMaster creates a RegionState for this region and puts it into regionsInTransition. 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts to proceed, with success eventually. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out. 4. HMaster attemps to assign for a second time, choosing another RS. 5. But since the HMaster's OpenedRegionHandler has been triggered by the region open of the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster finds invalid and ignores the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt. 6. The unassigned ZK node stays and a later unassign fails coz RS_ZK_REGION_CLOSING cannot be created. {code} 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.; plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-004,60020,1340890123243, dest=swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. 
to swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:28,882 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,291 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG
[jira] [Updated] (HBASE-6328) FSHDFSUtils#recoverFileLease tries to rethrow InterruptedException but actually swallows it
[ https://issues.apache.org/jira/browse/HBASE-6328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-6328: -- Fix Version/s: 0.94.1 0.96.0 0.92.2 I found the same code in 0.92 and 0.94 FSHDFSUtils#recoverFileLease tries to rethrow InterruptedException but actually swallows it --- Key: HBASE-6328 URL: https://issues.apache.org/jira/browse/HBASE-6328 Project: HBase Issue Type: Bug Components: master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.92.2, 0.96.0, 0.94.1 Attachments: 6328.v1.patch The coding error is:
{noformat}
try {
  Thread.sleep(1000);
} catch (InterruptedException ex) {
  new InterruptedIOException().initCause(ex);
}
{noformat}
The exception is created but not thrown... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
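A minimal, self-contained sketch of the corrected handler (the sleep is shortened and the interrupt flag pre-set so the path can be exercised directly; this illustrates the fix, not the exact patch):

```java
import java.io.InterruptedIOException;

public class RecoverLeaseSketch {
    static void sleepOrRethrow() throws InterruptedIOException {
        try {
            Thread.sleep(10);
        } catch (InterruptedException ex) {
            // The fix: actually throw the wrapped exception instead of dropping it.
            InterruptedIOException iioe = new InterruptedIOException();
            iioe.initCause(ex);
            throw iioe;
        }
    }

    public static void main(String[] args) {
        // Pre-set the interrupt flag so sleep() throws immediately.
        Thread.currentThread().interrupt();
        try {
            sleepOrRethrow();
            System.out.println("no exception");
        } catch (InterruptedIOException e) {
            System.out.println("rethrown: " + (e.getCause() instanceof InterruptedException));
        }
    }
}
```

Because initCause() returns Throwable, the wrapped exception is built in two steps before the throw.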
[jira] [Updated] (HBASE-6296) Refactor EventType to track its own ExecutorService type
[ https://issues.apache.org/jira/browse/HBASE-6296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-6296: - Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Committed to trunk. I ran tests locally w/ this patch applied and no failures or errors. Thanks for the clean up Jesse. Refactor EventType to track its own ExecutorService type Key: HBASE-6296 URL: https://issues.apache.org/jira/browse/HBASE-6296 Project: HBase Issue Type: Improvement Components: master Affects Versions: 0.96.0 Reporter: Jesse Yates Assignee: Jesse Yates Priority: Minor Fix For: 0.96.0 Attachments: 6296v1.txt, 6296v1.txt, java_hbase-6296-v0.patch, java_hbase-6296-v0.patch Currently there is a massive switch statement in org.apache.hadoop.hbase.executor.ExecutorService for the ExecutorType for each org.apache.hadoop.hbase.executor.EventHandler.EventType. This means if you add a new event type, you will also have to change the ExecutorService file, if only to add the executor type. Instead, the EventType should just be able to keep track of which executor it should use. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
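The shape of the refactor can be sketched as follows; the enum constants and executor names here are illustrative stand-ins, not HBase's actual identifiers:

```java
public class EventTypeSketch {
    enum ExecutorType { MASTER_OPEN_REGION, MASTER_CLOSE_REGION }

    // Each event type carries its own executor type, so the central
    // switch statement in ExecutorService is no longer needed.
    enum EventType {
        RS_ZK_REGION_OPENED(ExecutorType.MASTER_OPEN_REGION),
        RS_ZK_REGION_CLOSED(ExecutorType.MASTER_CLOSE_REGION);

        private final ExecutorType executorType;

        EventType(ExecutorType executorType) { this.executorType = executorType; }

        ExecutorType getExecutorServiceType() { return executorType; }
    }

    public static void main(String[] args) {
        System.out.println(EventType.RS_ZK_REGION_OPENED.getExecutorServiceType());
    }
}
```

With this shape, adding a new event type touches only the enum itself.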
[jira] [Commented] (HBASE-6228) Fixing up daughters twice causes daughter region to be assigned twice
[ https://issues.apache.org/jira/browse/HBASE-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406753#comment-13406753 ] stack commented on HBASE-6228: -- For example, on study: + RegionState is unreliable figuring state of region in master's memory; you cannot rely on it to answer the bigger question of who a region belongs to: master or regionserver. + In assign, we are careful with retries. We actually need to be more careful especially around things like socket timeout (see Maryann's recent issue). Bulk assign does none of these checks. Bulk assign was introduced originally to do assigns on cluster start; if anything failed, contract was we'd just crash out and restart cluster over. That was how it was originally. Now bulk assign is used all over -- e.g. in SSH -- in spite of its being loosey-goosey around failures. Fixing up daughters twice causes daughter region to be assigned twice --- Key: HBASE-6228 URL: https://issues.apache.org/jira/browse/HBASE-6228 Project: HBase Issue Type: Bug Components: master Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.96.0 Attachments: HBASE-6228.patch, HBASE-6228v2.patch, HBASE-6228v2.patch, HBASE-6228v3.patch, HBASE-6228v4.patch First, how does fixing up daughters twice happen?
1. We call fixupDaughters at the end of HMaster#finishInitialization.
2. ServerShutdownHandler will fixupDaughters when reassigning regions through ServerShutdownHandler#processDeadRegion.
When fixing up daughters, we add the daughters to .META., but that couldn't prevent the above case, because of FindDaughterVisitor. The details are as follows. Suppose region A is a split parent region, and its daughter region B is missing:
1. First, the ServerShutdownHandler thread fixes up the daughter, adding daughter region B to .META. with serverName=null, and assigns the daughter.
2. Then, the Master's initialization thread will also find that daughter region B is missing and assign it.
This is because FindDaughterVisitor considers a daughter missing if its serverName=null. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6328) FSHDFSUtils#recoverFileLease tries to rethrow InterruptedException but actually swallows it
[ https://issues.apache.org/jira/browse/HBASE-6328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406755#comment-13406755 ] stack commented on HBASE-6328: -- +1 on patch. Apply it to all branches I'd say Nicolas. FSHDFSUtils#recoverFileLease tries to rethrow InterruptedException but actually swallows it --- Key: HBASE-6328 URL: https://issues.apache.org/jira/browse/HBASE-6328 Project: HBase Issue Type: Bug Components: master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.92.2, 0.96.0, 0.94.1 Attachments: 6328.v1.patch The coding error is:
{noformat}
try {
  Thread.sleep(1000);
} catch (InterruptedException ex) {
  new InterruptedIOException().initCause(ex);
}
{noformat}
The exception is created but not thrown... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6322) Unnecessary creation of finalizers in HTablePool
[ https://issues.apache.org/jira/browse/HBASE-6322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406757#comment-13406757 ] Ryan Brush commented on HBASE-6322: --- My apologies...I had only run the tests around HTablePool since I had run into some apparently unrelated test failures in a full build. (And I didn't expect it to be included in the build so quickly. ;) It looks like we'll need to do some refactoring in REST server's RegionResource for this to apply cleanly, specifically the call to HTable.getRegionsInfo which requires the downcast (and is deprecated, anyway). I'm not deeply familiar with this part of the code base and probably won't be able to dig into it today, but can get back to it in the next couple days, as well as making sure there aren't any further regressions caused by this change. Unnecessary creation of finalizers in HTablePool Key: HBASE-6322 URL: https://issues.apache.org/jira/browse/HBASE-6322 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.92.0, 0.92.1, 0.94.0 Reporter: Ryan Brush Fix For: 0.92.2 Attachments: HBASE-6322-0.92.1.patch, HBASE-6322-trunk.1.patch From a mailing list question: While generating some load against a library that makes extensive use of HTablePool in 0.92, I noticed that the largest heap consumer was java.lang.ref.Finalizer. Digging in, I discovered that HTablePool's internal PooledHTable extends HTable, which instantiates a ThreadPoolExecutor and supporting objects every time a pooled HTable is retrieved. Since ThreadPoolExecutor has a finalizer, it and its dependencies can't get garbage collected until the finalizer runs. The result is by using HTablePool, we're creating a ton of objects to be finalized that are stuck on the heap longer than they should be, creating our largest source of pressure on the garbage collector. It looks like this will also be a problem in 0.94 and trunk. 
The easy fix is just to have PooledHTable implement HTableInterface (rather than subclass HTable), but this does break a unit test that explicitly checks that PooledHTable implements HTable -- I can only assume this test is there for some historical passivity reason. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
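The proposed fix is the classic composition-over-inheritance move; a hedged sketch with simplified stand-in types (not the real HTable/HTableInterface signatures):

```java
public class PoolSketch {
    interface TableInterface {
        byte[] get(String row);
    }

    // Stand-in for HTable: imagine each instance allocating a
    // ThreadPoolExecutor, whose finalizer delays garbage collection.
    static class HeavyTable implements TableInterface {
        public byte[] get(String row) { return row.getBytes(); }
    }

    // Stand-in for PooledHTable: implements the interface and delegates,
    // so no per-wrapper thread pool (and no finalizer) is created.
    static class PooledTable implements TableInterface {
        private final TableInterface delegate;
        PooledTable(TableInterface delegate) { this.delegate = delegate; }
        public byte[] get(String row) { return delegate.get(row); }
    }

    public static void main(String[] args) {
        TableInterface table = new PooledTable(new HeavyTable());
        System.out.println(new String(table.get("row1")));
    }
}
```

Callers that were typed against the interface keep working; only code that downcast to the concrete class (like the unit test mentioned above) breaks.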
[jira] [Commented] (HBASE-6322) Unnecessary creation of finalizers in HTablePool
[ https://issues.apache.org/jira/browse/HBASE-6322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406760#comment-13406760 ] stack commented on HBASE-6322: -- Thanks Ted. I reverted the patch for now. Thank your time Ryan. Patch looks worth it if you can figure the test fail. Thanks. Unnecessary creation of finalizers in HTablePool Key: HBASE-6322 URL: https://issues.apache.org/jira/browse/HBASE-6322 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.92.0, 0.92.1, 0.94.0 Reporter: Ryan Brush Fix For: 0.92.2 Attachments: HBASE-6322-0.92.1.patch, HBASE-6322-trunk.1.patch From a mailing list question: While generating some load against a library that makes extensive use of HTablePool in 0.92, I noticed that the largest heap consumer was java.lang.ref.Finalizer. Digging in, I discovered that HTablePool's internal PooledHTable extends HTable, which instantiates a ThreadPoolExecutor and supporting objects every time a pooled HTable is retrieved. Since ThreadPoolExecutor has a finalizer, it and its dependencies can't get garbage collected until the finalizer runs. The result is by using HTablePool, we're creating a ton of objects to be finalized that are stuck on the heap longer than they should be, creating our largest source of pressure on the garbage collector. It looks like this will also be a problem in 0.94 and trunk. The easy fix is just to have PooledHTable implement HTableInterface (rather than subclass HTable), but this does break a unit test that explicitly checks that PooledHTable implements HTable -- I can only assume this test is there for some historical passivity reason. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5549) Master can fail if ZooKeeper session expires
[ https://issues.apache.org/jira/browse/HBASE-5549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406763#comment-13406763 ] Himanshu Vashishtha commented on HBASE-5549: This seems to fix the HBaseTestingUtility#expireSession method, as it introduces new logic that creates a monitor and then expires the session. But it seems this fix needs some more work: for example, TestReplicationPeer occasionally fails even with this change on trunk, citing improper session termination.
{code}
testResetZooKeeperSession(org.apache.hadoop.hbase.replication.TestReplicationPeer): ReplicationPeer ZooKeeper session was not properly expired.
{code}
On another note, I wonder whether this patch can be backported to 0.92/0.94? Master can fail if ZooKeeper session expires Key: HBASE-5549 URL: https://issues.apache.org/jira/browse/HBASE-5549 Project: HBase Issue Type: Bug Components: master, zookeeper Affects Versions: 0.96.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5549.v10.patch, 5549.v11.patch, 5549.v6.patch, 5549.v7.patch, 5549.v8.patch, 5549.v9.patch, nochange.patch There is a retry mechanism in RecoverableZooKeeper, but when the session expires, the whole ZooKeeperWatcher is recreated, hence the retry mechanism does not work in this case. This is why a sleep is needed in TestZooKeeper#testMasterSessionExpired: we need to wait for the ZooKeeperWatcher to be recreated before using the connection. This can happen in real life; it can happen when:
- master zookeeper starts
- zookeeper connection is cut
- master enters the retry loop
- in the meantime the session expires
- the network comes back, the session is recreated
- the retries continue, but on the wrong object, hence fail.
-- This message is automatically generated by JIRA.
[jira] [Commented] (HBASE-6293) HMaster does not go down while splitting logs even if explicit shutdown is called.
[ https://issues.apache.org/jira/browse/HBASE-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406764#comment-13406764 ] Hadoop QA commented on HBASE-6293: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12534877/6293.txt against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings). -1 findbugs. The patch appears to introduce 7 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.security.access.TestZKPermissionsWatcher Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2320//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2320//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2320//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2320//console This message is automatically generated. HMaster does not go down while splitting logs even if explicit shutdown is called. 
-- Key: HBASE-6293 URL: https://issues.apache.org/jira/browse/HBASE-6293 Project: HBase Issue Type: Bug Affects Versions: 0.92.1, 0.94.0 Reporter: rajeshbabu Assignee: rajeshbabu Fix For: 0.92.2, 0.96.0, 0.94.1 Attachments: 6293.txt When the master starts up and tries to do the log split, in case of any error we retry infinitely in a loop until it succeeds. But now if we get a shutdown call, inside SplitLogManager
{code}
if (stopper.isStopped()) {
  LOG.warn("Stopped while waiting for log splits to be completed");
  return;
}
{code}
Here we know that the master has stopped. As the task may not be completed now
{code}
if (batch.done != batch.installed) {
  batch.isDead = true;
  tot_mgr_log_split_batch_err.incrementAndGet();
  LOG.warn("error while splitting logs in " + logDirs + " installed = "
      + batch.installed + " but only " + batch.done + " done");
  throw new IOException("error or interrupt while splitting logs in "
      + logDirs + " Task = " + batch);
}
{code}
we throw an exception. In MasterFileSystem.splitLogAfterStartup() we don't check whether the master is stopped, and we keep trying continuously. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
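The missing check amounts to consulting the stopper inside the retry loop itself; a toy sketch under assumed names (the Stoppable interface and loop structure are simplified, not the real MasterFileSystem code):

```java
public class SplitRetrySketch {
    interface Stoppable {
        boolean isStopped();
    }

    // Returns the number of split attempts made before stopping or giving up.
    static int splitLogsWithRetries(Stoppable stopper, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            if (stopper.isStopped()) {
                // Master is shutting down: bail out instead of retrying forever.
                return attempts;
            }
            attempts++; // simulate a failed split attempt, then retry
        }
        return attempts;
    }

    public static void main(String[] args) {
        // A stopper that reports "stopped" immediately: no attempts are made.
        System.out.println(splitLogsWithRetries(() -> true, 100));
    }
}
```

Without the isStopped() check, the loop above would spin through all the attempts even after an explicit shutdown, which is the behavior the issue describes.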
[jira] [Commented] (HBASE-6288) In hbase-daemons.sh, description of the default backup-master file path is wrong
[ https://issues.apache.org/jira/browse/HBASE-6288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406767#comment-13406767 ] stack commented on HBASE-6288: -- Sounds right Benjamin. Can you make a patch w/ your fix? In hbase-daemons.sh, description of the default backup-master file path is wrong Key: HBASE-6288 URL: https://issues.apache.org/jira/browse/HBASE-6288 Project: HBase Issue Type: Task Components: master, scripts, shell Affects Versions: 0.92.0, 0.92.1, 0.94.0 Reporter: Benjamin Kim In hbase-daemons.sh, description of the default backup-master file path is wrong {code} # HBASE_BACKUP_MASTERS File naming remote hosts. # Default is ${HADOOP_CONF_DIR}/backup-masters {code} it says the default backup-masters file path is at a hadoop-conf-dir, but shouldn't this be HBASE_CONF_DIR? also adding following lines to conf/hbase-env.sh would be helpful {code} # File naming hosts on which backup HMaster will run. $HBASE_HOME/conf/backup-masters by default. export HBASE_BACKUP_MASTERS=${HBASE_HOME}/conf/backup-masters {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6305) TestLocalHBaseCluster hangs with hadoop 2.0/0.23 builds.
[ https://issues.apache.org/jira/browse/HBASE-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406771#comment-13406771 ] Hadoop QA commented on HBASE-6305: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12534998/hbase-6305-94.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2321//console This message is automatically generated. TestLocalHBaseCluster hangs with hadoop 2.0/0.23 builds. Key: HBASE-6305 URL: https://issues.apache.org/jira/browse/HBASE-6305 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.92.2, 0.94.1 Reporter: Jonathan Hsieh Assignee: Jonathan Hsieh Fix For: 0.92.2, 0.94.1 Attachments: hbase-6305-94.patch trunk: mvn clean test -Dhadoop.profile=2.0 -Dtest=TestLocalHBaseCluster 0.94: mvn clean test -Dhadoop.profile=23 -Dtest=TestLocalHBaseCluster {code} testLocalHBaseCluster(org.apache.hadoop.hbase.TestLocalHBaseCluster) Time elapsed: 0.022 sec ERROR! java.lang.RuntimeException: Master not initialized after 200 seconds at org.apache.hadoop.hbase.util.JVMClusterUtil.startup(JVMClusterUtil.java:208) at org.apache.hadoop.hbase.LocalHBaseCluster.startup(LocalHBaseCluster.java:424) at org.apache.hadoop.hbase.TestLocalHBaseCluster.testLocalHBaseCluster(TestLocalHBaseCluster.java:66) ... {code} -- This message is automatically generated by JIRA. 
[jira] [Commented] (HBASE-6027) Update the reference guide to reflect the changes in the security profile
[ https://issues.apache.org/jira/browse/HBASE-6027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406772#comment-13406772 ] Hadoop QA commented on HBASE-6027: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12534996/6027-1.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +0 tests included. The patch appears to be a documentation patch that doesn't require tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2322//console This message is automatically generated. Update the reference guide to reflect the changes in the security profile - Key: HBASE-6027 URL: https://issues.apache.org/jira/browse/HBASE-6027 Project: HBase Issue Type: Bug Components: documentation Affects Versions: 0.96.0 Reporter: Devaraj Das Assignee: Devaraj Das Fix For: 0.96.0 Attachments: 6027-1.patch The refguide needs to be updated to reflect the fact that there is no security profile anymore, etc. [Follow up to HBASE-5732] -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6328) FSHDFSUtils#recoverFileLease tries to rethrow InterruptedException but actually swallows it
[ https://issues.apache.org/jira/browse/HBASE-6328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406791#comment-13406791 ] Hadoop QA commented on HBASE-6328: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12535129/6328.v1.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings). -1 findbugs. The patch appears to introduce 7 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.catalog.TestMetaReaderEditor Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2323//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2323//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2323//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2323//console This message is automatically generated. 
FSHDFSUtils#recoverFileLease tries to rethrow InterruptedException but actually swallows it --- Key: HBASE-6328 URL: https://issues.apache.org/jira/browse/HBASE-6328 Project: HBase Issue Type: Bug Components: master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.92.2, 0.96.0, 0.94.1 Attachments: 6328.v1.patch The coding error is:
{noformat}
try {
  Thread.sleep(1000);
} catch (InterruptedException ex) {
  new InterruptedIOException().initCause(ex);
}
{noformat}
The exception is created but not thrown... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6322) Unnecessary creation of finalizers in HTablePool
[ https://issues.apache.org/jira/browse/HBASE-6322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406792#comment-13406792 ] Zhihong Ted Yu commented on HBASE-6322: --- Looks like we can add the following to HTableInterface in trunk:
{code}
public NavigableMap<HRegionInfo, ServerName> getRegionLocations() throws IOException {
{code}
so that RegionsResource can use it instead of getRegionsInfo(). And we don't need a cast in getTableRegions(). Unnecessary creation of finalizers in HTablePool Key: HBASE-6322 URL: https://issues.apache.org/jira/browse/HBASE-6322 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.92.0, 0.92.1, 0.94.0 Reporter: Ryan Brush Fix For: 0.92.2 Attachments: HBASE-6322-0.92.1.patch, HBASE-6322-trunk.1.patch From a mailing list question: While generating some load against a library that makes extensive use of HTablePool in 0.92, I noticed that the largest heap consumer was java.lang.ref.Finalizer. Digging in, I discovered that HTablePool's internal PooledHTable extends HTable, which instantiates a ThreadPoolExecutor and supporting objects every time a pooled HTable is retrieved. Since ThreadPoolExecutor has a finalizer, it and its dependencies can't get garbage collected until the finalizer runs. The result is by using HTablePool, we're creating a ton of objects to be finalized that are stuck on the heap longer than they should be, creating our largest source of pressure on the garbage collector. It looks like this will also be a problem in 0.94 and trunk. The easy fix is just to have PooledHTable implement HTableInterface (rather than subclass HTable), but this does break a unit test that explicitly checks that PooledHTable implements HTable -- I can only assume this test is there for some historical passivity reason. -- This message is automatically generated by JIRA.
[jira] [Commented] (HBASE-6327) HLog can be null when create table
[ https://issues.apache.org/jira/browse/HBASE-6327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406802#comment-13406802 ] Zhihong Ted Yu commented on HBASE-6327: ---
{code}
 }
-hlog.closeAndDelete();
{code}
If hlog isn't null, we still need to call closeAndDelete(). HLog can be null when create table -- Key: HBASE-6327 URL: https://issues.apache.org/jira/browse/HBASE-6327 Project: HBase Issue Type: Bug Reporter: ShiXing Assignee: ShiXing Attachments: HBASE-6327-trunk-V1.patch, createTableFailedMaster.log As HBASE-4010 discussed, the HLog can be null. We have seen createTable fail because of the unused hlog. During createHRegion, the HLog.LogSyncer runs sync(), which underneath calls DFSClient.DFSOutputStream.sync(). Then hlog.closeAndDelete() is called: first, HLog.close() interrupts the LogSyncer, which interrupts DFSClient.DFSOutputStream.sync(). DFSClient.DFSOutputStream stores the exception and throws it when we call DFSClient.close(). HLog.close() calls writer.close()/DFSClient.close() after interrupting the LogSyncer, and there is no catch for the exception from close(), so the Master throws the exception to the client. There is no need to throw this exception; furthermore, the hlog is of no use. Our cluster is 0.90; the logs are attached. After closing the hlog writer, there is no log output for the createTable(). In trunk and 0.92/0.94 we use just one hlog, and if the exception happens the client sees createTable fail, but in fact we expect all the regions of the table to be assigned anyway. I will give the patch for this later. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
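Ted's point is that both cases must be handled: skip the call when the hlog is null, but still close-and-delete when it is not. A minimal sketch (HLog here is a trivial stand-in, not the real class):

```java
public class HLogCloseSketch {
    static class HLog {
        boolean closed = false;
        void closeAndDelete() { closed = true; }
    }

    // Skip the call only when there is nothing to close; a non-null
    // hlog must still be closed and deleted.
    static void cleanup(HLog hlog) {
        if (hlog != null) {
            hlog.closeAndDelete();
        }
    }

    public static void main(String[] args) {
        HLog log = new HLog();
        cleanup(log);
        cleanup(null); // must not throw NullPointerException
        System.out.println(log.closed);
    }
}
```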
[jira] [Comment Edited] (HBASE-6327) HLog can be null when create table
[ https://issues.apache.org/jira/browse/HBASE-6327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406802#comment-13406802 ] Zhihong Ted Yu edited comment on HBASE-6327 at 7/5/12 1:08 AM: --- @Xing: Have you tested the change in real cluster ? was (Author: zhi...@ebaysf.com): {code} } -hlog.closeAndDelete(); {code} If hlog isn't null, we still need to call closeAndDelete(). HLog can be null when create table -- Key: HBASE-6327 URL: https://issues.apache.org/jira/browse/HBASE-6327 Project: HBase Issue Type: Bug Reporter: ShiXing Assignee: ShiXing Attachments: HBASE-6327-trunk-V1.patch, createTableFailedMaster.log As HBASE-4010 discussed, the HLog can be null. We have meet createTable failed because the no use hlog. When createHReagion, the HLog.LogSyncer is run sync(), in under layer it call the DFSClient.DFSOutputStream.sync(). Then the hlog.closeAndDelete() was called,firstly the HLog.close() will interrupt the LogSyncer, and interrupt DFSClient.DFSOutputStream.sync().The DFSClient.DFSOutputStream will store the exception and throw it when we called DFSClient.close(). The HLog.close() call the writer.close()/DFSClient.close() after interrupt the LogSyncer. And there is no catch exception for the close(). So the Master throw exception to the client. There is no need to throw this exception, further, the hlog is no use. Our cluster is 0.90, the logs is attached, after closing hlog writer, there is no log for the createTable(). The trunk and 0.92, 0.94, we used just one hlog, and if the exception happends, the client will got createTable failed, but indeed ,we expect all the regions for the table can also be assigned. I will give the patch for this later. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-6329) Stopping the META regionserver could cause a daughter region to be assigned twice
chunhui shen created HBASE-6329: --- Summary: Stopping the META regionserver could cause a daughter region to be assigned twice Key: HBASE-6329 URL: https://issues.apache.org/jira/browse/HBASE-6329 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.0 Reporter: chunhui shen Assignee: chunhui shen We found this issue in 0.94; first let me describe the case: stop the META RS while a split is in progress.
1. Stop the META RS (Server A).
2. The main thread of the RS closes ZK and deletes the RS's ephemeral node.
3. SplitTransaction is retrying MetaEditor.addDaughter.
4. The Master's ServerShutdownHandler processes the above dead META server.
5. The Master fixes up the daughter and assigns it.
6. The daughter is opened on another server (Server B).
7. Server A's SplitTransaction successfully adds the daughter to .META. with serverName=Server A.
8. Now, in .META., the daughter's region location is Server A but it is online on Server B.
9. Restart the Master, and the Master will assign the daughter again.
Attaching the logs, daughter region 80f999ea84cb259e20e9a228546f6c8a Master log: 2012-07-04 13:45:56,493 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for dw93.kgb.sqa.cm4,60020,1341378224464 2012-07-04 13:45:58,983 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Fixup; missing daughter writetest,JC\xCA\xC8\xCFQ\xC49OH\xCEV\xCC\xC2\xB5\xC2@\xD4,1341380730558.80f999ea84cb259e20e9a228546f6c8a. 2012-07-04 13:45:58,985 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Added daughter writetest,JC\xCA\xC8\xCFQ\xC49OH\xCEV\xCC\xC2\xB5\xC2@\xD4,1341380730558.80f999ea84cb259e20e9a228546f6c8a., serverName=null 2012-07-04 13:45:58,988 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region writetest,JC\xCA\xC8\xCFQ\xC49OH\xCEV\xCC\xC2\xB5\xC2@\xD4,1341380730558.80f999ea84cb259e20e9a228546f6c8a.
to dw88.kgb.sqa.cm4,60020,1341379188777 2012-07-04 13:46:00,201 INFO org.apache.hadoop.hbase.master.AssignmentManager: The master has opened the region writetest,JC\xCA\xC8\xCFQ\xC49OH\xCEV\xCC\xC2\xB5\xC2@\xD4,1341380730558.80f999ea84cb259e20e9a228546f6c8a. that was online on dw88.kgb.sqa.cm4,60020,1341379188777 Master log after restart: 2012-07-04 14:27:05,824 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x136187d60e34644 Creating (or updating) unassigned node for 80f999ea84cb259e20e9a228546f6c8a with OFFLINE state 2012-07-04 14:27:05,851 INFO org.apache.hadoop.hbase.master.AssignmentManager: Processing region writetest,JC\xCA\xC8\xCFQ\xC49OH\xCEV\xCC\xC2\xB5\xC2@\xD4,1341380730558.80f999ea84cb259e20e9a228546f6c8a. in state M_ZK_REGION_OFFLINE 2012-07-04 14:27:05,854 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region writetest,JC\xCA\xC8\xCFQ\xC49OH\xCEV\xCC\xC2\xB5\xC2@\xD4,1341380730558.80f999ea84cb259e20e9a228546f6c8a. to dw93.kgb.sqa.cm4,60020,1341380812020 2012-07-04 14:27:06,051 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=dw93.kgb.sqa.cm4,60020,1341380812020, region=80f999ea84cb259e20e9a228546f6c8a Regionserver(META rs) log: 2012-07-04 13:45:56,491 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server dw93.kgb.sqa.cm4,60020,1341378224464; zookeeper connection c losed. 2012-07-04 13:46:11,951 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Added daughter writetest,JC\xCA\xC8\xCFQ\xC49OH\xCEV\xCC\xC2\xB5\xC2@\xD4,1341380730558.80f999ea84cb259e20e9a228546f6c8a., serverName=dw93.kgb.sqa.cm4,60020,1341378224464 2012-07-04 13:46:11,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Done with post open deploy task for region=writetest,JC\xCA\xC8\xCFQ\xC49OH\xCEV\xCC\xC2\xB5\xC2@\xD4,1341380730558.80f999ea84cb259e20e9a228546f6c8a., daughter=true -- This message is automatically generated by JIRA. 
[jira] [Commented] (HBASE-6329) Stop META regionserver could cause daughter region assign twice
[ https://issues.apache.org/jira/browse/HBASE-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406834#comment-13406834 ] chunhui shen commented on HBASE-6329: - IMO, the regionserver should close ZK and delete its ephemeral node in the main thread only after doing join() on the worker threads.
(The full issue description and logs are quoted above in the issue-creation notification.)
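The ordering chunhui proposes can be sketched in a few lines: join the in-flight worker threads first, then close ZK (which deletes the ephemeral node), so a SplitTransaction still retrying a .META. edit cannot run after the master already considers this server dead. The names below are illustrative, not the HRegionServer code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ShutdownOrdering {
    /**
     * Run shutdown in the proposed order: join worker threads first, then close
     * ZK. Returns the observed event order so the ordering is checkable.
     */
    public static List<String> shutdown(Runnable inFlightWork) {
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        Thread worker = new Thread(() -> {
            inFlightWork.run();
            events.add("meta-edit-done"); // e.g. the retried MetaEditor.addDaughter
        });
        worker.start();
        try {
            worker.join();                // 1. let in-flight work finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        events.add("zk-closed");          // 2. only now drop the ephemeral node
        return events;
    }
}
```

With this ordering the master's ServerShutdownHandler can only start after the split's .META. update has either completed or failed for good.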
[jira] [Updated] (HBASE-6329) Stop META regionserver when splitting region could cause daughter region assign twice
[ https://issues.apache.org/jira/browse/HBASE-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-6329: Summary: Stop META regionserver when splitting region could cause daughter region assign twice (was: Stop META regionserver could cause daughter region assign twice)
[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406838#comment-13406838 ] ramkrishna.s.vasudevan commented on HBASE-6299: --- @Maryann/Stack I can describe one scenario where this patch will lead to inconsistency. In the patch
{code}
 } else {
+  // The destination region server is probably processing the region open, so it
+  // might be safer to try this region server again to avoid having two region
+  // servers open the same region.
+  LOG.warn("Call openRegion() to " + plan.getDestination()
+      + " has timed out when trying to assign "
+      + region.getRegionNameAsString()
+      + ". Trying to assign to this region server again; retry=" + i, t);
+  state.update(RegionState.State.OFFLINE);
+  continue;
+}
{code}
Now, because the RS is already opening, we tend to assign the region to the same RS and update the in-memory state to OFFLINE. By that time the RS has moved the znode from OFFLINE to OPENING, or from OPENING to OPENED. Now there is a check in handleRegion:
{code}
if (regionState == null ||
    (!regionState.isPendingOpen() && !regionState.isOpening())) {
  LOG.warn("Received OPENING for region " + prettyPrintedRegionName
      + " from server " + data.getOrigin() + " but region was in "
      + " the state " + regionState + " and not "
      + "in expected PENDING_OPEN or OPENING states");
  return;
}
{code}
So the master skips the transition. Either way, since we retry the assignment to the same RS, we will get either RegionAlreadyInTransition or sometimes even ALREADY_OPENED. If we get ALREADY_OPENED we handle it correctly by adding to this.regions. But if we get RegionAlreadyInTransition we just skip the assign next time. By then the region may have been brought online on the RS side, but the master is not aware of it.
One more thing:
{code}
+} else if (t instanceof java.net.SocketTimeoutException) {
+  if (this.regionsInTransition.get(region.getEncodedName()) == null
+      && plan.getDestination().equals(getRegionServerOfRegion(region))) {
{code}
Here, couldn't the plan be cleared on regionOnline if the RIT is cleared? Ideally, over in HBASE-6060 we were trying to evaluate how good the retry option in assign is. Sometimes the retry option and SSH were causing double assignments, which we were trying to solve. Here, can we have an option for the master to shut down the RS in case of a socket timeout, so that at least we are sure SSH will take care of the assignment? RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems. - Key: HBASE-6299 URL: https://issues.apache.org/jira/browse/HBASE-6299 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.6, 0.94.0 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Critical Attachments: HBASE-6299-v2.patch, HBASE-6299.patch
1. HMaster tries to assign a region to an RS.
2. HMaster creates a RegionState for this region and puts it into regionsInTransition.
3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open-region request and starts to proceed, with success eventually. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out.
4. HMaster attempts to assign a second time, choosing another RS.
5. But since HMaster's OpenedRegionHandler has already been triggered by the region open on the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster considers invalid and ignores the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt.
6. The unassigned ZK node stays, and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created.
{code} 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.; plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-004,60020,1340890123243, dest=swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to
[jira] [Commented] (HBASE-6311) Data error after majorCompaction caused by keeping MVCC for opened scanners
[ https://issues.apache.org/jira/browse/HBASE-6311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406839#comment-13406839 ] ramkrishna.s.vasudevan commented on HBASE-6311: --- @All, could someone take a look at this? Seems important w.r.t. MVCC. Data error after majorCompaction caused by keeping MVCC for opened scanners --- Key: HBASE-6311 URL: https://issues.apache.org/jira/browse/HBASE-6311 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.94.0 Reporter: chunhui shen Assignee: chunhui shen Priority: Blocker Attachments: HBASE-6311-test.patch, HBASE-6311v1.patch
It is a big problem we found in 0.94, and you can reproduce it on trunk using the test case I uploaded. When we do a compaction, we use region.getSmallestReadPoint() to keep MVCC for opened scanners. However, this produces wrong data after a majorCompaction, because we skip the delete-type KV but keep the put-type KV in the compacted storefile. The reason, from the code: in StoreFileScanner, enforceMVCC is false during compaction, so we can read the delete-type KV. However, we skip this delete-type KV in ScanQueryMatcher because of the following code:
{code}
if (kv.isDelete()) {
  ...
  if (includeDeleteMarker && kv.getMemstoreTS() <= maxReadPointToTrackVersions) {
    System.out.println("add deletes,maxReadPointToTrackVersions="
        + maxReadPointToTrackVersions);
    this.deletes.add(bytes, offset, qualLength, timestamp, type);
  }
  ...
}
{code}
Here maxReadPointToTrackVersions = region.getSmallestReadPoint(), and kv.getMemstoreTS() > maxReadPointToTrackVersions, so we won't add this KV to the DeleteTracker.
Why does the test case pass if we remove the line MultiVersionConsistencyControl.setThreadReadPoint(smallestReadPoint)? Because of StoreFileScanner#skipKVsNewerThanReadpoint:
{code}
if (cur.getMemstoreTS() <= readPoint) {
  cur.setMemstoreTS(0);
}
{code}
If we remove the setThreadReadPoint(smallestReadPoint) line, readPoint here is Long.MAX_VALUE, so we set the memstore TS to 0 and therefore do add the KV to the DeleteTracker in ScanQueryMatcher.
Solution: since we use the region's smallestReadPoint during compaction to keep MVCC for opened scanners, we should also retain the delete-type KV in the output in this case (the already-deleted KV is retained in the output so that an old open scanner can still read it), even for a majorCompaction.
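The comparison driving the bug reduces to a tiny predicate. The sketch below restates the two snippets quoted above as standalone functions (hypothetical names, not the ScanQueryMatcher/StoreFileScanner signatures): a delete marker is tracked only when its memstore TS is at or below the smallest read point, so with an open scanner holding that read point down, a newer delete marker is dropped while the put it masks survives the major compaction.

```java
public class DeleteTrackingSketch {
    // The ScanQueryMatcher decision from the description: add the delete-type KV
    // to the DeleteTracker only if its memstoreTS is visible at the read point.
    public static boolean trackDelete(long kvMemstoreTS, long maxReadPointToTrackVersions) {
        return kvMemstoreTS <= maxReadPointToTrackVersions;
    }

    // The StoreFileScanner#skipKVsNewerThanReadpoint effect: KVs at or below the
    // read point have their memstoreTS collapsed to 0. With readPoint ==
    // Long.MAX_VALUE (no thread read point set), every TS becomes 0, which is
    // why the test passes when setThreadReadPoint(smallestReadPoint) is removed.
    public static long normalizeMemstoreTS(long memstoreTS, long readPoint) {
        return memstoreTS <= readPoint ? 0L : memstoreTS;
    }
}
```

A delete written at TS 100 against a smallest read point of 50 is skipped (`trackDelete(100, 50)` is false), which is exactly the reported data error; with no read point set it collapses to TS 0 and is tracked.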
[jira] [Updated] (HBASE-6329) Stop META regionserver when splitting region could cause daughter region assign twice
[ https://issues.apache.org/jira/browse/HBASE-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-6329: Attachment: HBASE-6329v1.patch
[jira] [Updated] (HBASE-6329) Stop META regionserver when splitting region could cause daughter region assign twice
[ https://issues.apache.org/jira/browse/HBASE-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-6329: -- Status: Patch Available (was: Open)
[jira] [Updated] (HBASE-6311) Data error after majorCompaction caused by keeping MVCC for opened scanners
[ https://issues.apache.org/jira/browse/HBASE-6311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-6311: Attachment: HBASE-6311v2.patch @ram What doubt do you have about my patch v2? I updated the test case to verify MVCC for scanners after majorCompaction.
[jira] [Commented] (HBASE-6311) Data error after majorCompaction caused by keeping MVCC for opened scanners
[ https://issues.apache.org/jira/browse/HBASE-6311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406854#comment-13406854 ] ramkrishna.s.vasudevan commented on HBASE-6311: --- @Chunhui I am clear on your patch. It keeps the MVCC concepts intact, which is what is needed. No problem; Anoop has also reviewed it. I just wanted others to review this because now, even on major compaction, we create a file with a delete marker when kv.getMemstoreTS() > maxReadPointToTrackVersions, whereas in the normal case we would not write a delete marker on major compaction. Is this ok?
[jira] [Commented] (HBASE-6329) Stop META regionserver when splitting region could cause daughter region assign twice
[ https://issues.apache.org/jira/browse/HBASE-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406858#comment-13406858 ] ramkrishna.s.vasudevan commented on HBASE-6329: --- Nice one. One question here:

{code}
// Interrupt catalog tracker here in case any regions being opened out in
// handlers are stuck waiting on meta or root.
if (this.catalogTracker != null) this.catalogTracker.stop();
{code}

Doesn't this impact the thread that is trying to write into META through SplitTransaction? Maybe we can add a check: if the RS is already aborting, do not call abort/stop. Sometimes in the above case, if the META write fails, we will reach the PONR, and through the PONR we will call server.abort. An abort is then already in progress and one more abort will be called; I am not sure of the implications if both run at the same time.

Stop META regionserver when splitting region could cause daughter region assign twice - Key: HBASE-6329 URL: https://issues.apache.org/jira/browse/HBASE-6329 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.94.0 Reporter: chunhui shen Assignee: chunhui shen Attachments: HBASE-6329v1.patch

We found this issue in 0.94; first let me describe the case: stop the META RS while a split is in progress.
1. Stopping the META RS (Server A).
2. The main thread of the RS closes ZK and deletes the ephemeral node of the RS.
3. SplitTransaction is retrying MetaEditor.addDaughter.
4. The master's ServerShutdownHandler processes the dead META server.
5. The master fixes up the daughter region and assigns it.
6. The daughter is opened on another server (Server B).
7. Server A's SplitTransaction successfully adds the daughter to .META. with serverName=Server A.
8. Now the daughter's region location in .META. is Server A, but it is online on Server B.
9. Restart the master, and the master will assign the daughter again.
Attaching the logs, daughter region 80f999ea84cb259e20e9a228546f6c8a Master log: 2012-07-04 13:45:56,493 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for dw93.kgb.sqa.cm4,60020,1341378224464 2012-07-04 13:45:58,983 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Fixup; missing daughter writetest,JC\xCA\xC8\xCFQ\xC49OH\xCEV\xCC\xC2\xB5\xC2@\xD4,1341380730558.80f999ea84cb259e20e9a228546f6c8a. 2012-07-04 13:45:58,985 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Added daughter writetest,JC\xCA\xC8\xCFQ\xC49OH\xCEV\xCC\xC2\xB5\xC2@\xD4,1341380730558.80f999ea84cb259e20e9a228546f6c8a., serverName=null 2012-07-04 13:45:58,988 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region writetest,JC\xCA\xC8\xCFQ\xC49OH\xCEV\xCC\xC2\xB5\xC2@\xD4,1341380730558.80f999ea84cb259e20e9a228546f6c8a. to dw88.kgb.sqa.cm4,60020,1341379188777 2012-07-04 13:46:00,201 INFO org.apache.hadoop.hbase.master.AssignmentManager: The master has opened the region writetest,JC\xCA\xC8\xCFQ\xC49OH\xCEV\xCC\xC2\xB5\xC2@\xD4,1341380730558.80f999ea84cb259e20e9a228546f6c8a. that was online on dw88.kgb.sqa.cm4,60020,1341379188777 Master log after restart: 2012-07-04 14:27:05,824 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x136187d60e34644 Creating (or updating) unassigned node for 80f999ea84cb259e20e9a228546f6c8a with OFFLINE state 2012-07-04 14:27:05,851 INFO org.apache.hadoop.hbase.master.AssignmentManager: Processing region writetest,JC\xCA\xC8\xCFQ\xC49OH\xCEV\xCC\xC2\xB5\xC2@\xD4,1341380730558.80f999ea84cb259e20e9a228546f6c8a. in state M_ZK_REGION_OFFLINE 2012-07-04 14:27:05,854 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region writetest,JC\xCA\xC8\xCFQ\xC49OH\xCEV\xCC\xC2\xB5\xC2@\xD4,1341380730558.80f999ea84cb259e20e9a228546f6c8a. 
to dw93.kgb.sqa.cm4,60020,1341380812020 2012-07-04 14:27:06,051 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=dw93.kgb.sqa.cm4,60020,1341380812020, region=80f999ea84cb259e20e9a228546f6c8a Regionserver(META rs) log: 2012-07-04 13:45:56,491 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server dw93.kgb.sqa.cm4,60020,1341378224464; zookeeper connection c losed. 2012-07-04 13:46:11,951 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Added daughter writetest,JC\xCA\xC8\xCFQ\xC49OH\xCEV\xCC\xC2\xB5\xC2@\xD4,1341380730558.80f999ea84cb259e20e9a228546f6c8a., serverName=dw93.kgb.sqa.cm4,60020,1341378224464 2012-07-04 13:46:11,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Done with post open deploy task for
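The double-abort concern raised in the comment on this issue (a regionserver shutdown racing a PONR-triggered server.abort from SplitTransaction) could be handled with a simple first-wins guard. This is only a sketch under assumed names; AbortGuardSketch and tryAbort are illustrative and not part of the actual HRegionServer API.

```java
// Sketch of the suggested guard: skip a second abort/stop if the
// regionserver is already aborting. Names are illustrative only.
import java.util.concurrent.atomic.AtomicBoolean;

public class AbortGuardSketch {
    private final AtomicBoolean aborting = new AtomicBoolean(false);

    // Returns true only for the first caller; later callers (for example a
    // PONR reached while a META write fails during shutdown) become no-ops,
    // so two abort paths never run concurrently.
    boolean tryAbort(String why) {
        if (!aborting.compareAndSet(false, true)) {
            System.out.println("already aborting, ignoring: " + why);
            return false;
        }
        System.out.println("aborting: " + why);
        return true;
    }

    public static void main(String[] args) {
        AbortGuardSketch rs = new AbortGuardSketch();
        rs.tryAbort("stop requested");                // first abort proceeds
        rs.tryAbort("split PONR: META write failed"); // second is ignored
    }
}
```

AtomicBoolean.compareAndSet makes the check-and-mark step atomic, so the guard holds even when the shutdown thread and the split handler race.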
[jira] [Commented] (HBASE-6311) Data error after majorCompaction caused by keeping MVCC for opened scanners
[ https://issues.apache.org/jira/browse/HBASE-6311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406861#comment-13406861 ] chunhui shen commented on HBASE-6311: - bq. But in a normal case we will not write delete marker on major compaction Yes, it's so.
[jira] [Commented] (HBASE-6329) Stop META regionserver when splitting region could cause daughter region assign twice
[ https://issues.apache.org/jira/browse/HBASE-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406860#comment-13406860 ] Hadoop QA commented on HBASE-6329: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12535146/HBASE-6329v1.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings). -1 findbugs. The patch appears to introduce 7 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster org.apache.hadoop.hbase.regionserver.TestServerCustomProtocol org.apache.hadoop.hbase.regionserver.wal.TestHLog Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2324//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2324//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2324//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2324//console This message is automatically generated. 
[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406862#comment-13406862 ] Maryann Xue commented on HBASE-6299: Agree, ramkrishna! You've made a good point here. My original idea was to directly return in the else branch and leave it to the TimeoutMonitor to assign this region if the RS did not open it. I changed to the current version thinking it would bring the assign retry earlier, but given the region-in-transition problem you pointed out, the original return solution looks better.

{code}
 else {
+  // The destination region server is probably processing the region open, so it
+  // might be safer to try this region server again to avoid having two region
+  // servers open the same region.
+  LOG.error("Call openRegion() to " + plan.getDestination()
+      + " has timed out when trying to assign "
+      + region.getRegionNameAsString() + ".", t);
+  return;
+ }
{code}

And if we are considering removing the assign retry in HBASE-6060, problems like this one and the one in HBASE-5816 can be avoided. I think triggering SSH in case of a SocketTimeout is a different problem: there are several places in HMaster where we should consider whether to start SSH, but currently only RegionServerTracker starts SSH. Shall we open another JIRA entry to discuss this issue?

RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems. - Key: HBASE-6299 URL: https://issues.apache.org/jira/browse/HBASE-6299 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.6, 0.94.0 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Critical Attachments: HBASE-6299-v2.patch, HBASE-6299.patch

1. HMaster tries to assign a region to an RS.
2. HMaster creates a RegionState for this region and puts it into regionsInTransition.
3. In the first assign attempt, HMaster calls RS.openRegion().
The RS receives the open region request and starts to proceed, eventually succeeding. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out. 4. HMaster attempts to assign a second time, choosing another RS. 5. But since HMaster's OpenedRegionHandler has already been triggered by the region open on the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster considers the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt invalid and ignores it. 6. The unassigned ZK node stays, and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created. {code} 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.; plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-004,60020,1340890123243, dest=swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
to swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:28,882 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,291 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned
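A minimal sketch of the decision being debated in the comments: on an openRegion() RPC timeout the master cannot know whether the RS actually took the region, so returning (and leaving recovery to the timeout monitor) avoids opening the region on two servers, while an outright refusal makes immediate reassignment safe. All names here are illustrative, not the actual AssignmentManager API.

```java
// Sketch of the timeout-handling choice discussed above; not real HBase code.
public class AssignTimeoutSketch {
    enum RpcResult { OK, TIMEOUT, REFUSED }

    // Decide the follow-up action for one assign attempt.
    static String onOpenRegionResult(RpcResult r) {
        switch (r) {
            case OK:
                return "wait for RS_ZK_REGION_OPENED";
            case TIMEOUT:
                // The RS may have received the call and be opening the
                // region; reassigning now could open it on two servers.
                return "return; let TimeoutMonitor reassign if needed";
            default:
                // The RS definitely did not take the region;
                // it is safe to pick another server immediately.
                return "reassign to another RS";
        }
    }

    public static void main(String[] args) {
        System.out.println(onOpenRegionResult(RpcResult.TIMEOUT));
    }
}
```

The asymmetry between TIMEOUT and REFUSED is the whole point of the comment: only a definite refusal proves the region is unowned.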
[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406863#comment-13406863 ] ramkrishna.s.vasudevan commented on HBASE-6299: --- @Maryann bq. And if we are considering removing the assign retry in HBASE-6060 Assign retry was a point discussed over there, but still not concluded on removing it. bq. Shall we open another JIRA entry to discuss this issue? Yes...sure...Stack, Jon and others have started to work on issues related to assignments recently.
[jira] [Updated] (HBASE-6311) Data error after majorCompaction caused by keeping MVCC for opened scanners
[ https://issues.apache.org/jira/browse/HBASE-6311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Ted Yu updated HBASE-6311: -- Fix Version/s: 0.94.1 0.96.0 Hadoop Flags: Reviewed Status: Patch Available (was: Open)