[jira] [Created] (HBASE-3821) NOT flushing memstore for region keep on printing for half an hour
NOT flushing memstore for region keep on printing for half an hour
------------------------------------------------------------------

                 Key: HBASE-3821
                 URL: https://issues.apache.org/jira/browse/HBASE-3821
             Project: HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 0.90.1
            Reporter: zhoushuaifeng
             Fix For: 0.90.3


"NOT flushing memstore for region" kept printing in the regionserver for half an hour, so I restarted HBase. I suspect there is a deadlock or a cycle. I know that when splitting a region, doClose() of the region runs, sets writestate.writesEnabled = false, and may run a close preflush. This makes flushes fail and print "NOT flushing memstore for region", but it should finish after a while.

logs:

2011-04-18 16:28:27,960 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610. because regionserver60020.cacheFlusher; priority=-1, compaction queue size=1
2011-04-18 16:28:30,171 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.
2011-04-18 16:28:30,171 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610. has too many store files; delaying flush up to 9ms
2011-04-18 16:28:32,119 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200
2011-04-18 16:28:32,285 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/linux253,60020,1303123943360/linux253%3A60020.1303124206693, entries=5226, filesize=255913736. New hlog /hbase/.logs/linux253,60020,1303123943360/linux253%3A60020.1303124311822
2011-04-18 16:28:32,287 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Found 1 hlogs to remove out of total 2; oldest outstanding sequenceid is 11037 from region 031f37c9c23fcab17797b06b90205610
2011-04-18 16:28:32,288 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: moving old hlog file /hbase/.logs/linux253,60020,1303123943360/linux253%3A60020.1303123945481 whose highest sequenceid is 6052 to /hbase/.oldlogs/linux253%3A60020.1303123945481
2011-04-18 16:28:42,701 INFO org.apache.hadoop.hbase.regionserver.Store: Completed major compaction of 4 file(s), new file=hdfs://10.18.52.108:9000/hbase/ufdr/031f37c9c23fcab17797b06b90205610/value/4398465741579485290, size=281.4m; total size for store is 468.8m
2011-04-18 16:28:42,712 INFO org.apache.hadoop.hbase.regionserver.HRegion: completed compaction on region ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610. after 1mins, 40sec
2011-04-18 16:28:42,741 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of region ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.
2011-04-18 16:28:42,770 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.: disabling compactions & flushes
2011-04-18 16:28:42,770 INFO org.apache.hadoop.hbase.regionserver.HRegion: Running close preflush of ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.
2011-04-18 16:28:42,771 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610., current region memstore size 105.6m
2011-04-18 16:28:42,818 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished snapshotting, commencing flushing stores
2011-04-18 16:28:42,846 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610., flushing=false, writesEnabled=false
2011-04-18 16:28:42,849 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.
2011-04-18 16:28:42,849 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610., flushing=false, writesEnabled=false
..
2011-04-18 17:04:08,803 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610., flushing=false, writesEnabled=false
2011-04-18 17:04:08,803 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.

Mon Apr 18 17:04:24 IST 2011 Starting regionserver on linux253
ulimit -n 1024

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3821) NOT flushing memstore for region keep on printing for half an hour
[ https://issues.apache.org/jira/browse/HBASE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025103#comment-13025103 ]

zhoushuaifeng commented on HBASE-3821:
--------------------------------------

I think the problem is like this:

1. When splitting a region, it closes the parent region and sets writestate.writesEnabled = false:

{code}
private List<StoreFile> doClose(final boolean abort) throws IOException {
  synchronized (writestate) {
    // Disable compacting and flushing by background threads for this
    // region.
    writestate.writesEnabled = false;
{code}

2. If the memstore is large enough, a close preflush happens:

{code}
if (!abort && !wasFlushing && worthPreFlushing()) {
  LOG.info("Running close preflush of " + this.getRegionNameAsString());
  internalFlushcache();
}
this.closing.set(true);
lock.writeLock().lock();
{code}

3. An IOException happened, so the preflush failed and closing the parent failed:

{code}
createSplitDir(this.parent.getFilesystem(), this.splitdir);
this.journal.add(JournalEntry.CREATE_SPLIT_DIR);
List<StoreFile> hstoreFilesToSplit = this.parent.close(false);
if (hstoreFilesToSplit == null) {
{code}

4. Rollback of the split is called, but the split state is still CREATE_SPLIT_DIR, so only cleanupSplitDir() runs:

{code}
while (iterator.hasPrevious()) {
  JournalEntry je = iterator.previous();
  switch (je) {
  case CREATE_SPLIT_DIR:
    cleanupSplitDir(fs, this.splitdir);
    break;
  case CLOSED_PARENT_REGION:
{code}

5. What about writestate.writesEnabled? It stays false; nothing resets it. So even though the split is rolled back, no flush can succeed in the parent region.
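The five steps above can be condensed into a toy model. This is an illustrative sketch only (RegionModel, its journal, and the enum values are hypothetical stand-ins, not the actual HRegion/SplitTransaction code); it shows why the rollback must also restore writesEnabled, since the journal never records CLOSED_PARENT_REGION when the preflush throws:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the split/rollback sequence analyzed above.
class RegionModel {
    enum JournalEntry { CREATE_SPLIT_DIR, CLOSED_PARENT_REGION }

    boolean writesEnabled = true;
    final Deque<JournalEntry> journal = new ArrayDeque<>();

    // Mirrors doClose(): writes are disabled first, then the preflush runs
    // and may throw (step 3 in the analysis).
    void close(boolean preflushFails) {
        writesEnabled = false;                       // step 1
        if (preflushFails) {
            throw new RuntimeException("IOException during close preflush");
        }
        journal.push(JournalEntry.CLOSED_PARENT_REGION);
    }

    void startSplit(boolean preflushFails) {
        journal.push(JournalEntry.CREATE_SPLIT_DIR); // step 2's journal entry
        try {
            close(preflushFails);
        } catch (RuntimeException e) {
            rollback();                              // step 4
        }
    }

    // Proposed fix: rollback re-enables writes even when the journal never
    // reached CLOSED_PARENT_REGION, so the parent region can flush again.
    void rollback() {
        while (!journal.isEmpty()) {
            switch (journal.pop()) {
                case CREATE_SPLIT_DIR:
                    // cleanupSplitDir(...) would run here
                    break;
                case CLOSED_PARENT_REGION:
                    break;
            }
        }
        writesEnabled = true;                        // the missing reset (step 5)
    }
}
```

Without the final reset in rollback(), a failed preflush leaves writesEnabled false forever, which is exactly the endless "NOT flushing memstore" loop in the logs.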
[jira] [Updated] (HBASE-3629) Update our thrift to 0.6
[ https://issues.apache.org/jira/browse/HBASE-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Moaz Reyad updated HBASE-3629:
------------------------------

    Attachment: HBASE-3629.patch.zip

Here are the generated files using Thrift 0.6.1, plus the pom changes and small fixes in ThriftServer.java. Not sure if the hadoop-non-releases repository is still needed or whether it can also be removed from the pom.


Update our thrift to 0.6
------------------------

                 Key: HBASE-3629
                 URL: https://issues.apache.org/jira/browse/HBASE-3629
             Project: HBase
          Issue Type: Task
            Reporter: stack
            Assignee: Moaz Reyad
         Attachments: HBASE-3629.patch.zip, pom.diff


HBASE-3117 was about updating to 0.5. Moaz Reyad over in that issue is trying to move us to 0.6. Let's move the 0.6 upgrade effort here.
[jira] [Commented] (HBASE-1744) Thrift server to match the new java api.
[ https://issues.apache.org/jira/browse/HBASE-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025153#comment-13025153 ]

Lars Francke commented on HBASE-1744:
-------------------------------------

https://issues.apache.org/jira/browse/THRIFT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024799#comment-13024799

Thrift is finally in the Maven repository!


Thrift server to match the new java api.
----------------------------------------

                 Key: HBASE-1744
                 URL: https://issues.apache.org/jira/browse/HBASE-1744
             Project: HBase
          Issue Type: Improvement
          Components: thrift
            Reporter: Tim Sell
            Assignee: Lars Francke
            Priority: Critical
             Fix For: 0.92.0
         Attachments: HBASE-1744.2.patch, HBASE-1744.preview.1.patch, thriftexperiment.patch


This mutateRows, etc. is a little confusing compared to the new, cleaner java client. Thinking of ways to make a thrift client that is just as elegant. Something like:

{code}
void put(1:Bytes table, 2:TPut put) throws (1:IOError io)
{code}

with:

{code}
struct TColumn {
  1:Bytes family,
  2:Bytes qualifier,
  3:i64 timestamp
}

struct TPut {
  1:Bytes row,
  2:map<TColumn, Bytes> values
}
{code}

This creates more verbose rpc than if the columns in TPut were just map<Bytes, map<Bytes, Bytes>>, but that is harder to fit timestamps into and still be intuitive from, say, python. Presumably the goal of a thrift gateway is to be easy first.
[jira] [Commented] (HBASE-1744) Thrift server to match the new java api.
[ https://issues.apache.org/jira/browse/HBASE-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025249#comment-13025249 ]

stack commented on HBASE-1744:
------------------------------

@Lars Some fellas figured it ahead of you (smile). See https://mail.google.com/mail/?shva=1#label/hbase-issues/12f8cf7db44d14fb
[jira] [Commented] (HBASE-3629) Update our thrift to 0.6
[ https://issues.apache.org/jira/browse/HBASE-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025257#comment-13025257 ]

Lars Francke commented on HBASE-3629:
-------------------------------------

Thanks for doing this! Please remove the central repository again, though. The finalName entry can be removed as well, as it is inherited anyway. I can't alter the patch right now but can do so tomorrow if needed. Also, the comment about the newer version doesn't apply anymore.
[jira] [Created] (HBASE-3822) region server stuck in waitOnAllRegionsToClose
region server stuck in waitOnAllRegionsToClose
----------------------------------------------

                 Key: HBASE-3822
                 URL: https://issues.apache.org/jira/browse/HBASE-3822
             Project: HBase
          Issue Type: Bug
            Reporter: Prakash Khemani


The regionserver is not able to exit because the rs thread is stuck here:

regionserver60020 prio=10 tid=0x2ab2b039e000 nid=0x760a waiting on condition [0x4365e000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hbase.util.Threads.sleep(Threads.java:126)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.waitOnAllRegionsToClose(HRegionServer.java:736)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:689)
        at java.lang.Thread.run(Thread.java:619)

===

In CloseRegionHandler.process() we do not call removeFromOnlineRegions() if there is an exception. (In this case I suspect there was a log-rolling exception because of another issue.)

{code}
// Close the region
try {
  // TODO: If we need to keep updating CLOSING stamp to prevent against
  // a timeout if this is long-running, need to spin up a thread?
  if (region.close(abort) == null) {
    // This region got closed. Most likely due to a split. So instead
    // of doing the setClosedState() below, let's just ignore and continue.
    // The split message will clean up the master state.
    LOG.warn("Can't close region: was already closed during close(): " +
      regionInfo.getRegionNameAsString());
    return;
  }
} catch (IOException e) {
  LOG.error("Unrecoverable exception while closing region " +
    regionInfo.getRegionNameAsString() + ", still finishing close", e);
}

this.rsServices.removeFromOnlineRegions(regionInfo.getEncodedName());
{code}

===

I think once we set the closing flag on the region, it won't take any more requests; it is as good as offline. Either we should refine the check in waitOnAllRegionsToClose(), or CloseRegionHandler.process() should remove the region from the online-regions set.

--
This message is automatically generated by JIRA.
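The second option in the report — having CloseRegionHandler.process() remove the region from the online set even when close() throws — can be sketched as follows. Region, RegionServices, and process() here are simplified stand-ins for the HBase types, not the actual handler code:

```java
import java.io.IOException;

// Sketch: run removeFromOnlineRegions() in a finally block so an
// IOException during close() cannot leave the region in the online set
// and wedge waitOnAllRegionsToClose(). The interfaces are illustrative
// stand-ins for the HBase types.
interface Region {
    Object close(boolean abort) throws IOException;
    String getEncodedName();
}

interface RegionServices {
    void removeFromOnlineRegions(String encodedName);
}

class CloseRegionSketch {
    static void process(Region region, RegionServices rsServices, boolean abort) {
        boolean closedBySplit = false;
        try {
            if (region.close(abort) == null) {
                // Already closed, most likely by a split; the split message
                // cleans up master state, so skip the removal here.
                closedBySplit = true;
            }
        } catch (IOException e) {
            // Unrecoverable, but still finish the close bookkeeping below.
        } finally {
            if (!closedBySplit) {
                rsServices.removeFromOnlineRegions(region.getEncodedName());
            }
        }
    }
}
```

The finally block preserves the existing split-path behavior (no removal when close() returns null) while guaranteeing the removal on both the success and exception paths.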
[jira] [Commented] (HBASE-3821) NOT flushing memstore for region keep on printing for half an hour
[ https://issues.apache.org/jira/browse/HBASE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025363#comment-13025363 ]

stack commented on HBASE-3821:
------------------------------

Excellent digging Zhou! Yes, if the preflush fails, we need to undo the work done by 'writestate.writesEnabled = false;'. If you have a patch, that'd be great. Was the failure to split a transitory error? Were you able to flush the memstore successfully later?
[jira] [Commented] (HBASE-3065) Retry all 'retryable' zk operations; e.g. connection loss
[ https://issues.apache.org/jira/browse/HBASE-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025364#comment-13025364 ]

stack commented on HBASE-3065:
------------------------------

Ping Liyin!


Retry all 'retryable' zk operations; e.g. connection loss
---------------------------------------------------------

                 Key: HBASE-3065
                 URL: https://issues.apache.org/jira/browse/HBASE-3065
             Project: HBase
          Issue Type: Bug
            Reporter: stack
            Assignee: Liyin Tang
             Fix For: 0.92.0
         Attachments: HBase-3065[r1088475]_1.patch


The 'new' master refactored our zk code, tidying up all zk accesses and corralling them behind nice zk utility classes. One improvement was letting out all KeeperExceptions, letting the client deal. That's good generally because in the old days we'd suppress important zk state changes. But there is at least one case the new zk utility could handle for the application, and that's the class of retryable KeeperExceptions. The one that comes to mind is connection loss. On connection loss we should retry the just-failed operation. Usually the retry will just work. At worst, on reconnect, we'll pick up the expired session event.

Adding in this change shouldn't be too bad given the refactor of zk corralled all zk access into one or two classes only. One thing to consider, though, is how much we should retry. We could retry on a timer, or we could retry forever as long as the Stoppable interface is passed, so that if another thread has stopped or aborted the hosting service, we'll notice and give up trying. Doing the latter is probably better than some kind of timeout.

HBASE-3062 adds a timed retry on the first zk operation. This issue is about generalizing what is over there across all zk access.
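The retry-until-stopped policy described in the issue can be sketched generically. This is a hypothetical helper, not ZKUtil's actual API; RetryableException stands in for retryable KeeperExceptions such as connection loss:

```java
import java.util.function.Supplier;

// Generic sketch of the retry policy described above: retry a "retryable"
// failure (e.g. connection loss) indefinitely, but give up as soon as the
// hosting service is stopped. Stoppable and RetryableException are
// stand-ins for the HBase/ZooKeeper types.
class ZkRetrySketch {
    interface Stoppable { boolean isStopped(); }

    // Stand-in for e.g. KeeperException.ConnectionLossException.
    static class RetryableException extends RuntimeException {}

    static <T> T retryUntilStopped(Supplier<T> op, Stoppable host, long sleepMs) {
        while (!host.isStopped()) {
            try {
                return op.get();
            } catch (RetryableException e) {
                // Usually a plain retry just works; at worst, on reconnect
                // we pick up the expired-session event instead.
                try {
                    Thread.sleep(sleepMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw new IllegalStateException("host stopped before operation succeeded");
    }
}
```

Passing the Stoppable in (rather than a timeout) is the design the issue argues for: the loop exits promptly when another thread stops or aborts the hosting service.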
[jira] [Commented] (HBASE-3674) Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart
[ https://issues.apache.org/jira/browse/HBASE-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025366#comment-13025366 ]

stack commented on HBASE-3674:
------------------------------

Committed to TRUNK. Thanks for the review, Prakash.


Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart
----------------------------------------------------------------------------------------------------------

                 Key: HBASE-3674
                 URL: https://issues.apache.org/jira/browse/HBASE-3674
             Project: HBase
          Issue Type: Bug
          Components: wal
            Reporter: stack
            Assignee: stack
            Priority: Critical
             Fix For: 0.90.2
         Attachments: 3674-distributed.txt, 3674-v2.txt, 3674.txt


In short, a ChecksumException will fail log processing for a server, so we skip out w/o archiving logs. On restart, we'll then reprocess the logs -- hit the ChecksumException anew, usually -- and so on.

Here is the splitLog method (edited):

{code}
private List<Path> splitLog(final FileStatus[] logfiles) throws IOException {
  outputSink.startWriterThreads(entryBuffers);
  try {
    int i = 0;
    for (FileStatus log : logfiles) {
      Path logPath = log.getPath();
      long logLength = log.getLen();
      splitSize += logLength;
      LOG.debug("Splitting hlog " + (i++ + 1) + " of " + logfiles.length +
        ": " + logPath + ", length=" + logLength);
      try {
        recoverFileLease(fs, logPath, conf);
        parseHLog(log, entryBuffers, fs, conf);
        processedLogs.add(logPath);
      } catch (EOFException eof) {
        // truncated files are expected if a RS crashes (see HBASE-2643)
        LOG.info("EOF from hlog " + logPath + ". Continuing");
        processedLogs.add(logPath);
      } catch (FileNotFoundException fnfe) {
        // A file may be missing if the region server was able to archive it
        // before shutting down. This means the edits were persisted already
        LOG.info("A log was missing " + logPath +
          ", probably because it was moved by the now dead region server. Continuing");
        processedLogs.add(logPath);
      } catch (IOException e) {
        // If the IOE resulted from bad file format,
        // then this problem is idempotent and retrying won't help
        if (e.getCause() instanceof ParseException ||
            e.getCause() instanceof ChecksumException) {
          LOG.warn("ParseException from hlog " + logPath + ". continuing");
          processedLogs.add(logPath);
        } else {
          if (skipErrors) {
            LOG.info("Got while parsing hlog " + logPath +
              ". Marking as corrupted", e);
            corruptedLogs.add(logPath);
          } else {
            throw e;
          }
        }
      }
    }
    if (fs.listStatus(srcDir).length > processedLogs.size() + corruptedLogs.size()) {
      throw new OrphanHLogAfterSplitException(
          "Discovered orphan hlog after split. Maybe the " +
          "HRegionServer was not dead when we started");
    }
    archiveLogs(srcDir, corruptedLogs, processedLogs, oldLogDir, fs, conf);
  } finally {
    splits = outputSink.finishWritingAndClose();
  }
  return splits;
}
{code}

Notice how we'll only archive logs if we successfully split all logs. We won't archive 31 of 35 files if we happen to get a checksum exception on file 32. I think we should treat a ChecksumException the same as a ParseException; a retry will not fix it if HDFS could not get around the ChecksumException (it seems like in our case all replicas were corrupt).

Here is a play-by-play from the logs:

{code}
813572 2011-03-18 20:31:44,687 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 34 of 35: hdfs://sv2borg170:9000/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481, length=15065662
813573 2011-03-18 20:31:44,687 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering file hdfs://sv2borg170:9000/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481
813617 2011-03-18 20:31:46,238 INFO org.apache.hadoop.fs.FSInputChecker: Found checksum error: b[0, 512]=00cd00502037383661376439656265643938636463343433386132343631323633303239371d6170695f6163636573735f746f6b656e5f73746174735f6275636b6574000d9fa4d5dc012ec9c7cbaf000001006d005d0008002337626262663764626431616561366234616130656334383436653732333132643a32390764656661756c746170695f616e64726f69645f6c6f67676564
{code}
[jira] [Commented] (HBASE-3065) Retry all 'retryable' zk operations; e.g. connection loss
[ https://issues.apache.org/jira/browse/HBASE-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025369#comment-13025369 ]

Liyin Tang commented on HBASE-3065:
-----------------------------------

Hi Stack, I am so sorry for the delay :) I will fix this and submit a new patch. Thanks for the review :)
[jira] [Commented] (HBASE-3777) Redefine Identity Of HBase Configuration
[ https://issues.apache.org/jira/browse/HBASE-3777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13025368#comment-13025368 ] jirapos...@reviews.apache.org commented on HBASE-3777: -- bq. On 2011-04-25 20:05:54, Michael Stack wrote: bq. src/main/java/org/apache/hadoop/hbase/client/HTable.java, line 259 bq. https://reviews.apache.org/r/643/diff/3/?file=16912#file16912line259 bq. bq. Yeah, this is ugly its almost as though you should have a special method for it, one that does not up the counters? bq. bq. Karthick Sankarachary wrote: bq. Just a thought - how about if we hide the ugliness in HCM, like so: bq. bq.public abstract class ConnectableT { bq. public Configuration conf; bq. bq. public Connectable(Configuration conf) { bq.this.conf = conf; bq. } bq. bq. public abstract T connect(Connection connection); bq.} bq. bq.public static T T execute(ConnectableT connectable) { bq. if (connectable == null || connectable.conf == null) { bq.return null; bq. } bq. HConfiguration conf = connectable.conf; bq. HConnection connection = HConnectionManager.getConnection(conf); bq. try { bq.return connectable.connect(connection); bq. } finally { bq.HConnectionManager.deleteConnection(conf, false); bq. } bq.} bq. bq. That way, the HTable call would look somewhat prettier: bq. bq.HConnectionManager.execute(new ConnectableBoolean(conf) { bq. public Boolean connect(Connection connection) { bq.return connection.isTableEnabled(tableName); bq. } bq.}); bq. bq. Karthick Sankarachary wrote: bq. BTW, if we bypass the reference counters in this situation, there's a chance, albeit small, that the connection might get closed by someone else while this guy is still trying to talk to it, which could result in a connection is closed type of error. 
Your proposal is also ugly but I think less ugly than what we currently have so I would prefer it; it has the benefit of moving the ref counting back into HCM, not letting it out of the class (I'm fine w/ all your other comments Karthick) - Michael --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/643/#review543 --- On 2011-04-22 21:16:59, Karthick Sankarachary wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/643/ bq. --- bq. bq. (Updated 2011-04-22 21:16:59) bq. bq. bq. Review request for hbase and Ted Yu. bq. bq. bq. Summary bq. --- bq. bq. Judging from the javadoc in HConnectionManager, sharing connections across multiple clients going to the same cluster is supposedly a good thing. However, the fact that there is a one-to-one mapping between a configuration and connection instance, kind of works against that goal. Specifically, when you create HTable instances using a given Configuration instance and a copy thereof, we end up with two distinct HConnection instances under the covers. Is this really expected behavior, especially given that the configuration instance gets cloned a lot? bq. bq. Here, I'd like to play devil's advocate and propose that we deep-compare HBaseConfiguration instances, so that multiple HBaseConfiguration instances that have the same properties map to the same HConnection instance. In case one is concerned that a single HConnection is insufficient for sharing amongst clients, to quote the javadoc, then one should be able to mark a given HBaseConfiguration instance as being uniquely identifiable. bq. bq. bq. This addresses bug HBASE-3777. bq. https://issues.apache.org/jira/browse/HBASE-3777 bq. bq. bq. Diffs bq. - bq. 
bq.src/main/java/org/apache/hadoop/hbase/HConstants.java 5701639 bq.src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java be31179 bq.src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java afb666a bq.src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java c348f7a bq.src/main/java/org/apache/hadoop/hbase/client/HTable.java edacf56 bq.src/main/java/org/apache/hadoop/hbase/client/HTablePool.java 88827a8 bq.src/main/java/org/apache/hadoop/hbase/client/MetaScanner.java 9e3f4d1 bq. src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java d76e333 bq. src/main/java/org/apache/hadoop/hbase/mapreduce/replication/VerifyReplication.java ed88bfa bq.
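The Connectable/execute idiom proposed in the review above can be sketched without any HBase dependencies. Everything here is an illustrative stand-in (the Connection class, checkEnabled, and so on), not the HConnectionManager API, and the conf plumbing from the proposal is omitted for brevity; the point is only the shape: the manager owns acquire/release of the shared resource, and the caller supplies just the work to run against it.

```java
// Minimal, HBase-free sketch of the execute-around idiom from the review.
// All names are hypothetical stand-ins, not the real HBase client classes.
public class ConnectableSketch {
    // Stands in for HConnection: tracks open/closed state.
    public static class Connection {
        boolean open = true;
        void close() { open = false; }
        boolean isTableEnabled(String table) { return open; }
    }

    // Stands in for HConnectionManager.Connectable<T>: the caller's work.
    public abstract static class Connectable<T> {
        public abstract T connect(Connection connection);
    }

    // Stands in for HConnectionManager.execute: acquire the connection, run
    // the callback, and release in a finally block so the reference counting
    // cannot leak even if connect() throws.
    public static <T> T execute(Connectable<T> connectable) {
        if (connectable == null) {
            return null;
        }
        Connection connection = new Connection();
        try {
            return connectable.connect(connection);
        } finally {
            connection.close();  // deleteConnection(conf, false) in the proposal
        }
    }

    // The call-site shape proposed for HTable.isTableEnabled.
    public static Boolean checkEnabled(final String tableName) {
        return execute(new Connectable<Boolean>() {
            public Boolean connect(Connection connection) {
                return connection.isTableEnabled(tableName);
            }
        });
    }
}
```

This keeps the ref counting inside one class, which is the benefit Stack calls out in the follow-up comment.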
[jira] [Commented] (HBASE-3823) NPE in ZKAssign.transitionNode
[ https://issues.apache.org/jira/browse/HBASE-3823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025399#comment-13025399 ] stack commented on HBASE-3823: -- Is this HBASE-3627 (fixed in 0.90.2)? NPE in ZKAssign.transitionNode -- Key: HBASE-3823 URL: https://issues.apache.org/jira/browse/HBASE-3823 Project: HBase Issue Type: Bug Reporter: Prakash Khemani This issue led to a region being multiply assigned. hbck output ERROR: Region realtime_domain_imps_urls,afbe,1295556905482.e7a478b4bd164525052f1dedb832de0a. is listed in META on region server pumahbase107.snc5.facebook.com:60020 but is multiply assigned to region servers pumahbase150.snc5.facebook.com:60020, pumahbase107.snc5.facebook.com:60020 === 2011-04-25 09:11:36,844 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_RS_OPEN_REGION java.lang.NullPointerException at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75) at org.apache.hadoop.hbase.executor.RegionTransitionData.fromBytes(RegionTransitionData.java:198) at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:672) at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNodeOpened(ZKAssign.java:621) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:168) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619)

byte[] existingBytes = ZKUtil.getDataNoWatch(zkw, node, stat);
RegionTransitionData existingData = RegionTransitionData.fromBytes(existingBytes);

existingBytes can be null; transitionNode has to return -1 when it is. 
=== master logs 2011-04-25 05:24:03,250 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Creating writer path=hdfs://pumahbase002-snc5-dfs.data.facebook.com:9000/PUMAHBASE002-SNC5-HBASE/realtime_domain_imps_urls/e7a478b4bd164525052f1dedb832de0a/recovered.edits/57528037047 region=e7a478b4bd164525052f1dedb832de0a 2011-04-25 09:09:19,246 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed path hdfs://pumahbase002-snc5-dfs.data.facebook.com:9000/PUMAHBASE002-SNC5-HBASE/realtime_domain_imps_urls/e7a478b4bd164525052f1dedb832de0a/recovered.edits/57528037047 (wrote 4342690 edits in 46904ms) 2011-04-25 09:09:26,134 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x32f7bb74e8a Creating (or updating) unassigned node for e7a478b4bd164525052f1dedb832de0a with OFFLINE state 2011-04-25 09:09:26,136 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for realtime_domain_imps_urls,afbe,1295556905482.e7a478b4bd164525052f1dedb832de0a. so generated a random one; hri=realtime_domain_imps_urls,afbe,1295556905482.e7a478b4bd164525052f1dedb832de0a., src=, dest=pumahbase107.snc5.facebook.com,60020,1303450731227; 70 (online=70, exclude=null) available servers 2011-04-25 09:09:26,136 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region realtime_domain_imps_urls,afbe,1295556905482.e7a478b4bd164525052f1dedb832de0a. 
to pumahbase107.snc5.facebook.com,60020,1303450731227 2011-04-25 09:09:26,139 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=pumahbase107.snc5.facebook.com,60020,1303450731227, region=e7a478b4bd164525052f1dedb832de0a 2011-04-25 09:09:44,045 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=pumahbase107.snc5.facebook.com,60020,1303450731227, region=e7a478b4bd164525052f1dedb832de0a 2011-04-25 09:09:59,050 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=pumahbase107.snc5.facebook.com,60020,1303450731227, region=e7a478b4bd164525052f1dedb832de0a 2011-04-25 09:10:14,054 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=pumahbase107.snc5.facebook.com,60020,1303450731227, region=e7a478b4bd164525052f1dedb832de0a 2011-04-25 09:10:29,055 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=pumahbase107.snc5.facebook.com,60020,1303450731227, region=e7a478b4bd164525052f1dedb832de0a 2011-04-25 09:10:44,060 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING,
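The null guard the report asks for can be sketched stand-alone as follows; readZnode and transitionNodeGuarded are hypothetical stand-ins for ZKUtil.getDataNoWatch and ZKAssign.transitionNode, with -1 as the failure code the comment mentions.

```java
// Sketch of the guard: getDataNoWatch can return null when the znode is
// gone, so bail out with -1 instead of handing null to fromBytes (which is
// where the NPE in the stack trace comes from).
public class TransitionGuard {
    // Stands in for ZKUtil.getDataNoWatch: may return null.
    static byte[] readZnode(boolean exists) {
        return exists ? new byte[] {1, 2, 3} : null;
    }

    public static int transitionNodeGuarded(boolean znodeExists) {
        byte[] existingBytes = readZnode(znodeExists);
        if (existingBytes == null) {
            return -1;  // node vanished; report failure rather than NPE
        }
        // RegionTransitionData.fromBytes(existingBytes) would run here.
        return existingBytes.length;
    }
}
```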
[jira] [Commented] (HBASE-3822) region server stuck in waitOnAllRegionsToClose
[ https://issues.apache.org/jira/browse/HBASE-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025443#comment-13025443 ] Prakash Khemani commented on HBASE-3822: The code snippet that I pointed out doesn't have a problem - that piece of code will remove the region from online regions even if there is an exception. Sorry for the confusion. I don't really know why the onlineRegions set was not cleaned up. region server stuck in waitOnAllRegionsToClose -- Key: HBASE-3822 URL: https://issues.apache.org/jira/browse/HBASE-3822 Project: HBase Issue Type: Bug Reporter: Prakash Khemani The regionserver is not able to exit because the rs thread is stuck here regionserver60020 prio=10 tid=0x2ab2b039e000 nid=0x760a waiting on condition [0x4365e000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.hbase.util.Threads.sleep(Threads.java:126) at org.apache.hadoop.hbase.regionserver.HRegionServer.waitOnAllRegionsToClose(HRegionServer.java:736) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:689) at java.lang.Thread.run(Thread.java:619) === In CloseRegionHandler.process() we do not call removeFromOnlineRegions() if there is an exception. (In this case I suspect there was a log-rolling exception because of another issue)

// Close the region
try {
  // TODO: If we need to keep updating CLOSING stamp to prevent against
  // a timeout if this is long-running, need to spin up a thread?
  if (region.close(abort) == null) {
    // This region got closed. Most likely due to a split. So instead
    // of doing the setClosedState() below, let's just ignore and continue.
    // The split message will clean up the master state.
    LOG.warn("Can't close region: was already closed during close(): " +
        regionInfo.getRegionNameAsString());
    return;
  }
} catch (IOException e) {
  LOG.error("Unrecoverable exception while closing region " +
      regionInfo.getRegionNameAsString() + ", still finishing close", e);
}
this.rsServices.removeFromOnlineRegions(regionInfo.getEncodedName());

=== I think since we set the closing flag on the region, it won't be taking any more requests; it is as good as offline. Either we should refine the check in waitOnAllRegionsToClose() or CloseRegionHandler.process() should remove the region from the online-regions set. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
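One way to make the cleanup unconditional, in the spirit of the last suggestion above, is a finally block. This is a minimal stand-alone sketch with illustrative names (a boolean flag standing in for the online-regions set, a RuntimeException standing in for the IOException path), not the actual CloseRegionHandler code.

```java
// Sketch: move the remove-from-online-regions step into finally so the
// region leaves the online set no matter how close() exits. Names and
// exception types are stand-ins for illustration only.
public class CloseSketch {
    public static boolean online;

    public static void process(boolean closeThrows) {
        online = true;
        try {
            if (closeThrows) {
                // stands in for the log-rolling exception during close
                throw new RuntimeException("close failed");
            }
        } catch (RuntimeException e) {
            // "still finishing close": swallow and fall through to cleanup
        } finally {
            online = false;  // removeFromOnlineRegions(...) in the real code
        }
    }
}
```

With this shape, waitOnAllRegionsToClose would see the region gone even on the exceptional path.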
[jira] [Resolved] (HBASE-3823) NPE in ZKAssign.transitionNode
[ https://issues.apache.org/jira/browse/HBASE-3823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani resolved HBASE-3823. Resolution: Duplicate Release Note: fixed in HBASE-3627 NPE in ZKAssign.transitionNode -- Key: HBASE-3823 URL: https://issues.apache.org/jira/browse/HBASE-3823 Project: HBase Issue Type: Bug Reporter: Prakash Khemani 
[jira] [Created] (HBASE-3824) region server timed out during open region
region server timed out during open region -- Key: HBASE-3824 URL: https://issues.apache.org/jira/browse/HBASE-3824 Project: HBase Issue Type: Bug Reporter: Prakash Khemani When replaying a large log file, memstore flushes can happen. But there is no Progressable report being sent during memstore flushes. That can lead to the master timing out the region server during region open. === Another related issue and Jonathan's response So if a region server that is handed a region for opening has done part of the work ... it has created some HFiles (because the logs were so huge that the memstore got flushed while the logs were being replayed) ... and then it is asked to give up because the master thought the region server was taking too long to open the region. When the region server gives up on the region, will it make sure that it removes all the HFiles it had created for that region? Will need to check the code, but would it matter? One issue is whether it cleans up after itself (I'm guessing not). Another issue is whether the replay is idempotent (duplicate KVs across files shouldn't matter in most cases). 
[jira] [Commented] (HBASE-3484) Replace memstore's ConcurrentSkipListMap with our own implementation
[ https://issues.apache.org/jira/browse/HBASE-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025463#comment-13025463 ] Joe Pallas commented on HBASE-3484: --- This issue was cited by jdcryans as related to unfortunate performance seen in the following case: A test program fills a single row of a family with tens of thousands of sequentially increasing qualifiers. Then it performs random gets (or exists) of those qualifiers. The response time seen is (on average) proportional to the ordinal position of the qualifier. If the table is flushed before the random tests begin, then the average response time is basically constant, independent of the qualifier's ordinal position. I'm not sure that either of the two points in the description actually covers this case, but I don't know enough to say. Replace memstore's ConcurrentSkipListMap with our own implementation Key: HBASE-3484 URL: https://issues.apache.org/jira/browse/HBASE-3484 Project: HBase Issue Type: Improvement Affects Versions: 0.92.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: 0.92.0 By copy-pasting ConcurrentSkipListMap into HBase we can make two improvements to it for our use case in MemStore: - add an iterator.replace() method which should allow us to do upsert much more cheaply - implement a Set directly without having to do Map<KeyValue,KeyValue> to save one reference per entry It turns out CSLM is in public domain from its development as part of JSR 166, so we should be OK with licenses.
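The upsert cost the first bullet targets can be seen on a stock ConcurrentSkipListMap: replacing an entry whose key changes takes a remove plus a put, i.e. two O(log n) skip-list walks, which an iterator.replace() would collapse into one. This is a toy sketch with Long keys standing in for KeyValues, not the memstore code; the value reference stored alongside each key is also the per-entry overhead the proposed Set-based implementation would save.

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Toy model of a memstore upsert on a stock CSLM: two traversals where an
// iterator.replace() would need only one. Keys are stand-ins for KeyValues.
public class UpsertSketch {
    public static long upsertThenFirst(long oldKey, long newKey) {
        ConcurrentSkipListMap<Long, Long> memstore =
            new ConcurrentSkipListMap<Long, Long>();
        memstore.put(oldKey, oldKey);
        memstore.remove(oldKey);       // first O(log n) traversal
        memstore.put(newKey, newKey);  // second O(log n) traversal
        return memstore.firstKey();
    }
}
```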
[jira] [Commented] (HBASE-3824) region server timed out during open region
[ https://issues.apache.org/jira/browse/HBASE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025483#comment-13025483 ] Jean-Daniel Cryans commented on HBASE-3824: --- So what's the issue about exactly? We expect region server to time out opening AFAIK, so is the problem more about the idempotent nature of opening a region and then failing at doing it when it's assigned somewhere else? region server timed out during open region -- Key: HBASE-3824 URL: https://issues.apache.org/jira/browse/HBASE-3824 Project: HBase Issue Type: Bug Reporter: Prakash Khemani 
[jira] [Resolved] (HBASE-3824) region server timed out during open region
[ https://issues.apache.org/jira/browse/HBASE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani resolved HBASE-3824. Resolution: Not A Problem region server timed out during open region -- Key: HBASE-3824 URL: https://issues.apache.org/jira/browse/HBASE-3824 Project: HBase Issue Type: Bug Reporter: Prakash Khemani 
[jira] [Resolved] (HBASE-3629) Update our thrift to 0.6
[ https://issues.apache.org/jira/browse/HBASE-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack resolved HBASE-3629. -- Resolution: Fixed Hadoop Flags: [Reviewed] Update our thrift to 0.6 Key: HBASE-3629 URL: https://issues.apache.org/jira/browse/HBASE-3629 Project: HBase Issue Type: Task Reporter: stack Assignee: Moaz Reyad Attachments: HBASE-3629.patch.zip, pom.diff HBASE-3117 was about updating to 0.5. Moaz Reyad over in that issue is trying to move us to 0.6. Let's move the 0.6 upgrade effort here.
[jira] [Commented] (HBASE-3824) region server timed out during open region
[ https://issues.apache.org/jira/browse/HBASE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025491#comment-13025491 ] Prakash Khemani commented on HBASE-3824: Probably not an issue. The memstore flush happens in the background and cannot cause the log-replay thread to block. My mistake. I will close this. region server timed out during open region -- Key: HBASE-3824 URL: https://issues.apache.org/jira/browse/HBASE-3824 Project: HBase Issue Type: Bug Reporter: Prakash Khemani 
[jira] [Updated] (HBASE-3629) Update our thrift to 0.6
[ https://issues.apache.org/jira/browse/HBASE-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-3629: - Release Note: Updated our thrift to 0.6.1. Incompatible change with previous HBase thrift. Update our thrift to 0.6 Key: HBASE-3629 URL: https://issues.apache.org/jira/browse/HBASE-3629 Project: HBase Issue Type: Task Reporter: stack Assignee: Moaz Reyad Fix For: 0.92.0 Attachments: HBASE-3629.patch.zip, pom.diff HBASE-3117 was about updating to 0.5. Moaz Reyad over in that issue is trying to move us to 0.6. Let's move the 0.6 upgrade effort here. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-3629) Update our thrift to 0.6
[ https://issues.apache.org/jira/browse/HBASE-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-3629: - Fix Version/s: 0.92.0 Resolved. Applied to TRUNK (with Lars Francke's suggested changes). Thanks for the patch Moaz (it's missing the Apache license on the generated files but I think that's ok -- until someone tells me otherwise). Thanks for the review Lars. Update our thrift to 0.6 Key: HBASE-3629 URL: https://issues.apache.org/jira/browse/HBASE-3629 Project: HBase Issue Type: Task Reporter: stack Assignee: Moaz Reyad Fix For: 0.92.0 Attachments: HBASE-3629.patch.zip, pom.diff HBASE-3117 was about updating to 0.5. Moaz Reyad over in that issue is trying to move us to 0.6. Let's move the 0.6 upgrade effort here. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-3741) OpenRegionHandler and CloseRegionHandler are possibly racing
[ https://issues.apache.org/jira/browse/HBASE-3741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-3741: -- Attachment: HBASE-3741-rsfix-v3.patch Takes care of what Stack mentioned in his review except for getRegionsInTransitionInRS that needs to be public (in the scope of RegionServerServices). OpenRegionHandler and CloseRegionHandler are possibly racing Key: HBASE-3741 URL: https://issues.apache.org/jira/browse/HBASE-3741 Project: HBase Issue Type: Bug Affects Versions: 0.90.1 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Priority: Blocker Fix For: 0.90.3 Attachments: HBASE-3741-rsfix-v2.patch, HBASE-3741-rsfix-v3.patch, HBASE-3741-rsfix.patch This is a serious issue about a race between regions being opened and closed in region servers. We had this situation where the master tried to unassign a region for balancing, failed, force unassigned it, force assigned it somewhere else, failed to open it on another region server (took too long), and then reassigned it back to the original region server. A few seconds later, the region server processed the first close and the region was left unassigned. 
This is from the master log: {quote} 11-04-05 15:11:17,758 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=sv4borg42,60020,1300920459477, load=(requests=187, regions=574, usedHeap=3918, maxHeap=6973) for region stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 2011-04-05 15:12:10,021 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 state=PENDING_CLOSE, ts=1302041477758 2011-04-05 15:12:10,021 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 ... 2011-04-05 15:14:45,783 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 state=CLOSED, ts=1302041685733 2011-04-05 15:14:45,783 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x42ec2cece810b68 Creating (or updating) unassigned node for 1470298961 with OFFLINE state ... 
2011-04-05 15:14:45,885 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961; plan=hri=stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961, src=sv4borg42,60020,1300920459477, dest=sv4borg40,60020,1302041218196 2011-04-05 15:14:45,885 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 to sv4borg40,60020,1302041218196 2011-04-05 15:15:39,410 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 state=PENDING_OPEN, ts=1302041700944 2011-04-05 15:15:39,410 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 2011-04-05 15:15:39,410 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 state=PENDING_OPEN, ts=1302041700944 ... 
2011-04-05 15:15:39,410 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 so generated a random one; hri=stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961, src=, dest=sv4borg42,60020,1300920459477; 19 (online=19, exclude=null) available servers 2011-04-05 15:15:39,410 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 to sv4borg42,60020,1300920459477 2011-04-05 15:15:40,951 DEBUG
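The direction of the fix mentioned above (a region-server-side set of regions in transition, which the open and close handlers consult so they can't race on the same region) can be sketched roughly as follows. Class and method names are illustrative stand-ins, not the actual patch:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hedged sketch: the region server tracks regions currently in transition,
// and an open or close handler only proceeds if it can claim the region.
// A competing handler for the same region backs off instead of racing.
public class RegionsInTransition {
    private final ConcurrentMap<String, Boolean> inTransition = new ConcurrentHashMap<>();

    /** Try to claim the region for an open (true) or close (false). */
    boolean tryBegin(String encodedName, boolean opening) {
        // putIfAbsent returns null only when no handler holds the region.
        return inTransition.putIfAbsent(encodedName, opening) == null;
    }

    /** Release the claim once the open/close completes or fails. */
    void done(String encodedName) {
        inTransition.remove(encodedName);
    }

    public static void main(String[] args) {
        RegionsInTransition rit = new RegionsInTransition();
        System.out.println(rit.tryBegin("1470298961", true));  // true: open claimed
        System.out.println(rit.tryBegin("1470298961", false)); // false: close must wait
        rit.done("1470298961");
        System.out.println(rit.tryBegin("1470298961", false)); // true: now closable
    }
}
```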
[jira] [Resolved] (HBASE-3805) Log RegionState that are processed too late in the master
[ https://issues.apache.org/jira/browse/HBASE-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans resolved HBASE-3805. --- Resolution: Fixed Committed the additional logging. Log RegionState that are processed too late in the master -- Key: HBASE-3805 URL: https://issues.apache.org/jira/browse/HBASE-3805 Project: HBase Issue Type: Improvement Affects Versions: 0.90.2 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Priority: Minor Fix For: 0.90.3 Attachments: HBASE-3805.patch Working on all the weird delayed processing in the master, I saw that it was hard to figure out when a zookeeper event is processed too late. For example, cases where the processing of the events gets too slow and the master takes more than a minute after the event is triggered in the region server to get to its processing. We should at least print that out. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-1502) Remove need for heartbeats in HBase
[ https://issues.apache.org/jira/browse/HBASE-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13025520#comment-13025520 ] jirapos...@reviews.apache.org commented on HBASE-1502: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/647/ --- (Updated 2011-04-26 23:50:23.656187) Review request for hbase. Changes --- All tests pass now. I'd like to get this patch in soon. I'm currently spending a good bit of my time trying to keep this patch up with current TRUNK. I'd rather commit and then address issues after. This version of the patch does make one significant change though, in that it deprecates prewarmRegionCache. IMO this is a burdensome feature that is little used; I'd like to have it die off. Summary --- This patch does not completely remove heartbeats. It unburdens the heartbeat of control messages; now the heartbeat is used to send the master load only (at the most recent hackathon we had rough agreement that we'd keep the heartbeat to carry load)... if we miss some, no biggie. The RPC version changed on HMasterRegionInterface since the regionServerStartup and regionServerReport arguments have changed. We pass a String now instead of HServerAddress, so this should help with our DNS issues where the two sides disagree. Removed HMsg. HServerAddress has been sort_of_deprecated. It's in our API so we can't remove it easily (it's embedded inside HRegionLocation). Otherwise, we don't use it internally anymore. HServerInfo is deprecated. Server metadata is now available in the new class ServerName, and load lives apart from HSI now. Fixed up regionserver and master startup so they now look the same. New tests. Cruft cleanup. This addresses bug hbase-1502. 
https://issues.apache.org/jira/browse/hbase-1502 Diffs (updated) - src/main/java/org/apache/hadoop/hbase/ClusterStatus.java 26a8bef src/main/java/org/apache/hadoop/hbase/HConstants.java 5701639 src/main/java/org/apache/hadoop/hbase/HMsg.java 87beb00 src/main/java/org/apache/hadoop/hbase/HRegionLocation.java bd353b8 src/main/java/org/apache/hadoop/hbase/HServerAddress.java 7f8a472 src/main/java/org/apache/hadoop/hbase/HServerInfo.java 0b5bd94 src/main/java/org/apache/hadoop/hbase/HServerLoad.java 2372053 src/main/java/org/apache/hadoop/hbase/LocalHBaseCluster.java 0d696ab src/main/java/org/apache/hadoop/hbase/MasterAddressTracker.java 1da9742 src/main/java/org/apache/hadoop/hbase/Server.java df396fa src/main/java/org/apache/hadoop/hbase/ServerName.java PRE-CREATION src/main/java/org/apache/hadoop/hbase/avro/AvroUtil.java d7a1e67 src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java be31179 src/main/java/org/apache/hadoop/hbase/catalog/MetaEditor.java c2ee031 src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java 6e22cf5 src/main/java/org/apache/hadoop/hbase/catalog/RootLocationEditor.java aee64c5 src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java afb666a src/main/java/org/apache/hadoop/hbase/client/HConnection.java 2bb4725 src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java c348f7a src/main/java/org/apache/hadoop/hbase/client/HTable.java edacf56 src/main/java/org/apache/hadoop/hbase/client/RetriesExhaustedWithDetailsException.java 6c62024 src/main/java/org/apache/hadoop/hbase/coprocessor/BaseMasterObserver.java 8df6aa4 src/main/java/org/apache/hadoop/hbase/coprocessor/MasterObserver.java d64817f src/main/java/org/apache/hadoop/hbase/executor/EventHandler.java c22e342 src/main/java/org/apache/hadoop/hbase/executor/RegionTransitionData.java a55f9d6 src/main/java/org/apache/hadoop/hbase/io/HbaseObjectWritable.java d8f8463 src/main/java/org/apache/hadoop/hbase/ipc/HBaseServer.java ec28de4 
src/main/java/org/apache/hadoop/hbase/ipc/HMasterRegionInterface.java 25139b3 src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java 663cab5 src/main/java/org/apache/hadoop/hbase/ipc/WritableRpcEngine.java 2273e55 src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java 66a3345 src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 58fdb30 src/main/java/org/apache/hadoop/hbase/master/DeadServer.java 05600c4 src/main/java/org/apache/hadoop/hbase/master/HMaster.java 79a48ba src/main/java/org/apache/hadoop/hbase/master/LoadBalancer.java 6c92cbc src/main/java/org/apache/hadoop/hbase/master/MasterCoprocessorHost.java 4bb072e src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java 55e0162 src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 04befe9 src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java dada818
[jira] [Updated] (HBASE-3794) TestRpcMetrics fails on machine where region server is running
[ https://issues.apache.org/jira/browse/HBASE-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-3794: - Resolution: Fixed Fix Version/s: 0.90.3 Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Committed branch and trunk. Thanks for the patch Alex. TestRpcMetrics fails on machine where region server is running -- Key: HBASE-3794 URL: https://issues.apache.org/jira/browse/HBASE-3794 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.90.2 Reporter: Ted Yu Assignee: Alex Newman Fix For: 0.90.3 Attachments: HBASE-3794.patch Since the whole test suite takes over an hour to run, I ran the tests on a Linux machine where a region server is running. Here is the consistent TestRpcMetrics failure I saw:
{code}
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.196 sec FAILURE!
testCustomMetrics(org.apache.hadoop.hbase.regionserver.TestRpcMetrics) Time elapsed: 0.079 sec ERROR!
java.net.BindException: Problem binding to /10.202.50.107:60020 : Address already in use
at org.apache.hadoop.hbase.ipc.HBaseServer.bind(HBaseServer.java:216)
at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.init(HBaseServer.java:283)
at org.apache.hadoop.hbase.ipc.HBaseServer.init(HBaseServer.java:1189)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.init(WritableRpcEngine.java:266)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getServer(WritableRpcEngine.java:233)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getServer(WritableRpcEngine.java:46)
at org.apache.hadoop.hbase.ipc.HBaseRPC.getServer(HBaseRPC.java:379)
at org.apache.hadoop.hbase.ipc.HBaseRPC.getServer(HBaseRPC.java:368)
at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:336)
at org.apache.hadoop.hbase.regionserver.TestRpcMetrics$TestRegionServer.init(TestRpcMetrics.java:58)
at org.apache.hadoop.hbase.regionserver.TestRpcMetrics.testCustomMetrics(TestRpcMetrics.java:119)
{code}
-- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3811) Allow adding attributes to Scan
[ https://issues.apache.org/jira/browse/HBASE-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13025522#comment-13025522 ] stack commented on HBASE-3811: -- You could change Scan so it passes an optional Map. Add an attributes method to Scan. If attributes > 0, instantiate a Map and then serialize it out. On the other end, check for a null or non-null Map when deserializing (we don't want to carry the Map for every Scan, I'd say -- just carry it when attributes are present... but perhaps I'm doing premature optimization here). This could become more important now that we have CPs. Allow adding attributes to Scan --- Key: HBASE-3811 URL: https://issues.apache.org/jira/browse/HBASE-3811 Project: HBase Issue Type: Improvement Components: client Reporter: Alex Baranau Priority: Minor There's sometimes a need to add a custom attribute to a Scan object so that it can be accessed on the server side. An example of a case where it is needed is discussed here: http://search-hadoop.com/m/v3Jtb2GkiO. There might be other cases where it is useful, mostly about logging/gathering stats on the server side. An alternative to allowing arbitrary custom attributes on a scan could be adding some fixed field, like a type, to the class. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
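The suggestion above — serialize the attributes Map only when attributes are present, so plain Scans carry no extra payload — can be sketched like this. The class and method names are illustrative, not HBase's actual Scan API:

```java
import java.io.*;
import java.util.*;

// Hedged sketch: an optional, lazily created attributes Map that is only
// written out when non-empty. A count of 0 on the wire means "no Map", and
// the deserializer leaves the field null in that case.
public class OptionalAttributes {
    private Map<String, byte[]> attributes; // null until first attribute set

    public void setAttribute(String name, byte[] value) {
        if (attributes == null) attributes = new HashMap<>();
        attributes.put(name, value);
    }

    public byte[] getAttribute(String name) {
        return attributes == null ? null : attributes.get(name);
    }

    public void write(DataOutput out) throws IOException {
        if (attributes == null) {
            out.writeInt(0); // nothing to carry for this Scan
            return;
        }
        out.writeInt(attributes.size());
        for (Map.Entry<String, byte[]> e : attributes.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeInt(e.getValue().length);
            out.write(e.getValue());
        }
    }

    public void readFields(DataInput in) throws IOException {
        int n = in.readInt();
        if (n == 0) { attributes = null; return; } // stay null when absent
        attributes = new HashMap<>();
        for (int i = 0; i < n; i++) {
            String name = in.readUTF();
            byte[] value = new byte[in.readInt()];
            in.readFully(value);
            attributes.put(name, value);
        }
    }

    public static void main(String[] args) throws IOException {
        OptionalAttributes scanLike = new OptionalAttributes();
        scanLike.setAttribute("trace.id", "req-42".getBytes());
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        scanLike.write(new DataOutputStream(bos));
        OptionalAttributes copy = new OptionalAttributes();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        System.out.println(new String(copy.getAttribute("trace.id"))); // req-42
    }
}
```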
[jira] [Resolved] (HBASE-3210) HBASE-1921 for the new master
[ https://issues.apache.org/jira/browse/HBASE-3210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack resolved HBASE-3210. -- Resolution: Fixed Assignee: Subbu M Iyer Hadoop Flags: [Reviewed] Committed to TRUNK. Thank you for the patch Subbu. HBASE-1921 for the new master - Key: HBASE-3210 URL: https://issues.apache.org/jira/browse/HBASE-3210 Project: HBase Issue Type: Improvement Reporter: Jean-Daniel Cryans Assignee: Subbu M Iyer Priority: Critical Fix For: 0.92.0 Attachments: HBASE-3210-When_the_Master_s_session_times_out_and_there_s_only_one,_cluster_is_wedged.patch, HBASE-3210-When_the_Master_s_session_times_out_and_there_s_only_one_cluster_is_wedged-2.patch, HBASE-3210-When_the_Master_s_session_times_out_and_there_s_only_one_cluster_is_wedged-3.patch HBASE-1921 was lost when writing the new master code. I guess it's going to be much harder to implement now, but I think it's a critical feature to have considering the reasons that brought me do it in the old master. There's already a test in TestZooKeeper which has been disabled a while ago. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-1502) Remove need for heartbeats in HBase
[ https://issues.apache.org/jira/browse/HBASE-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13025526#comment-13025526 ] jirapos...@reviews.apache.org commented on HBASE-1502: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/647/#review568 --- Ship it! woohoo! glad HMsg is dead! src/main/java/org/apache/hadoop/hbase/ClusterStatus.java https://reviews.apache.org/r/647/#comment1206 Can just use this.liveServers.values() for here and below? src/main/java/org/apache/hadoop/hbase/HServerAddress.java https://reviews.apache.org/r/647/#comment1207 Where is this actually used now? Should point it out here so it's clear and so that when it goes away we know we can get rid of this. src/main/java/org/apache/hadoop/hbase/HServerInfo.java https://reviews.apache.org/r/647/#comment1208 i see webuiport below, does this TODO still apply? src/main/java/org/apache/hadoop/hbase/MasterAddressTracker.java https://reviews.apache.org/r/647/#comment1209 why String and not ServerName? because master has no startcode? (i see use of ServerName for master above tho) src/main/java/org/apache/hadoop/hbase/ServerName.java https://reviews.apache.org/r/647/#comment1210 awesome that this is tucked away in here now src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java https://reviews.apache.org/r/647/#comment1211 this is because HSA actually makes a connection or does the lookup? - Jonathan On 2011-04-26 23:50:23, Michael Stack wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/647/ bq. --- bq. bq. (Updated 2011-04-26 23:50:23) bq. bq. bq. Review request for hbase. bq. bq. bq. Summary bq. --- bq. bq. This patch does not completely remove heartbeats. It unburdens the heartbeat of control messages; now heartbeat is used to bq. send the master load only (At most recent hackathon we had rough agreement that we'd keep heartbeat to carry load)... 
if we miss some, no biggie. bq. bq. RPC version changed on HMasterRegionInfo since the regionServerStartup and regionServerReport arguments have changed. bq. We pass a String now instead of HServerAddress so this should help with our DNS issues where the two sides disagree. bq. bq. Removed HMsg. bq. bq. HServerAddress as been sort_of_deprecated. Its in our API so can't remove it easily (its embedded inside HRegionLocation). bq. Otherwise, we don't use it internally anymore. bq. bq. HServerInfo is deprecated. Server meta data is now available in new class ServerName and load lives apart from HSI now. bq. bq. Fixed up regionserver and master startup so they now look the same. bq. bq. New tests bq. bq. Cruft cleanup. bq. bq. bq. This addresses bug hbase-1502. bq. https://issues.apache.org/jira/browse/hbase-1502 bq. bq. bq. Diffs bq. - bq. bq.src/main/java/org/apache/hadoop/hbase/ClusterStatus.java 26a8bef bq.src/main/java/org/apache/hadoop/hbase/HConstants.java 5701639 bq.src/main/java/org/apache/hadoop/hbase/HMsg.java 87beb00 bq.src/main/java/org/apache/hadoop/hbase/HRegionLocation.java bd353b8 bq.src/main/java/org/apache/hadoop/hbase/HServerAddress.java 7f8a472 bq.src/main/java/org/apache/hadoop/hbase/HServerInfo.java 0b5bd94 bq.src/main/java/org/apache/hadoop/hbase/HServerLoad.java 2372053 bq.src/main/java/org/apache/hadoop/hbase/LocalHBaseCluster.java 0d696ab bq.src/main/java/org/apache/hadoop/hbase/MasterAddressTracker.java 1da9742 bq.src/main/java/org/apache/hadoop/hbase/Server.java df396fa bq.src/main/java/org/apache/hadoop/hbase/ServerName.java PRE-CREATION bq.src/main/java/org/apache/hadoop/hbase/avro/AvroUtil.java d7a1e67 bq.src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java be31179 bq.src/main/java/org/apache/hadoop/hbase/catalog/MetaEditor.java c2ee031 bq.src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java 6e22cf5 bq.src/main/java/org/apache/hadoop/hbase/catalog/RootLocationEditor.java aee64c5 
bq.src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java afb666a bq.src/main/java/org/apache/hadoop/hbase/client/HConnection.java 2bb4725 bq.src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java c348f7a bq.src/main/java/org/apache/hadoop/hbase/client/HTable.java edacf56 bq. src/main/java/org/apache/hadoop/hbase/client/RetriesExhaustedWithDetailsException.java 6c62024 bq. src/main/java/org/apache/hadoop/hbase/coprocessor/BaseMasterObserver.java 8df6aa4 bq.
[jira] [Commented] (HBASE-3777) Redefine Identity Of HBase Configuration
[ https://issues.apache.org/jira/browse/HBASE-3777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13025543#comment-13025543 ] jirapos...@reviews.apache.org commented on HBASE-3777: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/643/#review569 --- src/main/java/org/apache/hadoop/hbase/master/HMaster.java https://reviews.apache.org/r/643/#comment1213 Same comment as in HRS, I think this is creating a second connection for the master. src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java https://reviews.apache.org/r/643/#comment1212 IIUC, we are creating an additional connection here since CT will do a getConnection with the passed conf instead of using a connection that the RS already has. - Jean-Daniel On 2011-04-22 21:16:59, Karthick Sankarachary wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/643/ bq. --- bq. bq. (Updated 2011-04-22 21:16:59) bq. bq. bq. Review request for hbase and Ted Yu. bq. bq. bq. Summary bq. --- bq. bq. Judging from the javadoc in HConnectionManager, sharing connections across multiple clients going to the same cluster is supposedly a good thing. However, the fact that there is a one-to-one mapping between a configuration and connection instance, kind of works against that goal. Specifically, when you create HTable instances using a given Configuration instance and a copy thereof, we end up with two distinct HConnection instances under the covers. Is this really expected behavior, especially given that the configuration instance gets cloned a lot? bq. bq. Here, I'd like to play devil's advocate and propose that we deep-compare HBaseConfiguration instances, so that multiple HBaseConfiguration instances that have the same properties map to the same HConnection instance. 
In case one is concerned that a single HConnection is insufficient for sharing amongst clients, to quote the javadoc, then one should be able to mark a given HBaseConfiguration instance as being uniquely identifiable. bq. bq. bq. This addresses bug HBASE-3777. bq. https://issues.apache.org/jira/browse/HBASE-3777 bq. bq. bq. Diffs bq. - bq. bq.src/main/java/org/apache/hadoop/hbase/HConstants.java 5701639 bq.src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java be31179 bq.src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java afb666a bq.src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java c348f7a bq.src/main/java/org/apache/hadoop/hbase/client/HTable.java edacf56 bq.src/main/java/org/apache/hadoop/hbase/client/HTablePool.java 88827a8 bq.src/main/java/org/apache/hadoop/hbase/client/MetaScanner.java 9e3f4d1 bq. src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java d76e333 bq. src/main/java/org/apache/hadoop/hbase/mapreduce/replication/VerifyReplication.java ed88bfa bq.src/main/java/org/apache/hadoop/hbase/master/HMaster.java 79a48ba bq.src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java d0a1e11 bq. src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java 78c3b42 bq.src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java 5da5e34 bq.src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java b624d28 bq.src/main/java/org/apache/hadoop/hbase/zookeeper/ZKUtil.java 7f5b377 bq.src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWatcher.java dc471c4 bq.src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java e25184e bq.src/test/java/org/apache/hadoop/hbase/catalog/TestMetaReaderEditor.java 60320a3 bq.src/test/java/org/apache/hadoop/hbase/client/TestHCM.java b01a2d2 bq.src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableMapReduce.java 624f4a8 bq.src/test/java/org/apache/hadoop/hbase/util/TestMergeTable.java 8992dbb bq. bq. 
Diff: https://reviews.apache.org/r/643/diff bq. bq. bq. Testing bq. --- bq. bq. mvn test bq. bq. bq. Thanks, bq. bq. Karthick bq. bq. Redefine Identity Of HBase Configuration Key: HBASE-3777 URL: https://issues.apache.org/jira/browse/HBASE-3777 Project: HBase Issue Type: Improvement Components: client, ipc Affects Versions: 0.90.2 Reporter: Karthick Sankarachary Assignee: Karthick Sankarachary Priority: Minor Fix For: 0.92.0
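The proposal above — map configurations with identical properties to the same connection, instead of keying the cache on object identity — can be sketched as follows. Types and names are illustrative stand-ins, not the HConnectionManager implementation:

```java
import java.util.*;

// Hedged sketch of the HBASE-3777 idea: key cached connections by the
// *contents* of the configuration rather than the instance itself, so a
// configuration and a copy of it share one connection.
public class ConnectionCache {
    static class FakeConnection {} // stand-in for HConnection

    private final Map<Map<String, String>, FakeConnection> cache = new HashMap<>();

    // A snapshot of the properties serves as the deep "identity".
    synchronized FakeConnection getConnection(Map<String, String> conf) {
        return cache.computeIfAbsent(new HashMap<>(conf), k -> new FakeConnection());
    }

    public static void main(String[] args) {
        ConnectionCache cc = new ConnectionCache();
        Map<String, String> conf = new HashMap<>();
        conf.put("hbase.zookeeper.quorum", "zk1");
        Map<String, String> copy = new HashMap<>(conf); // a distinct clone
        // Same properties -> same connection, even for distinct instances.
        System.out.println(cc.getConnection(conf) == cc.getConnection(copy)); // true
    }
}
```

With identity-based keying, the clone would have produced a second connection; deep comparison is what collapses the two.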
[jira] [Updated] (HBASE-3821) NOT flushing memstore for region keep on printing for half an hour
[ https://issues.apache.org/jira/browse/HBASE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoushuaifeng updated HBASE-3821: - Attachment: HBase-3821 v1.txt I think there are several ways to fix it: 1. In the rollback handling, like this:
{code}
case CREATE_SPLIT_DIR:
+  this.parent.writestate.writesEnabled = true;
   cleanupSplitDir(fs, this.splitdir);
   break;
{code}
2. Catch the IOException after doclose or preflush. I think the first one is better. What do you think? And is there anything else that should be done? The patch implements the first way. NOT flushing memstore for region keep on printing for half an hour - Key: HBASE-3821 URL: https://issues.apache.org/jira/browse/HBASE-3821 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.90.1 Reporter: zhoushuaifeng Fix For: 0.90.3 Attachments: HBase-3821 v1.txt
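The effect of the proposed rollback fix — re-enabling writes when the split transaction is rolled back, so later flush requests don't loop on "NOT flushing memstore" forever — can be sketched like this. The field names mirror the snippet in the comment, but this is a stand-alone illustration, not the real HRegion/SplitTransaction code:

```java
// Hedged sketch: a split disables writes (and thus flushes) on the parent
// region; if the split rolls back, writesEnabled must be restored or every
// subsequent flush request is refused with "NOT flushing memstore".
public class SplitRollbackSketch {
    static class WriteState { volatile boolean writesEnabled = true; }

    final WriteState writestate = new WriteState();

    void startSplit() {
        writestate.writesEnabled = false; // close/preflush disables writes
    }

    void rollback() {
        // The proposed patch: restore writesEnabled in the CREATE_SPLIT_DIR
        // rollback step, before cleaning up the split directory.
        writestate.writesEnabled = true;
    }

    boolean canFlush() { return writestate.writesEnabled; }

    public static void main(String[] args) {
        SplitRollbackSketch region = new SplitRollbackSketch();
        region.startSplit();
        System.out.println(region.canFlush()); // false: flushes refused mid-split
        region.rollback();
        System.out.println(region.canFlush()); // true: flushes proceed again
    }
}
```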
[jira] [Commented] (HBASE-3821) NOT flushing memstore for region keep on printing for half an hour
[ https://issues.apache.org/jira/browse/HBASE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025557#comment-13025557 ]

zhoushuaifeng commented on HBASE-3821:
--------------------------------------

Yes, the failure to split is a transitory error; the region is able to flush its memstore successfully later.

Attachments: HBase-3821 v1.txt
[jira] [Commented] (HBASE-3752) Tool to replay moved-aside WAL logs
[ https://issues.apache.org/jira/browse/HBASE-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025593#comment-13025593 ]

jack levin commented on HBASE-3752:
-----------------------------------

Getting this error:

12:59:11 10.103.7.1 root@mtag11:~/tmp $ /usr/lib/hbase/bin/hbase org.jruby.Main walplayer.rb 208.94.1.252%3A60020.1303872726316
file:/usr/lib/hbase-JD/hbase-0.89.20100830/lib/jruby-complete-1.4.0.jar!/META-INF/jruby.home/lib/ruby/site_ruby/shared/builtin/javasupport/core_ext/object.rb:37:in `get_proxy_or_package_under_package': cannot load Java class org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException (NameError)
	from file:/usr/lib/hbase-JD/hbase-0.89.20100830/lib/jruby-complete-1.4.0.jar!/META-INF/jruby.home/lib/ruby/site_ruby/shared/builtin/javasupport/java.rb:51:in `method_missing'
	from walplayer.rb:39

Tool to replay moved-aside WAL logs
-----------------------------------

Key: HBASE-3752
URL: https://issues.apache.org/jira/browse/HBASE-3752
Project: HBase
Issue Type: Task
Reporter: stack
Priority: Critical
Attachments: walplayer.rb

We need this.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
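A JRuby NameError like the one above usually means the JVM cannot find the named class on the classpath at all; one plausible cause is that the 0.89.20100830 build in use predates org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException. A minimal JVM-side check (an illustrative helper, not part of HBase or the walplayer script):

```java
// Illustrative helper: report whether a fully-qualified class name is loadable
// on the current classpath. Run with the same classpath the hbase launcher sets
// up to see whether the class the JRuby script needs actually exists there.
public class ClassPresenceCheck {
    static boolean present(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException";
        System.out.println(name + " loadable: " + present(name));
    }
}
```

If the class is absent from that build's jar, the fix is to run the script against an HBase version that ships it rather than to change the script.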
[jira] [Commented] (HBASE-3821) NOT flushing memstore for region keep on printing for half an hour
[ https://issues.apache.org/jira/browse/HBASE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025594#comment-13025594 ]

ramkrishna.s.vasudevan commented on HBASE-3821:
-----------------------------------------------

I would like to add to zhoushuaifeng's analysis. Suppose cleanUpSplitDir() throws an IOException due to some DFS error, and we try to roll back. If we hit another DFS error during the rollback, that exception is not handled: we handle only RuntimeException, so the exception propagates up to the run method of CompactSplitThread, where again it is not handled. I think it is better to handle this so that at least the user comes to know about the exception.
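The hardening ramkrishna suggests can be sketched as catching IOException on the rollback path too, instead of only RuntimeException, so the failure is at least reported. Class and method names below are illustrative stand-ins, not HBase's exact SplitTransaction API:

```java
import java.io.IOException;

// Sketch of split-rollback error handling: the rollback path catches
// IOException as well as RuntimeException, so a DFS error during rollback
// is logged instead of escaping silently to CompactSplitThread.run().
public class SplitRollbackSketch {
    // Simulated split step that can fail with a DFS-style IOException.
    static void cleanUpSplitDir(boolean fail) throws IOException {
        if (fail) throw new IOException("simulated DFS error");
    }

    // Simulated rollback that can itself fail the same way.
    static void rollback(boolean fail) throws IOException {
        if (fail) throw new IOException("simulated DFS error during rollback");
    }

    // Returns a status string instead of letting any exception escape the thread.
    static String runSplit(boolean splitFails, boolean rollbackFails) {
        try {
            cleanUpSplitDir(splitFails);
            return "split-ok";
        } catch (IOException splitError) {
            try {
                rollback(rollbackFails);
                return "rolled-back";
            } catch (IOException | RuntimeException rollbackError) {
                // Previously only RuntimeException was caught here, so an
                // IOException propagated unreported.
                System.err.println("Rollback failed: " + rollbackError.getMessage());
                return "rollback-failed";
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(runSplit(true, true)); // rollback-failed, but now reported
    }
}
```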
[jira] [Updated] (HBASE-3744) createTable blocks until all regions are out of transition
[ https://issues.apache.org/jira/browse/HBASE-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-3744:
--------------------------

Attachment: 3744-addendum.patch

This patch simulates the semantics of waiting for region assignment.

createTable blocks until all regions are out of transition
----------------------------------------------------------

Key: HBASE-3744
URL: https://issues.apache.org/jira/browse/HBASE-3744
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.1
Reporter: Todd Lipcon
Assignee: Ted Yu
Priority: Critical
Fix For: 0.90.3
Attachments: 3744-addendum.patch, 3744-addendum.txt, 3744-v2.txt, 3744-v3.txt, 3744.txt, create_big_tables.rb, create_big_tables.rb, create_big_tables.rb

In HBASE-3305, the behavior of createTable was changed, which introduced this bug: createTable now blocks until all regions have been assigned, since it uses BulkStartupAssigner. BulkStartupAssigner.waitUntilDone calls assignmentManager.waitUntilNoRegionsInTransition, which waits across all regions, not just the regions of the table that has just been created. We saw an issue where one table had a region which was unable to be opened, so it was stuck in RegionsInTransition permanently (every open was failing). Since this was the case, waitUntilDone would always block indefinitely even though the newly created table had been assigned.
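The core of the bug above is the scope of the wait condition: cluster-wide "no regions in transition" versus "none of the new table's regions in transition". A minimal sketch of the difference, with a stand-in for the regions-in-transition tracker (names are illustrative, not HBase's AssignmentManager API):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: waiting only until the NEW table's regions leave the
// regions-in-transition set, rather than until the whole cluster has none.
public class CreateTableWaitSketch {
    // Stand-in for the master's regions-in-transition tracker.
    static class Transitions {
        final Set<String> inTransition = new HashSet<>();
    }

    // Old condition: done only when NO region of ANY table is in transition.
    static boolean clusterQuiet(Transitions t) {
        return t.inTransition.isEmpty();
    }

    // Scoped condition: done as soon as none of OUR regions are in transition.
    static boolean tableAssigned(Transitions t, Set<String> ourRegions) {
        for (String region : ourRegions) {
            if (t.inTransition.contains(region)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Transitions t = new Transitions();
        t.inTransition.add("othertable,stuck-region"); // permanently failing open
        Set<String> ours = new HashSet<>();
        ours.add("newtable,region-a");
        System.out.println(clusterQuiet(t));        // false: foreign region is stuck
        System.out.println(tableAssigned(t, ours)); // true: our regions are assigned
    }
}
```

With the scoped condition, a foreign region stuck in transition (the failure mode Todd describes) no longer blocks createTable indefinitely.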