[jira] [Commented] (HBASE-7669) ROOT region wouldn't be handled by PRI-IPC-Handler
[ https://issues.apache.org/jira/browse/HBASE-7669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563404#comment-13563404 ] Hudson commented on HBASE-7669: --- Integrated in HBase-TRUNK #3802 (See [https://builds.apache.org/job/HBase-TRUNK/3802/]) HBASE-7669 Addendum fixes TestPriorityRpc (Revision 1438853) Result = FAILURE tedyu : Files : * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestPriorityRpc.java ROOT region wouldn't be handled by PRI-IPC-Handler --- Key: HBASE-7669 URL: https://issues.apache.org/jira/browse/HBASE-7669 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.94.4 Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.96.0, 0.94.5 Attachments: 7669-94.patch, 7669.addendum, HBASE-7669.patch RPC requests for the ROOT region should be handled by PRI-IPC-Handler, just the same as for the META region -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7671) Flushing memstore again after last failure could cause data loss
[ https://issues.apache.org/jira/browse/HBASE-7671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563405#comment-13563405 ] Hadoop QA commented on HBASE-7671: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12566619/HBASE-7671v3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:red}-1 core tests{color}. 
The patch failed these unit tests: org.apache.hadoop.hbase.security.access.TestTablePermissions Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4198//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4198//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4198//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4198//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4198//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4198//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4198//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4198//console This message is automatically generated. Flushing memstore again after last failure could cause data loss Key: HBASE-7671 URL: https://issues.apache.org/jira/browse/HBASE-7671 Project: HBase Issue Type: Bug Affects Versions: 0.94.4 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.96.0, 0.94.5 Attachments: HBASE-7671.patch, HBASE-7671v2.patch, HBASE-7671v3.patch See the following logs first: {code} 2013-01-23 18:58:38,801 INFO org.apache.hadoop.hbase.regionserver.Store: Flushed , sequenceid=9746535080, memsize=101.8m, into tmp file hdfs://dw77.kgb.sqa.cm4:9900/hbase-test3/writetest1/8dc14e35b4d7c0e481e0bb30849cff7d/.tmp/bebeeecc56364b6c8126cf1dc6782a25 2013-01-23 18:58:41,982 WARN org.apache.hadoop.hbase.regionserver.MemStore: Snapshot called again without clearing previous. Doing nothing. 
Another ongoing flush or did we fail last attempt? 2013-01-23 18:58:43,274 INFO org.apache.hadoop.hbase.regionserver.Store: Flushed , sequenceid=9746599334, memsize=101.8m, into tmp file hdfs://dw77.kgb.sqa.cm4:9900/hbase-test3/writetest1/8dc14e35b4d7c0e481e0bb30849cff7d/.tmp/4eede32dc469480bb3d469aaff332313 {code} The first memstore flush failed in commitFile() (the first log entry above), which triggered a server abort. However, another flush arrived immediately (possibly caused by a move or split; the third log entry above) and succeeded. The same memstore snapshot was thus flushed under two different sequence ids, which causes data loss when log edits are replayed. See the unit test case in the patch for details.
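A minimal sketch of the failure mode described above, assuming that log replay skips every edit whose sequence id is at or below the sequence id recorded for the flushed file. The class and method names are hypothetical and this is not HBase code; it only models the bookkeeping with small numbers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not HBase code: a flushed file records the sequence id
// it claims to cover, and replay skips WAL edits at or below that id.
class FlushReplaySketch {

    /** Returns the WAL edits that replay would re-apply after a flush labeled flushedSeqId. */
    static List<Long> replayed(List<Long> walEdits, long flushedSeqId) {
        List<Long> out = new ArrayList<>();
        for (long seq : walEdits) {
            if (seq > flushedSeqId) {
                out.add(seq);
            }
        }
        return out;
    }

    /**
     * Returns the edits that are neither contained in the flushed snapshot
     * (seq <= snapshotMaxSeq) nor replayed afterwards (seq > flushedSeqId).
     * Those edits are lost.
     */
    static List<Long> lost(List<Long> walEdits, long snapshotMaxSeq, long flushedSeqId) {
        List<Long> out = new ArrayList<>();
        for (long seq : walEdits) {
            boolean inSnapshot = seq <= snapshotMaxSeq;
            boolean replayedLater = seq > flushedSeqId;
            if (!inSnapshot && !replayedLater) {
                out.add(seq);
            }
        }
        return out;
    }
}
```

With WAL edits 10, 20, 30, 40 and a snapshot covering edits up to 20, labeling the flushed file with 20 loses nothing, while re-flushing the same snapshot under a later id such as 30 leaves edit 30 both outside the file and skipped on replay.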
[jira] [Commented] (HBASE-7521) fix HBASE-6060 (regions stuck in opening state) in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563406#comment-13563406 ] ramkrishna.s.vasudevan commented on HBASE-7521: --- @Sergey Went through the patch. It seems fine with respect to the changes taken from the old patch. After looking at the patch, I feel we could rename some variables such as regionsFromRIT and regionsIntersection. Is it possible to rename them so that the names suggest what they are? Overall it looks fine. One question: did you check the impact of now making the RS transition the node to OPENING? Thanks, Sergey. fix HBASE-6060 (regions stuck in opening state) in 0.94 --- Key: HBASE-7521 URL: https://issues.apache.org/jira/browse/HBASE-7521 Project: HBase Issue Type: Bug Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: HBASE-7521-original-patch-ported-v0.patch, HBASE-7521-v0.patch, HBASE-7521-v1.patch Discussion in HBASE-6060 implies that the fix there does not work on 0.94. Still, we may want to fix the issue in 0.94 (via some different fix) because regions stuck in opening for ridiculous amounts of time are not a good thing to have.
[jira] [Commented] (HBASE-7671) Flushing memstore again after last failure could cause data loss
[ https://issues.apache.org/jira/browse/HBASE-7671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563421#comment-13563421 ] ramkrishna.s.vasudevan commented on HBASE-7671: --- I will review this today and get back, at least to understand the problem. Flushing memstore again after last failure could cause data loss Key: HBASE-7671 URL: https://issues.apache.org/jira/browse/HBASE-7671 Project: HBase Issue Type: Bug Affects Versions: 0.94.4 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.96.0, 0.94.5 Attachments: HBASE-7671.patch, HBASE-7671v2.patch, HBASE-7671v3.patch See the following logs first: {code} 2013-01-23 18:58:38,801 INFO org.apache.hadoop.hbase.regionserver.Store: Flushed , sequenceid=9746535080, memsize=101.8m, into tmp file hdfs://dw77.kgb.sqa.cm4:9900/hbase-test3/writetest1/8dc14e35b4d7c0e481e0bb30849cff7d/.tmp/bebeeecc56364b6c8126cf1dc6782a25 2013-01-23 18:58:41,982 WARN org.apache.hadoop.hbase.regionserver.MemStore: Snapshot called again without clearing previous. Doing nothing. Another ongoing flush or did we fail last attempt? 2013-01-23 18:58:43,274 INFO org.apache.hadoop.hbase.regionserver.Store: Flushed , sequenceid=9746599334, memsize=101.8m, into tmp file hdfs://dw77.kgb.sqa.cm4:9900/hbase-test3/writetest1/8dc14e35b4d7c0e481e0bb30849cff7d/.tmp/4eede32dc469480bb3d469aaff332313 {code} The first memstore flush failed in commitFile() (the first log entry above), which triggered a server abort. However, another flush arrived immediately (possibly caused by a move or split; the third log entry above) and succeeded. The same memstore snapshot was thus flushed under two different sequence ids, which causes data loss when log edits are replayed. See the unit test case in the patch for details.
[jira] [Commented] (HBASE-7669) ROOT region wouldn't be handled by PRI-IPC-Handler
[ https://issues.apache.org/jira/browse/HBASE-7669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563470#comment-13563470 ] Hudson commented on HBASE-7669: --- Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #376 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/376/]) HBASE-7669 Addendum fixes TestPriorityRpc (Revision 1438853) HBASE-7669 ROOT region wouldn't be handled by PRI-IPC-Handler(Chunhui) (Revision 1438834) Result = FAILURE tedyu : Files : * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestPriorityRpc.java zjushch : Files : * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java ROOT region wouldn't be handled by PRI-IPC-Handler --- Key: HBASE-7669 URL: https://issues.apache.org/jira/browse/HBASE-7669 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.94.4 Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.96.0, 0.94.5 Attachments: 7669-94.patch, 7669.addendum, HBASE-7669.patch RPC requests for the ROOT region should be handled by PRI-IPC-Handler, just the same as for the META region
[jira] [Commented] (HBASE-7654) Add List<String> getCoprocessors() to HTableDescriptor
[ https://issues.apache.org/jira/browse/HBASE-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563471#comment-13563471 ] Hudson commented on HBASE-7654: --- Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #376 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/376/]) HBASE-7654. Add List<String> getCoprocessors() to HTableDescriptor (Jean-Marc Spaggiari) (Revision 1438789) Result = FAILURE apurtell : Files : * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/HTableDescriptor.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/TestHTableDescriptor.java Add List<String> getCoprocessors() to HTableDescriptor -- Key: HBASE-7654 URL: https://issues.apache.org/jira/browse/HBASE-7654 Project: HBase Issue Type: Bug Components: Client, Coprocessors Affects Versions: 0.96.0, 0.94.5 Reporter: Jean-Marc Spaggiari Assignee: Jean-Marc Spaggiari Priority: Critical Fix For: 0.96.0, 0.94.5 Attachments: HBASE-7654-v0-0.94.patch, HBASE-7654-v0-trunk.patch Add List<String> getCoprocessors() to HTableInterface to retrieve the list of coprocessors loaded into this table.
[jira] [Commented] (HBASE-7516) Make compaction policy pluggable
[ https://issues.apache.org/jira/browse/HBASE-7516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563507#comment-13563507 ] Jimmy Xiang commented on HBASE-7516: Since HBASE-7571 is in, I will rebase and post another patch. Make compaction policy pluggable Key: HBASE-7516 URL: https://issues.apache.org/jira/browse/HBASE-7516 Project: HBase Issue Type: Sub-task Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: HBASE-7516-v0.patch, HBASE-7516-v1.patch, HBASE-7516-v2.patch, trunk-7516.patch, trunk-7516_v2.patch Currently, the compaction selection is pluggable. It will be great to make the compaction algorithm pluggable too so that we can implement and play with other compaction algorithms.
[jira] [Commented] (HBASE-7669) ROOT region wouldn't be handled by PRI-IPC-Handler
[ https://issues.apache.org/jira/browse/HBASE-7669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563518#comment-13563518 ] Ted Yu commented on HBASE-7669: --- Since this change went in, TestSplitTransactionOnCluster has failed in trunk from #3801 to #3803. I wasn't able to reproduce the test failure locally using java 1.6. From https://builds.apache.org/job/HBase-TRUNK/3801/testReport/junit/org.apache.hadoop.hbase.regionserver/TestSplitTransactionOnCluster/testShutdownSimpleFixup/: {code} 2013-01-26 05:28:56,063 ERROR [RS_OPEN_REGION-vesta.apache.org,40191,1359177837124-2] handler.OpenRegionHandler(376): Failed transitioning node testShutdownSimpleFixup,mnk,1359178057447.f4b0d2459f3a827fb4b950777b219e03. from OPENING to FAILED_OPEN org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/region-in-transition/f4b0d2459f3a827fb4b950777b219e03 at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:357) at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:849) at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:773) at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:711) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.tryTransitionFromOpeningToFailedOpen(OpenRegionHandler.java:363) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:144) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:202) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) 2013-01-26 
05:28:56,065 INFO [hbase-am-pool-79-thread-4] master.AssignmentManager(1596): Assigning region testMasterRestartAtRegionSplitPendingCatalogJanitor,mnk,1359177884826.317c20e7b762e72f3423ebc44fdbd0b5. to vesta.apache.org,40191,1359177837124 2013-01-26 05:28:56,064 DEBUG [hbase-am-zkevent-worker-pool-80-thread-9] master.AssignmentManager(641): Handling transition=M_ZK_REGION_OFFLINE, server=vesta.apache.org,40191,1359177837124, region=f4b0d2459f3a827fb4b950777b219e03, current state from region state map ={testShutdownSimpleFixup,mnk,1359178057447.f4b0d2459f3a827fb4b950777b219e03. state=OFFLINE, ts=1359178134450, server=null} {code} Not sure if the above was the cause of the test failure. ROOT region wouldn't be handled by PRI-IPC-Handler --- Key: HBASE-7669 URL: https://issues.apache.org/jira/browse/HBASE-7669 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.94.4 Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.96.0, 0.94.5 Attachments: 7669-94.patch, 7669.addendum, HBASE-7669.patch RPC requests for the ROOT region should be handled by PRI-IPC-Handler, just the same as for the META region
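For readers unfamiliar with the BadVersion error in the log above: ZooKeeper's setData takes an expected version and rejects the write if the node has changed since that version was read. The sketch below is a hypothetical in-memory stand-in, not the ZooKeeper API, and the interleaving it replays (the master rewriting the node before the region server's transition) is only an assumption about what might have happened, not a diagnosis of the test failure.

```java
// Hypothetical in-memory stand-in for a versioned znode; not the ZooKeeper API.
class VersionedNodeSketch {
    private String data;
    private int version = 0;

    VersionedNodeSketch(String initial) {
        this.data = initial;
    }

    synchronized int getVersion() {
        return version;
    }

    synchronized String getData() {
        return data;
    }

    /**
     * Writes only if expectedVersion matches the current version.
     * A real ZooKeeper server would raise a BadVersion error instead of returning false.
     */
    synchronized boolean setData(String newData, int expectedVersion) {
        if (expectedVersion != version) {
            return false;
        }
        data = newData;
        version++;
        return true;
    }

    /** Replays the assumed interleaving: both parties read the same version, then both write. */
    static boolean staleWriterRejected() {
        VersionedNodeSketch node = new VersionedNodeSketch("OPENING");
        int seenByRegionServer = node.getVersion();
        int seenByMaster = node.getVersion();
        boolean masterOk = node.setData("OFFLINE", seenByMaster);
        boolean regionServerOk = node.setData("FAILED_OPEN", seenByRegionServer);
        return masterOk && !regionServerOk && node.getData().equals("OFFLINE");
    }
}
```

The second writer loses because its expected version is stale, which is the general shape of a BadVersion failure regardless of who the two writers are.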
[jira] [Commented] (HBASE-7669) ROOT region wouldn't be handled by PRI-IPC-Handler
[ https://issues.apache.org/jira/browse/HBASE-7669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563520#comment-13563520 ] Ted Yu commented on HBASE-7669: --- Well, build #3804 almost passed - only two tests in the example module failed. ROOT region wouldn't be handled by PRI-IPC-Handler --- Key: HBASE-7669 URL: https://issues.apache.org/jira/browse/HBASE-7669 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.94.4 Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.96.0, 0.94.5 Attachments: 7669-94.patch, 7669.addendum, HBASE-7669.patch RPC requests for the ROOT region should be handled by PRI-IPC-Handler, just the same as for the META region
[jira] [Commented] (HBASE-7681) Increase timeouts in some more test...?
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563528#comment-13563528 ] Ted Yu commented on HBASE-7681: --- TestCatalogTrackerOnCluster, which failed in build #778, wasn't touched by your patch. TestRegionServerCoprocessorExceptionWithAbort failed in build #776. TestReplicationWithCompression.queueFailover and TestReplication.queueFailover failed in build #770. Shall we focus on tests that recently failed and try to find the root cause? I think TestNodeHealthCheckChore can be modified so that it passes reliably - it is a small test whose failure would stop the build early. Increase timeouts in some more test...? --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94.txt I've seen many unspecific test failures recently that cannot be reproduced locally, even when running these tests in a loop for a very long time. Many of these tests, one way or another, make assumptions about wall clock time. While I cannot fix that, one option is to increase some of these timeouts a bit. This issue is to remind me to do that.
[jira] [Commented] (HBASE-7681) Increase timeouts in some more test...?
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563545#comment-13563545 ] Lars Hofhansl commented on HBASE-7681: -- Personally I spent more time than I should have fixing tests during the past month or so; I won't have much more time to spend on this. Do you think this patch is not commit worthy, Ted? I'd be fine closing this as Won't Fix if you don't think it'll help. Increase timeouts in some more test...? --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94.txt I've seen many unspecific test failures recently that cannot be reproduced locally, even when running these tests in a loop for a very long time. Many of these tests, one way or another, make assumptions about wall clock time. While I cannot fix that, one option is to increase some of these timeouts a bit. This issue is to remind me to do that.
[jira] [Commented] (HBASE-7681) Increase timeouts in some more test...?
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563546#comment-13563546 ] Ted Yu commented on HBASE-7681: --- For 0.94, on one hand we want to make it more stable. Yet, more features are coming in. I think we can enhance test-patch.sh so that it recognizes a designated pattern in the patch filename and runs the 0.94 test suite, e.g. when the filename contains [^0-9]0.94[^0-9], the 0.94 code would be checked out. I am running the 0.94 test suite locally to see if I can reproduce the hanging test(s). I will also dig into the test failures mentioned above. Increase timeouts in some more test...? --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94.txt I've seen many unspecific test failures recently that cannot be reproduced locally, even when running these tests in a loop for a very long time. Many of these tests, one way or another, make assumptions about wall clock time. While I cannot fix that, one option is to increase some of these timeouts a bit. This issue is to remind me to do that.
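The filename check proposed in the comment above could look roughly like the following. This is a sketch in Java rather than in the shell that test-patch.sh is written in, and the class and method names are hypothetical. One deliberate difference from the pattern as quoted: the dot is escaped here so that it matches only a literal ".", whereas an unescaped dot would match any character.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of test-patch.sh: decides from the patch
// filename alone whether the patch is meant for the 0.94 branch.
class PatchBranchSketch {
    // "0.94" surrounded on both sides by a non-digit character.
    private static final Pattern BRANCH_094 = Pattern.compile("[^0-9]0\\.94[^0-9]");

    /** Returns true if the filename contains the designated 0.94 marker. */
    static boolean targets094(String patchFileName) {
        return BRANCH_094.matcher(patchFileName).find();
    }
}
```

Under this pattern a name such as 7681-0.94.txt or HBASE-7654-v0-0.94.patch is recognized, while a name such as 7669-94.patch is not, so contributors would have to follow the naming convention exactly.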
[jira] [Commented] (HBASE-7681) Increase timeouts in some more test...?
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563549#comment-13563549 ] Lars Hofhansl commented on HBASE-7681: -- Thanks Ted. As I said on the mailing list, these seem to be qualitatively different from what we've seen previously. In the previous round of test fixes I did, I was able to reproduce the failures locally by running the tests in a loop long enough. With this new round of failures that does not work: I ran the tests in a loop hundreds of times and they did not fail. I wonder if something changed with the jenkins VMs (such as them being suspended for long periods of time, which would hose any test relying on wall clock time directly or indirectly). Increase timeouts in some more test...? --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94.txt I've seen many unspecific test failures recently that cannot be reproduced locally, even when running these tests in a loop for a very long time. Many of these tests, one way or another, make assumptions about wall clock time. While I cannot fix that, one option is to increase some of these timeouts a bit. This issue is to remind me to do that.
[jira] [Commented] (HBASE-7681) Increase timeouts in some more test...?
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563564#comment-13563564 ] Ted Yu commented on HBASE-7681: --- From https://builds.apache.org/job/HBase-0.94/776/testReport/junit/org.apache.hadoop.hbase.coprocessor/TestRegionServerCoprocessorExceptionWithAbort/testExceptionFromCoprocessorDuringPut/ : {code} 2013-01-26 01:03:43,509 ERROR [RS_OPEN_REGION-juno.apache.org,41700,1359162211501-1] coprocessor.CoprocessorHost$Environment(657): Error starting coprocessor org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort$BuggyRegionObserver org.apache.hadoop.hbase.regionserver.WrongRegionException: Requested row out of range for row lock on HRegion observed_table,,1359162223098.d367ba98501279646d44c0c1530fb6c4., startKey='', getEndKey()='bbb', row='some row' at org.apache.hadoop.hbase.regionserver.HRegion.checkRow(HRegion.java:3191) at org.apache.hadoop.hbase.regionserver.HRegion.internalObtainRowLock(HRegion.java:3240) at org.apache.hadoop.hbase.regionserver.HRegion.getLock(HRegion.java:3336) at org.apache.hadoop.hbase.coprocessor.SimpleRegionObserver.start(SimpleRegionObserver.java:110) at org.apache.hadoop.hbase.coprocessor.CoprocessorHost$Environment.startup(CoprocessorHost.java:654) at org.apache.hadoop.hbase.coprocessor.CoprocessorHost.loadInstance(CoprocessorHost.java:312) at org.apache.hadoop.hbase.coprocessor.CoprocessorHost.loadSystemCoprocessors(CoprocessorHost.java:149) at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.init(RegionCoprocessorHost.java:144) at org.apache.hadoop.hbase.regionserver.HRegion.init(HRegion.java:464) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.hbase.regionserver.HRegion.newHRegion(HRegion.java:3937) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4118) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:332) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169) {code} SimpleRegionObserver is inherited by BuggyRegionObserver. If the coprocessor cannot start, this test won't succeed. Here is SimpleRegionObserver#start():
{code}
public void start(CoprocessorEnvironment e) throws IOException {
  // this only makes sure that leases and locks are available to coprocessors
  ...
  Integer lid = re.getRegion().getLock(null, Bytes.toBytes("some row"), true);
  re.getRegion().releaseRowLock(lid);
{code}
We should at least wrap the call to getLock() with try/catch. Increase timeouts in some more test...? --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94.txt I've seen many unspecific test failures recently that cannot be reproduced locally, even when running these tests in a loop for a very long time. Many of these tests, one way or another, make assumptions about wall clock time. While I cannot fix that, one option is to increase some of these timeouts a bit. This issue is to remind me to do that.
[jira] [Commented] (HBASE-4709) Hadoop metrics2 setup in test MiniDFSClusters spewing JMX errors
[ https://issues.apache.org/jira/browse/HBASE-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563565#comment-13563565 ] Ted Yu commented on HBASE-4709: --- InstanceAlreadyExistsException appears in 0.94 test output as well e.g. https://builds.apache.org/job/HBase-0.94/776/testReport/junit/org.apache.hadoop.hbase.coprocessor/TestRegionServerCoprocessorExceptionWithAbort/testExceptionFromCoprocessorDuringPut/ Shall we port the fix to 0.94 ? Hadoop metrics2 setup in test MiniDFSClusters spewing JMX errors Key: HBASE-4709 URL: https://issues.apache.org/jira/browse/HBASE-4709 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.90.4, 0.92.0, 0.94.0, 0.94.1, 0.96.0 Reporter: Gary Helmling Priority: Minor Fix For: 0.96.0 Attachments: 4709.v2.patch, 4709.v2.patch, 4709_workaround.v1.patch Since switching over HBase to build with Hadoop 0.20.205.0, we've been getting a lot of metrics related errors in the log files for tests: {noformat} 2011-10-30 22:00:22,858 INFO [main] log.Slf4jLog(67): jetty-6.1.26 2011-10-30 22:00:22,871 INFO [main] log.Slf4jLog(67): Extract jar:file:/home/jenkins/.m2/repository/org/apache/hadoop/hadoop-core/0.20.205.0/hadoop-core-0.20.205.0.jar!/webapps/datanode to /tmp/Jetty_localhost_55751_datanode.kw16hy/webapp 2011-10-30 22:00:23,048 INFO [main] log.Slf4jLog(67): Started SelectChannelConnector@localhost:55751 Starting DataNode 1 with dfs.data.dir: /home/jenkins/jenkins-slave/workspace/HBase-TRUNK/trunk/target/test-data/7ba65a16-03ad-4624-b769-57405945ef58/dfscluster_3775fc23-1b51-4966-8133-205564bae762/dfs/data/data3,/home/jenkins/jenkins-slave/workspace/HBase-TRUNK/trunk/target/test-data/7ba65a16-03ad-4624-b769-57405945ef58/dfscluster_3775fc23-1b51-4966-8133-205564bae762/dfs/data/data4 2011-10-30 22:00:23,237 WARN [main] impl.MetricsSystemImpl(137): Metrics system not started: Cannot locate configuration: tried hadoop-metrics2-datanode.properties, hadoop-metrics2.properties 2011-10-30 
22:00:23,237 WARN [main] util.MBeans(59): Hadoop:service=DataNode,name=MetricsSystem,sub=Control javax.management.InstanceAlreadyExistsException: MXBean already registered with name Hadoop:service=NameNode,name=MetricsSystem,sub=Control at com.sun.jmx.mbeanserver.MXBeanLookup.addReference(MXBeanLookup.java:120) at com.sun.jmx.mbeanserver.MXBeanSupport.register(MXBeanSupport.java:143) at com.sun.jmx.mbeanserver.MBeanSupport.preRegister2(MBeanSupport.java:183) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:941) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:917) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:312) at com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:482) at org.apache.hadoop.metrics2.util.MBeans.register(MBeans.java:56) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.initSystemMBean(MetricsSystemImpl.java:500) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.init(MetricsSystemImpl.java:140) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.init(DefaultMetricsSystem.java:40) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.initialize(DefaultMetricsSystem.java:50) at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1483) at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1459) at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417) at org.apache.hadoop.hdfs.MiniDFSCluster.init(MiniDFSCluster.java:280) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniDFSCluster(HBaseTestingUtility.java:349) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:518) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:474) at 
org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:461) {noformat} This seems to be due to errors initializing the new hadoop metrics2 code by default, when running in a mini cluster. The errors themselves seem to be harmless -- they're not breaking any tests -- but we should figure out what configuration we need to eliminate them.
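Reading the trace above, the exception appears to come from one MXBean object being registered a second time under a different name (first as NameNode, then as DataNode), which would be expected when several daemons share a single JVM in a mini cluster. The sketch below reproduces that JMX behaviour in isolation, assuming the default JVM setting that forbids one MXBean instance from having several names. The class, interface and object names are hypothetical and unrelated to Hadoop's metrics2 code.

```java
import java.lang.management.ManagementFactory;
import javax.management.InstanceAlreadyExistsException;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical demo, not Hadoop code: one MXBean instance, two object names.
class SharedMXBeanSketch {
    public interface ControlMXBean {
        int getValue();
    }

    public static class Control implements ControlMXBean {
        public int getValue() {
            return 1;
        }
    }

    /** Registers one shared bean under firstName, then tries secondName; true if that is rejected. */
    static boolean secondNameRejected(String firstName, String secondName) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Control shared = new Control();
        try {
            ObjectName first = new ObjectName(firstName);
            ObjectName second = new ObjectName(secondName);
            server.registerMBean(shared, first);
            try {
                server.registerMBean(shared, second);
                // Only reached if this JVM is configured to allow multiple names.
                server.unregisterMBean(second);
                return false;
            } catch (InstanceAlreadyExistsException e) {
                return true;
            } finally {
                server.unregisterMBean(first);
            }
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

If this reading is right, the warning is a side effect of sharing a process-wide singleton across daemons, which fits the observation that the errors do not break any tests.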
[jira] [Commented] (HBASE-7681) Increase timeouts in some more test...?
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563580#comment-13563580 ] Lars Hofhansl commented on HBASE-7681: -- Interesting. So that only fails when we load that coprocessor into a region which has a non-empty start key > "some row" or a non-empty end key <= "some row". I added that part to SimpleRegionObserver. We should just remove it. It is just making sure that nobody makes HRegion.getLock(...) private again. Increase timeouts in some more test...? --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94.txt I've seen many unspecific test failures recently that cannot be reproduced locally, even when running these tests in a loop for a very long time. Many of these tests, one way or another, make assumptions about wall clock time. While I cannot fix that, one option is to increase some of these timeouts a bit. This issue is to remind me to do that.
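A sketch of the range check that makes the probe in SimpleRegionObserver fail, assuming the usual region convention that the start key is inclusive, the end key is exclusive, and an empty key means unbounded. The helper is hypothetical and compares strings rather than byte arrays for brevity; it is not the HRegion implementation.

```java
// Hypothetical helper, not HBase code: decides whether a row belongs to a region
// whose key range is [startKey, endKey), with "" meaning "no bound".
class RowRangeSketch {
    static boolean rowInRange(String startKey, String endKey, String row) {
        boolean atOrAfterStart = startKey.isEmpty() || row.compareTo(startKey) >= 0;
        boolean beforeEnd = endKey.isEmpty() || row.compareTo(endKey) < 0;
        return atOrAfterStart && beforeEnd;
    }
}
```

With the values from the log earlier in this thread (startKey='', getEndKey()='bbb', row='some row') the check fails because "some row" sorts after "bbb", whereas a region with both keys empty accepts any row. That would explain why the probe only breaks once the observed table has more than one region.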
[jira] [Created] (HBASE-7682) Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()
Matteo Bertozzi created HBASE-7682: -- Summary: Replace HRegion custom File-System debug, with FSUtils.logFileSystemState() Key: HBASE-7682 URL: https://issues.apache.org/jira/browse/HBASE-7682 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.96.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Trivial Fix For: 0.96.0 Attachments: HBASE-7682-v0.patch We can remove some code from HRegion by using the new FSUtils.logFileSystemState() instead of the custom HRegion.listPaths()
[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars Hofhansl updated HBASE-7681: - Summary: Investigate recent random test failures in 0.94 (was: Increase timeouts in some more test...?) Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94.txt
[jira] [Updated] (HBASE-7682) Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()
[ https://issues.apache.org/jira/browse/HBASE-7682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matteo Bertozzi updated HBASE-7682: --- Attachment: HBASE-7682-v0.patch Replace HRegion custom File-System debug, with FSUtils.logFileSystemState() --- Key: HBASE-7682 URL: https://issues.apache.org/jira/browse/HBASE-7682 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.96.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Trivial Fix For: 0.96.0 Attachments: HBASE-7682-v0.patch
[jira] [Commented] (HBASE-7681) Increase timeouts in some more test...?
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563581#comment-13563581 ] Lars Hofhansl commented on HBASE-7681: -- In the latest run I see a failure in TestImportTsv caused by this:
{code}
2013-01-26 18:05:13,972 ERROR [pool-30-thread-1] mapred.JobTracker(4228): Job initialization failed:
java.io.IOException: Number of maps in JobConf doesn't match number of recieved splits for job job_20130126180500315_0001! numMapTasks=2, #splits=1
    at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:703)
    at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:4207)
    at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
{code}
Increase timeouts in some more test...? --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94.txt
[jira] [Updated] (HBASE-7682) Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()
[ https://issues.apache.org/jira/browse/HBASE-7682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matteo Bertozzi updated HBASE-7682: --- Status: Patch Available (was: Open) Replace HRegion custom File-System debug, with FSUtils.logFileSystemState() --- Key: HBASE-7682 URL: https://issues.apache.org/jira/browse/HBASE-7682 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.96.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Trivial Fix For: 0.96.0 Attachments: HBASE-7682-v0.patch
[jira] [Commented] (HBASE-7507) Make memstore flush be able to retry after exception
[ https://issues.apache.org/jira/browse/HBASE-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563583#comment-13563583 ] Lars Hofhansl commented on HBASE-7507: -- I think this could be a candidate for revert in light of recent unspecific test failures. I'll double check before I do that. Make memstore flush be able to retry after exception Key: HBASE-7507 URL: https://issues.apache.org/jira/browse/HBASE-7507 Project: HBase Issue Type: Bug Affects Versions: 0.94.3 Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.96.0, 0.94.5 Attachments: 7507-94.patch, 7507-trunk v1.patch, 7507-trunk v2.patch, 7507-trunkv3.patch
We abort the regionserver if a memstore flush throws an exception. I think we could retry to make the regionserver more stable, because the file system may be temporarily unavailable, e.g. when switching namenodes in a NameNode HA environment.
{code}
HRegion#internalFlushcache() {
  ...
  try {
    ...
  } catch (Throwable t) {
    DroppedSnapshotException dse = new DroppedSnapshotException("region: " +
        Bytes.toStringBinary(getRegionName()));
    dse.initCause(t);
    throw dse;
  }
  ...
}

MemStoreFlusher#flushRegion() {
  ...
  try {
    region.flushcache();
    ...
  } catch (DroppedSnapshotException ex) {
    server.abort("Replay of HLog required. Forcing server shutdown", ex);
  }
  ...
}
{code}
[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-7681: -- Attachment: 7681-94-v1.txt Work in progress. Please let me know the test you're working on so that the progress is faster. Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94.txt, 7681-94-v1.txt
[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563590#comment-13563590 ] Lars Hofhansl commented on HBASE-7681: -- Did you mean to use org.mortbay.log.Log? Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94.txt, 7681-94-v1.txt
[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-7681: -- Attachment: 7681-94-v2.txt Patch v2 corrects logging. Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt
[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563593#comment-13563593 ] Ted Yu commented on HBASE-7681: --- w.r.t. the mismatch between the number of maps and splits, here is how JobConf retrieves the number of maps:
{code}
public int getNumMapTasks() {
  return getInt("mapred.map.tasks", 1);
}
{code}
I searched the 0.94 code base and didn't see the above config param. Looks like this error may be out of our control. Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt
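The default-valued lookup quoted in this comment can be tried without Hadoop on the classpath. The following is only a minimal sketch: `MapTaskConfigSketch` and its `getInt` helper are hypothetical stand-ins built on `java.util.Properties`, not the Hadoop `JobConf` or `Configuration` classes. The key `mapred.map.tasks` and the default of 1 are taken from the snippet above. It shows that a reported `numMapTasks=2` implies something set the key explicitly, since an unset key falls back to 1.

```java
import java.util.Properties;

public class MapTaskConfigSketch {
    // Hypothetical stand-in for a configuration lookup with a default value.
    public static int getInt(Properties conf, String key, int defaultValue) {
        String raw = conf.getProperty(key);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    // Mirrors the shape of the quoted getNumMapTasks(): key plus default of 1.
    public static int getNumMapTasks(Properties conf) {
        return getInt(conf, "mapred.map.tasks", 1);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Key not set anywhere: the default is returned.
        System.out.println("unset: " + getNumMapTasks(conf));
        // Key set by some other component: the explicit value wins.
        conf.setProperty("mapred.map.tasks", "2");
        System.out.println("set:   " + getNumMapTasks(conf));
    }
}
```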
[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-7681: -- Attachment: 7681-94-v3.txt For TestNodeHealthCheckChore, I lifted the call to checker.init() earlier so that there is enough time for the checker to initialize (hopefully). Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt
[jira] [Commented] (HBASE-7682) Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()
[ https://issues.apache.org/jira/browse/HBASE-7682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563599#comment-13563599 ] Ted Yu commented on HBASE-7682: --- +1 on patch. Replace HRegion custom File-System debug, with FSUtils.logFileSystemState() --- Key: HBASE-7682 URL: https://issues.apache.org/jira/browse/HBASE-7682 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.96.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Trivial Fix For: 0.96.0 Attachments: HBASE-7682-v0.patch
[jira] [Commented] (HBASE-7682) Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()
[ https://issues.apache.org/jira/browse/HBASE-7682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563607#comment-13563607 ] Hadoop QA commented on HBASE-7682: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12566635/HBASE-7682-v0.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 core tests{color}. The patch passed unit tests in . 
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4199//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4199//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4199//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4199//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4199//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4199//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4199//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4199//console This message is automatically generated. Replace HRegion custom File-System debug, with FSUtils.logFileSystemState() --- Key: HBASE-7682 URL: https://issues.apache.org/jira/browse/HBASE-7682 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.96.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Trivial Fix For: 0.96.0 Attachments: HBASE-7682-v0.patch
[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-7681: -- Attachment: 7681-trunk-v1.txt Trunk patch v1 mirrors patch v3 for 0.94 Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt
[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-7681: -- Status: Patch Available (was: Open) Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt
[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563627#comment-13563627 ] Hadoop QA commented on HBASE-7681: -- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12566640/7681-trunk-v1.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:green}+1 core tests{color}. The patch passed unit tests in . 
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4200//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4200//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4200//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4200//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4200//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4200//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4200//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4200//console This message is automatically generated. Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt
[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563633#comment-13563633 ] Lars Hofhansl commented on HBASE-7681: -- Cool. I think we should include my change to TestAdmin (I've seen this problem in TestSplitTransactionOnCluster, where we'd play with splitting, etc, many failed assertions would be masked by the test hanging in the finally clause, because the table could not be deleted). Similarly I think the timeout increases I had there are also valid (in that they keep the test from looping forever, but will be less likely to interfere with a successful run). Lemme make a combined patch. Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt
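The masking effect described in this comment (a failed assertion hidden by cleanup that hangs in the finally clause) can be illustrated with plain JDK classes. This is a sketch under assumptions, not the change made to TestAdmin or TestSplitTransactionOnCluster: `BoundedCleanupSketch` and `runCleanupWithTimeout` are hypothetical names, and the "stuck table delete" is simulated with a sleep. The idea is to run cleanup on a helper thread and stop waiting after a bound, so the original failure can still propagate.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedCleanupSketch {
    /** Runs cleanup on a helper thread; returns true only if it finished in time. */
    public static boolean runCleanupWithTimeout(Runnable cleanup, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "cleanup");
            t.setDaemon(true); // a stuck cleanup must not keep the JVM alive
            return t;
        });
        Future<?> future = executor.submit(cleanup);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck cleanup and move on
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false; // cleanup itself threw; do not mask the test result
        } finally {
            executor.shutdownNow();
        }
    }

    // Simulated test body: the assertion fails, then cleanup would hang.
    static void testBody() {
        Runnable stuckDelete = () -> {
            try {
                Thread.sleep(60_000); // stands in for a table delete that never returns
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        try {
            throw new AssertionError("simulated assertion failure in the test body");
        } finally {
            if (!runCleanupWithTimeout(stuckDelete, 200)) {
                System.err.println("cleanup did not finish in time; giving up on it");
            }
        }
    }

    public static void main(String[] args) {
        try {
            testBody();
        } catch (AssertionError e) {
            System.out.println("failure surfaced: " + e.getMessage());
        }
    }
}
```

Without the bound, the finally clause would block for the full duration of the stuck cleanup and the test would be reported as a timeout rather than as the assertion failure that actually occurred.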
[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563634#comment-13563634 ] Lars Hofhansl commented on HBASE-7681: -- bq. Looks like this error may be out of our control. Yeah, I came to the same conclusion. Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt
[jira] [Commented] (HBASE-7507) Make memstore flush be able to retry after exception
[ https://issues.apache.org/jira/browse/HBASE-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563639#comment-13563639 ] Lars Hofhansl commented on HBASE-7507: -- I would like to revert this from 0.94. Unless we wrap every write to HDFS inside a retry loop we have not gained anything anyway. Please speak up if you disagree. Make memstore flush be able to retry after exception Key: HBASE-7507 URL: https://issues.apache.org/jira/browse/HBASE-7507 Project: HBase Issue Type: Bug Affects Versions: 0.94.3 Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.96.0, 0.94.5 Attachments: 7507-94.patch, 7507-trunk v1.patch, 7507-trunk v2.patch, 7507-trunkv3.patch
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563643#comment-13563643 ] Lars Hofhansl commented on HBASE-7643: -- You going to commit [~mbertozzi]? HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss --- Key: HBASE-7643 URL: https://issues.apache.org/jira/browse/HBASE-7643 Project: HBase Issue Type: Bug Affects Versions: hbase-6055, 0.96.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 0.96.0, 0.94.5 Attachments: HBASE-7653-p4-v0.patch, HBASE-7653-p4-v1.patch, HBASE-7653-p4-v2.patch, HBASE-7653-p4-v3.patch, HBASE-7653-p4-v4.patch, HBASE-7653-p4-v5.patch, HBASE-7653-p4-v6.patch, HBASE-7653-p4-v7.patch
* The master has an hfile cleaner thread (responsible for cleaning the /hbase/.archive dir)
** /hbase/.archive/table/region/family/hfile
** if the family/region/family directory is empty the cleaner removes it
* The master can archive files (from another thread, e.g. DeleteTableHandler)
* The region can archive files (from another server/process, e.g. compaction)
The simplified file archiving code looks like this:
{code}
HFileArchiver.resolveAndArchive(...) {
  // ensure that the archive dir exists
  fs.mkdir(archiveDir);

  // move the file to the archiver
  success = fs.rename(originalPath/fileName, archiveDir/fileName)

  // if the rename failed, delete the file without archiving
  if (!success) fs.delete(originalPath/fileName);
}
{code}
Since there's no synchronization between HFileArchiver.resolveAndArchive() and the cleaner run (different process, thread, ...), you can end up moving a file into a directory that doesn't exist.
{code}
fs.mkdir(archiveDir);

// HFileCleaner chore starts at this point
// and the archiveDirectory that we just ensured to be present gets removed.

// The rename at this point will fail since the parent directory is missing.
success = fs.rename(originalPath/fileName, archiveDir/fileName)
{code}
The bad thing about deleting the file without archiving it is that if you have a snapshot or a cloned table that relies on that file being present, you lose data.
Possible solutions
* Create a ZooKeeper lock, to notify the master (Hey I'm archiving something, wait a bit)
* Add a RS -> Master call to let the master remove files and avoid this kind of situation
* Avoid removing empty directories from the archive if the table exists or is not disabled
* Add a try catch around the fs.rename
The last one, the easiest one, looks like:
{code}
for (int i = 0; i < retries; ++i) {
  // ensure archive directory to be present
  fs.mkdir(archiveDir);

  // possible race here
  // try to archive file
  success = fs.rename(originalPath/fileName, archiveDir/fileName);
  if (success) break;
}
{code}
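The retry idea in the last bullet can be sketched against a local file system with `java.nio.file`. This is a minimal illustration under assumptions, not the committed HBase patch and not the Hadoop `FileSystem` API: `ArchiveRetrySketch` and `archiveWithRetry` are hypothetical names, and a missing parent directory is detected through `NoSuchFileException` rather than a boolean return from rename. The sketch recreates the archive directory before every attempt and never deletes the source when all attempts fail.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

public class ArchiveRetrySketch {
    /**
     * Moves source into archiveDir, recreating archiveDir before each attempt.
     * Returns true if the move succeeded within maxRetries attempts. The source
     * is left in place on failure so that no data is lost.
     */
    public static boolean archiveWithRetry(Path source, Path archiveDir, int maxRetries)
            throws IOException {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            // Ensure the archive directory exists; a concurrent cleaner may remove it again.
            Files.createDirectories(archiveDir);
            try {
                Files.move(source, archiveDir.resolve(source.getFileName()));
                return true;
            } catch (NoSuchFileException e) {
                // If the source itself is gone there is nothing to retry.
                if (!Files.exists(source)) {
                    throw e;
                }
                // Otherwise the archive directory vanished between the two calls; try again.
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("archive-sketch");
        Path hfile = Files.createFile(root.resolve("hfile"));
        Path archiveDir = root.resolve("archive").resolve("table").resolve("region");
        System.out.println("archived: " + archiveWithRetry(hfile, archiveDir, 3));
    }
}
```

Returning false instead of deleting the file leaves the decision to the caller, which avoids the data loss scenario described in the issue at the cost of possibly leaving an unarchived file behind.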
[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563644#comment-13563644 ] Ted Yu commented on HBASE-7681: --- A combined patch would be nice. Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt
[jira] [Commented] (HBASE-3996) Support multiple tables and scanners as input to the mapper in map/reduce jobs
[ https://issues.apache.org/jira/browse/HBASE-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563645#comment-13563645 ] Lars Hofhansl commented on HBASE-3996: -- Seems this is good to go. [~saint@gmail.com], you had some concern initially, have these all been addressed? Support multiple tables and scanners as input to the mapper in map/reduce jobs -- Key: HBASE-3996 URL: https://issues.apache.org/jira/browse/HBASE-3996 Project: HBase Issue Type: Improvement Components: mapreduce Reporter: Eran Kutner Assignee: Bryan Baugher Priority: Critical Fix For: 0.96.0, 0.94.5 Attachments: 3996-v10.txt, 3996-v11.txt, 3996-v12.txt, 3996-v13.txt, 3996-v14.txt, 3996-v2.txt, 3996-v3.txt, 3996-v4.txt, 3996-v5.txt, 3996-v6.txt, 3996-v7.txt, 3996-v8.txt, 3996-v9.txt, HBase-3996.patch It seems that in many cases feeding data from multiple tables or multiple scanners on a single table can save a lot of time when running map/reduce jobs. I propose a new MultiTableInputFormat class that would allow doing this.
[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars Hofhansl updated HBASE-7681: - Attachment: 7681-0.94-combined.txt Combined 0.94 patch (also replace assertTrue with assertEquals, which tells us the expected value upon failure) Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94-combined.txt, 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt
[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars Hofhansl updated HBASE-7681: - Status: Open (was: Patch Available) Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94-combined.txt, 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt
[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563647#comment-13563647 ] Ted Yu commented on HBASE-7681: ---
From https://builds.apache.org/job/HBase-0.94/781/testReport/org.apache.hadoop.hbase.regionserver/TestAtomicOperation/testRowMutationMultiThreads/:
{code}
2013-01-26 19:53:55,720 INFO [Thread-114] regionserver.Store(1129): Completed major compaction of 4 file(s) in colfamily11 of testtable,,1359230031766.0e4657859f8ff751b545ee6dfd92446e. into 29f808fd896d4b78ae69c40e36ab2f1e, size=731.0; total size for store is 731.0
2013-01-26 19:53:55,721 DEBUG [Thread-109] regionserver.TestAtomicOperation$1(307): keyvalues={rowA/colfamily11:qual1/3830/Put/vlen=6/ts=4062, rowA/colfamily11:qual2/3838/Put/vlen=6/ts=0}
Exception in thread Thread-109 junit.framework.AssertionFailedError
  at junit.framework.Assert.fail(Assert.java:48)
  at junit.framework.Assert.fail(Assert.java:56)
  at org.apache.hadoop.hbase.regionserver.TestAtomicOperation$1.run(TestAtomicOperation.java:309)
{code}
This was due to more than one column being visible:
{code}
// check: should always see exactly one column
Get g = new Get(row);
Result r = region.get(g, null);
if (r.size() != 1) {
  LOG.debug(r);
  failures.incrementAndGet();
  fail();
{code}
Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94-combined.txt, 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt
[jira] [Updated] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matteo Bertozzi updated HBASE-7643: --- Resolution: Fixed Release Note: committed to trunk and 0.94, thanks guys for the review! Status: Resolved (was: Patch Available)
HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss --- Key: HBASE-7643 URL: https://issues.apache.org/jira/browse/HBASE-7643 Project: HBase Issue Type: Bug Affects Versions: hbase-6055, 0.96.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 0.96.0, 0.94.5 Attachments: HBASE-7653-p4-v0.patch, HBASE-7653-p4-v1.patch, HBASE-7653-p4-v2.patch, HBASE-7653-p4-v3.patch, HBASE-7653-p4-v4.patch, HBASE-7653-p4-v5.patch, HBASE-7653-p4-v6.patch, HBASE-7653-p4-v7.patch
* The master has an hfile cleaner thread (that is responsible for cleaning the /hbase/.archive dir)
** /hbase/.archive/table/region/family/hfile
** if the family/region/family directory is empty the cleaner removes it
* The master can archive files (from another thread, e.g. DeleteTableHandler)
* The region can archive files (from another server/process, e.g. compaction)
The simplified file archiving code looks like this:
{code}
HFileArchiver.resolveAndArchive(...) {
  // ensure that the archive dir exists
  fs.mkdir(archiveDir);
  // move the file to the archive
  success = fs.rename(originalPath/fileName, archiveDir/fileName);
  // if the rename failed, delete the file without archiving
  if (!success) fs.delete(originalPath/fileName);
}
{code}
Since there's no synchronization between HFileArchiver.resolveAndArchive() and the cleaner run (different process, thread, ...) you can end up in a situation where you are moving something into a directory that doesn't exist.
{code}
fs.mkdir(archiveDir);
// HFileCleaner chore starts at this point
// and the archiveDirectory that we just ensured to be present gets removed.
// The rename at this point will fail since the parent directory is missing.
success = fs.rename(originalPath/fileName, archiveDir/fileName);
{code}
The problem with deleting the file without archiving is that if you have a snapshot that relies on the file being present, or a cloned table that relies on that file, you lose data.
Possible solutions
* Create a ZooKeeper lock, to notify the master (Hey I'm archiving something, wait a bit)
* Add a RS -> Master call to let the master remove files and avoid this kind of situation
* Avoid removing empty directories from the archive if the table exists or is not disabled
* Add a try catch around the fs.rename
The last one, the easiest one, looks like:
{code}
for (int i = 0; i < retries; ++i) {
  // ensure archive directory to be present
  fs.mkdir(archiveDir);
  // possible race here
  // try to archive file
  success = fs.rename(originalPath/fileName, archiveDir/fileName);
  if (success) break;
}
{code}
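The retry idea at the end of the HBASE-7643 report above can be sketched in plain Java. This is only an illustration under stated assumptions: it uses java.nio.file on a local filesystem rather than the Hadoop FileSystem API, and the class and method names (ArchiveRetrySketch, archiveWithRetry, demo) are hypothetical, not taken from the committed patch. The point it shows is that the directory is re-created before every attempt and that the source file is never deleted when the move fails.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

/** Hypothetical sketch; not the HBase HFileArchiver implementation. */
final class ArchiveRetrySketch {

    /**
     * Tries to move {@code file} into {@code archiveDir}, re-creating the
     * directory before every attempt. Returns true if the move succeeded.
     * The source file is left in place (never deleted) on failure.
     */
    static boolean archiveWithRetry(Path file, Path archiveDir, int retries) throws IOException {
        for (int i = 0; i < retries; ++i) {
            // Ensure the archive directory is present; a concurrent cleaner may have removed it.
            Files.createDirectories(archiveDir);
            try {
                // Possible race: the directory can disappear between the two calls.
                Files.move(file, archiveDir.resolve(file.getFileName()));
                return true;
            } catch (NoSuchFileException e) {
                // A path involved in the move was missing (for example the parent
                // directory vanished): loop, re-create the directory, and try again.
            }
        }
        return false;
    }

    /** Small self-check on a temporary directory; returns true if the file ended up archived. */
    static boolean demo() {
        try {
            Path root = Files.createTempDirectory("archive-sketch");
            Path file = Files.createFile(root.resolve("hfile-0001"));
            Path archiveDir = root.resolve("archive").resolve("table").resolve("region").resolve("family");
            boolean moved = archiveWithRetry(file, archiveDir, 3);
            return moved && Files.exists(archiveDir.resolve("hfile-0001")) && !Files.exists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("archived: " + demo());
    }
}
```

Retrying narrows the race window but does not close it; the lock-based and master-mediated options listed in the report are the ones that would remove it.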
[jira] [Updated] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matteo Bertozzi updated HBASE-7643: --- Release Note: (was: committed to trunk and 0.94, thanks guys for the review!) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss --- Key: HBASE-7643 URL: https://issues.apache.org/jira/browse/HBASE-7643 Project: HBase Issue Type: Bug Affects Versions: hbase-6055, 0.96.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 0.96.0, 0.94.5 Attachments: HBASE-7653-p4-v0.patch, HBASE-7653-p4-v1.patch, HBASE-7653-p4-v2.patch, HBASE-7653-p4-v3.patch, HBASE-7653-p4-v4.patch, HBASE-7653-p4-v5.patch, HBASE-7653-p4-v6.patch, HBASE-7653-p4-v7.patch
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563648#comment-13563648 ] Matteo Bertozzi commented on HBASE-7643: committed to trunk and 0.94, thanks guys for the review! HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss --- Key: HBASE-7643 URL: https://issues.apache.org/jira/browse/HBASE-7643 Project: HBase Issue Type: Bug Affects Versions: hbase-6055, 0.96.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 0.96.0, 0.94.5 Attachments: HBASE-7653-p4-v0.patch, HBASE-7653-p4-v1.patch, HBASE-7653-p4-v2.patch, HBASE-7653-p4-v3.patch, HBASE-7653-p4-v4.patch, HBASE-7653-p4-v5.patch, HBASE-7653-p4-v6.patch, HBASE-7653-p4-v7.patch
[jira] [Updated] (HBASE-7682) Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()
[ https://issues.apache.org/jira/browse/HBASE-7682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matteo Bertozzi updated HBASE-7682: --- Resolution: Fixed Status: Resolved (was: Patch Available) committed to trunk since it's a straightforward patch Replace HRegion custom File-System debug, with FSUtils.logFileSystemState() --- Key: HBASE-7682 URL: https://issues.apache.org/jira/browse/HBASE-7682 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.96.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Trivial Fix For: 0.96.0 Attachments: HBASE-7682-v0.patch We can remove some code from HRegion by using the new FSUtils.logFileSystemState() instead of the custom HRegion.listPaths()
[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars Hofhansl updated HBASE-7681: - Attachment: 7681-0.94-combined_v2.txt Slightly better 0.94 patch (fixes a blank line, and I had forgotten to change a timeout from 1000 to 1500) Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt
[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars Hofhansl updated HBASE-7681: - Attachment: 7681-0.96-combined.txt And the 0.96 version Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt
[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars Hofhansl updated HBASE-7681: - Status: Patch Available (was: Open) Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt
[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563655#comment-13563655 ] Lars Hofhansl commented on HBASE-7681: -- Yeah, that one (TestAtomicOperation) has me worried a bit. That one has not failed in a *very* long time. And now we've had two failures within a few days. Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt
[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563656#comment-13563656 ] Lars Hofhansl commented on HBASE-7681: -- The only recent change that could affect that is HBASE-7507. Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt
[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars Hofhansl updated HBASE-7681: - Attachment: 7681-0.96-combined.txt Hadoop QA is not starting. Attaching again Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt
[jira] [Commented] (HBASE-7660) Remove HFileV1 code
[ https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563659#comment-13563659 ] Matt Corgan commented on HBASE-7660: I think for this jira we'd only want to remove HFileReaderV1, HFileWriterV1, and FSReaderV1, the main goal being to forcibly drop support for them before branching 0.96. It would leave only one implementation of the interfaces and abstract classes. Later, we can go back and look at the purpose of the interfaces and abstract classes. I'm guessing they are partly designed specifically as a shim layer between v1/v2. A shim layer between v2 and a future unknown v3 may have completely different needs. Getting out of the scope of this jira: we may want to flatten the interface, abstract class, and implementation to make it easier to clean up the Store code, and then pull out a different interface in the future when we actually have a use case for it.
Remove HFileV1 code --- Key: HBASE-7660 URL: https://issues.apache.org/jira/browse/HBASE-7660 Project: HBase Issue Type: Improvement Components: hbck, HFile, migration Reporter: Matt Corgan Fix For: 0.96.0
HFileV1 should be removed from the regionserver because it is somewhat of a drag on development for working on the lower level read paths. It's an impediment to cleaning up the Store code. V1 HFiles ceased to be written in 0.92, but the V1 reader was left in place so users could upgrade from 0.90 to 0.92. Once all HFiles are compacted in 0.92, then the V1 code is no longer needed. We then decided to leave the V1 code in place in 0.94 so users could upgrade directly from 0.90 to 0.94. The code is still there in trunk but should probably be shown the door. I see a few options:
1) just delete the code and tell people to make sure they compact everything using 0.92 or 0.94
2) create a standalone script that people can run on their 0.92 or 0.94 cluster that iterates the filesystem and prints out any v1 files with a message that the user should run a major compaction
3) add functionality to 0.96.0 (first release, maybe in hbck) that proactively kills v1 files, so that we can be sure there are none when upgrading from 0.96 to 0.98
4) punt to 0.98 and probably do one of the above options in a year
I would vote for #1 or #2, which will allow us to have a v1-free 0.96.0. HFileV1 has already survived 2 major release upgrades, which I think many would agree is more than enough for a pre-1.0, free product. If we can remove it in 0.96.0 it will be out of the way to introduce some nice performance improvements in subsequent 0.96.x releases.
[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563660#comment-13563660 ] Ted Yu commented on HBASE-7681: --- Combined patch looks good.
{code}
- waitForCounter(SplitLogCounters.tot_wkr_preempt_task, 0, 1, 1000);
+ waitForCounter(SplitLogCounters.tot_wkr_preempt_task, 0, 1, 1500);
- waitForCounter(SplitLogCounters.tot_wkr_task_acquired, 1, 2, 1000);
+ waitForCounter(SplitLogCounters.tot_wkr_task_acquired, 1, 2, 1500);
{code}
Next time we touch TestSplitLogWorker.java, maybe introduce a constant for the timeout so that we don't need to change it in 12 places.
Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.94.5 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt
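The suggestion above (one constant for the timeout instead of a literal repeated at every call site) might look roughly like the following. This is a standalone sketch, not code from TestSplitLogWorker.java: the class name, the constant name WAIT_TIME_MS, and this simplified waitForCounter are assumptions made for illustration, with 1500 taken from the patch excerpt.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of sharing one timeout constant across waitForCounter calls. */
final class WaitForCounterSketch {

    /** Single place to tune the wait; 1500 ms is the value used in the patch excerpt. */
    static final long WAIT_TIME_MS = 1500;

    /**
     * Polls while the counter still holds oldVal and the timeout has not elapsed,
     * then reports whether the counter reached newVal.
     */
    static boolean waitForCounter(AtomicLong ctr, long oldVal, long newVal, long timeoutMs) {
        long start = System.nanoTime();
        long timeoutNanos = timeoutMs * 1_000_000L;
        while (ctr.get() == oldVal && System.nanoTime() - start < timeoutNanos) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return ctr.get() == newVal;
    }

    /** A worker bumps the counter after a short delay; the wait uses the shared constant. */
    static boolean demo() {
        AtomicLong acquired = new AtomicLong(0);
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                return;
            }
            acquired.incrementAndGet();
        });
        worker.start();
        return waitForCounter(acquired, 0, 1, WAIT_TIME_MS);
    }

    public static void main(String[] args) {
        System.out.println("counter reached expected value: " + demo());
    }
}
```

With every call site passing WAIT_TIME_MS, a later bump of the timeout becomes a one-line change.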
[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-7681: -- Fix Version/s: 0.96.0 Hadoop Flags: Reviewed Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 0.96.0, 0.94.5 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt
[jira] [Commented] (HBASE-7660) Remove HFileV1 code
[ https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563663#comment-13563663 ] Matteo Bertozzi commented on HBASE-7660: I agree that the old code limits the evolution. Is there a way to move the code out without too much work, and put it in something like hbase-support/hfile-v1-to-v2-tool that can be used as migration tool? Remove HFileV1 code --- Key: HBASE-7660 URL: https://issues.apache.org/jira/browse/HBASE-7660 Project: HBase Issue Type: Improvement Components: hbck, HFile, migration Reporter: Matt Corgan Fix For: 0.96.0
[jira] [Commented] (HBASE-7660) Remove HFileV1 code
[ https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563666#comment-13563666 ] Ted Yu commented on HBASE-7660: --- Looking at HFile.Reader, most methods don't have javadoc, leaving interpretation of the interface to each developer. In the process of creating the Cell interface, we have avoided such a mistake. bq. put it in something like hbase-support/hfile-v1-to-v2-tool that can be used as migration tool? Some users are using 0.94.x and, to my knowledge, there was no request for creating a migration tool. Remove HFileV1 code --- Key: HBASE-7660 URL: https://issues.apache.org/jira/browse/HBASE-7660 Project: HBase Issue Type: Improvement Components: hbck, HFile, migration Reporter: Matt Corgan Fix For: 0.96.0
[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563668#comment-13563668 ] Ted Yu commented on HBASE-7681: --- Looking at https://issues.apache.org/jira/secure/attachment/12565724/7507-94.patch :
{code}
+        } catch (Exception e) {
+          LOG.warn("Failed validating store file " + pathName
+              + ", retring num=" + i, e);
+          if (e instanceof IOException) {
+            lastException = (IOException) e;
+          } else {
+            lastException = new IOException(e);
+          }
+        }
+      } catch (IOException e) {
+        LOG.warn("Failed flushing store file, retring num=" + i, e);
{code}
I searched for the two strings which start with 'Failed ' in https://builds.apache.org/job/HBase-0.94/781/testReport/org.apache.hadoop.hbase.regionserver/TestAtomicOperation/testRowMutationMultiThreads/ but didn't find any occurrence.
Investigate recent random test failures in 0.94 --- Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 0.96.0, 0.94.5 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt
I've seen many unspecific test failures recently that cannot be reproduced locally even when running these tests in a loop for a very long time. Many of these tests one way or the other make assumptions w.r.t. wall clock time. While I cannot fix that, an option is to increase some of these timeouts a bit. This issue is to remind me to do that.
[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563669#comment-13563669 ] Ted Yu commented on HBASE-7681: --- TestAtomicOperation failed in trunk build #3806 Strangely TestAtomicOperation didn't seem to fail on QA machines.
[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563671#comment-13563671 ] Lars Hofhansl commented on HBASE-7681: -- I'm looping TestAtomicOperation on my machine right now. After 45mins no failure, yet. I too looked for these messages and didn't see them, but that changes the validation around (and I've seen extremely subtle failure conditions in this code before)
[jira] [Commented] (HBASE-7682) Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()
[ https://issues.apache.org/jira/browse/HBASE-7682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563674#comment-13563674 ] Hudson commented on HBASE-7682: --- Integrated in HBase-TRUNK #3806 (See [https://builds.apache.org/job/HBase-TRUNK/3806/]) HBASE-7682 Replace HRegion custom File-System debug, with FSUtils.logFileSystemState() (Revision 1438974) Result = FAILURE mbertozzi : Files : * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java Replace HRegion custom File-System debug, with FSUtils.logFileSystemState() --- Key: HBASE-7682 URL: https://issues.apache.org/jira/browse/HBASE-7682 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.96.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Trivial Fix For: 0.96.0 Attachments: HBASE-7682-v0.patch We can remove some code from HRegion by using the new FSUtils.logFileSystemState() instead of the custom HRegion.listPaths() -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563675#comment-13563675 ] Hudson commented on HBASE-7643: --- Integrated in HBase-TRUNK #3806 (See [https://builds.apache.org/job/HBase-TRUNK/3806/]) HBASE-7643 HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss (Revision 1438972) Result = FAILURE mbertozzi : Files : * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/backup/HFileArchiver.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/backup/TestHFileArchiving.java
HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss --- Key: HBASE-7643 URL: https://issues.apache.org/jira/browse/HBASE-7643 Project: HBase Issue Type: Bug Affects Versions: hbase-6055, 0.96.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 0.96.0, 0.94.5 Attachments: HBASE-7653-p4-v0.patch, HBASE-7653-p4-v1.patch, HBASE-7653-p4-v2.patch, HBASE-7653-p4-v3.patch, HBASE-7653-p4-v4.patch, HBASE-7653-p4-v5.patch, HBASE-7653-p4-v6.patch, HBASE-7653-p4-v7.patch
* The master has an hfile cleaner thread (that is responsible for cleaning the /hbase/.archive dir)
** /hbase/.archive/table/region/family/hfile
** if the family/region/family directory is empty the cleaner removes it
* The master can archive files (from another thread, e.g. DeleteTableHandler)
* The region can archive files (from another server/process, e.g. compaction)
The simplified file archiving code looks like this:
{code}
HFileArchiver.resolveAndArchive(...) {
  // ensure that the archive dir exists
  fs.mkdir(archiveDir);
  // move the file to the archiver
  success = fs.rename(originalPath/fileName, archiveDir/fileName)
  // if the rename failed, delete the file without archiving
  if (!success) fs.delete(originalPath/fileName);
}
{code}
Since there's no synchronization between HFileArchiver.resolveAndArchive() and the cleaner run (different process, thread, ...), you can end up in a situation where you are moving something into a directory that doesn't exist.
{code}
fs.mkdir(archiveDir);
// HFileCleaner chore starts at this point
// and the archiveDirectory that we just ensured to be present gets removed.
// The rename at this point will fail since the parent directory is missing.
success = fs.rename(originalPath/fileName, archiveDir/fileName)
{code}
The problem with deleting the file without archiving is that if you have a snapshot or a cloned table that relies on that file being present, you lose data.
Possible solutions
* Create a ZooKeeper lock, to notify the master (Hey I'm archiving something, wait a bit)
* Add an RS -> Master call to let the master remove files and avoid this kind of situation
* Avoid removing empty directories from the archive if the table exists or is not disabled
* Add a try catch around the fs.rename
The last one, the easiest one, looks like:
{code}
for (int i = 0; i < retries; ++i) {
  // ensure archive directory to be present
  fs.mkdir(archiveDir);
  // possible race <-
  // try to archive file
  success = fs.rename(originalPath/fileName, archiveDir/fileName);
  if (success) break;
}
{code}
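The retry loop proposed as the last option above can be sketched as runnable Java. This is only an illustration under stated assumptions, not the committed patch: it uses java.nio.file on a local filesystem instead of the Hadoop FileSystem API, and the names ArchiveRetrySketch, archiveWithRetries and the retry count of 3 are hypothetical. The sketch also chooses to leave the source file in place when every attempt fails, rather than deleting it unarchived.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical illustration of "mkdir, then rename, retry on failure".
class ArchiveRetrySketch {

    // Returns true once the file has been moved into archiveDir,
    // false if all attempts failed (the source file is left untouched).
    static boolean archiveWithRetries(Path source, Path archiveDir, int retries)
            throws IOException {
        Path target = archiveDir.resolve(source.getFileName());
        for (int i = 0; i < retries; ++i) {
            // Ensure the archive directory is present; a concurrent cleaner
            // may have removed it since the previous attempt.
            Files.createDirectories(archiveDir);
            try {
                // Possible race: the directory can vanish right here.
                Files.move(source, target);
                return true;
            } catch (NoSuchFileException e) {
                // If the source itself is gone, retrying cannot help.
                if (Files.notExists(source)) {
                    throw e;
                }
                // Otherwise the parent directory was removed; try again.
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("archive-sketch");
        Path hfile = Files.createFile(tmp.resolve("hfile"));
        Path archiveDir = tmp.resolve("archive").resolve("family");
        boolean archived = archiveWithRetries(hfile, archiveDir, 3);
        System.out.println("archived=" + archived);
    }
}
```

A caller that gets false back can log the failure and keep the original file, so a snapshot or cloned table that references it still finds it. Retrying only narrows the race window; it does not add synchronization with the cleaner.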
[jira] [Commented] (HBASE-7660) Remove HFileV1 code
[ https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563676#comment-13563676 ] Matt Corgan commented on HBASE-7660: Should have mentioned in the description: One could probably reason that if someone is using 0.90.x right now, they are looking for max stability and will not jump straight to 0.96, but rather go to 0.94.5+ first. Regarding distributing the script: We could bundle the HFileV1-scanner script in future 0.94 version as well as make it downloadable to hbase/bin from the hbase website for people on 0.92 and earlier 0.94 versions. {quote}Is there a way to move the code out without too much work, and put it in something like hbase-support/hfile-v1-to-v2-tool that can be used as migration tool?{quote}Sounds like a good idea, though I wonder if anyone should spend time creating it before anyone actually requests it.
[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563678#comment-13563678 ] Hadoop QA commented on HBASE-7681: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12566648/7681-0.96-combined.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 27 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:red}-1 core tests{color}. 
The patch failed these unit tests: org.apache.hadoop.hbase.master.TestZKBasedOpenCloseRegion Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4201//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4201//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4201//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4201//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4201//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4201//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4201//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4201//console This message is automatically generated.
[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563681#comment-13563681 ] Lars Hofhansl commented on HBASE-7681: -- Test failure's unrelated. Good to commit, Ted?
[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563682#comment-13563682 ] Lars Hofhansl commented on HBASE-7681: -- TestAtomicOperation still looping, btw.
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563683#comment-13563683 ] Hudson commented on HBASE-7643: --- Integrated in HBase-0.94 #782 (See [https://builds.apache.org/job/HBase-0.94/782/]) HBASE-7643 HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss (Revision 1438973) Result = FAILURE mbertozzi : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/backup/HFileArchiver.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/backup/TestHFileArchiving.java
[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563686#comment-13563686 ] Ted Yu commented on HBASE-7681: --- Go ahead, Lars.
[jira] [Commented] (HBASE-7682) Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()
[ https://issues.apache.org/jira/browse/HBASE-7682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563690#comment-13563690 ] Hudson commented on HBASE-7682: --- Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #377 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/377/]) HBASE-7682 Replace HRegion custom File-System debug, with FSUtils.logFileSystemState() (Revision 1438974) Result = FAILURE mbertozzi : Files : * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563691#comment-13563691 ] Hudson commented on HBASE-7643: --- Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #377 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/377/]) HBASE-7643 HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss (Revision 1438972) Result = FAILURE mbertozzi : Files : * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/backup/HFileArchiver.java * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/backup/TestHFileArchiving.java
[jira] [Updated] (HBASE-7681) Investigate recent random test failures
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars Hofhansl updated HBASE-7681: - Summary: Investigate recent random test failures (was: Investigate recent random test failures in 0.94)
[jira] [Updated] (HBASE-7681) Address some recent random test failures
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars Hofhansl updated HBASE-7681: - Summary: Address some recent random test failures (was: Investigate recent random test failures)
[jira] [Commented] (HBASE-7681) Address some recent random test failures
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563693#comment-13563693 ] Himanshu Vashishtha commented on HBASE-7681: Since we are talking about test case health here, I was looking at TestRegionServerMetrics a while ago. It opens a bunch of HTable instances in a number of test methods and doesn't close them. Is it possible to fold that in here too, or should it be a separate jira? It will be a trivial fix.
[jira] [Commented] (HBASE-7681) Address some recent random test failures
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563694#comment-13563694 ] Lars Hofhansl commented on HBASE-7681: Hrmm... Just committed. Let's do an addendum here.
[jira] [Updated] (HBASE-7681) Address some recent random test failures
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars Hofhansl updated HBASE-7681: Status: Open (was: Patch Available)
[jira] [Updated] (HBASE-7681) Address some recent random test failures
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars Hofhansl updated HBASE-7681: Attachment: 7681-0.96-addendum.txt, 7681-0.94-addendum.txt Something like this, Himanshu?
[jira] [Commented] (HBASE-7660) Remove HFileV1 code
[ https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563697#comment-13563697 ] Andrew Purtell commented on HBASE-7660:

bq. I think for this jira we'd only want to remove HFileReaderV1, HFileWriterV1, and FSReaderV1. The main goal being to forcibly drop support for them before branching 0.96.

This sounds like a good idea. I was worried about more substantial changes because I am currently working in HFile V2. That said, with V1 gone we can simplify and remove code in this area over time, which would be a good thing.

bq. hbase-support/hfile-v1-to-v2-tool that can be used as migration tool

This would seem to work against the goal of removing the old code. I think anyone upgrading to 0.96 would be doing so from 0.92 or 0.94, which auto-upgrade HFiles. We could document in the release notes that the user should run a major compaction before upgrading.

Remove HFileV1 code
Key: HBASE-7660
URL: https://issues.apache.org/jira/browse/HBASE-7660
Project: HBase
Issue Type: Improvement
Components: hbck, HFile, migration
Reporter: Matt Corgan
Fix For: 0.96.0

HFileV1 should be removed from the regionserver because it is somewhat of a drag on development for working on the lower level read paths. It's an impediment to cleaning up the Store code. V1 HFiles ceased to be written in 0.92, but the V1 reader was left in place so users could upgrade from 0.90 to 0.92. Once all HFiles are compacted in 0.92, the V1 code is no longer needed. We then decided to leave the V1 code in place in 0.94 so users could upgrade directly from 0.90 to 0.94. The code is still there in trunk but should probably be shown the door.

I see a few options:
1) just delete the code and tell people to make sure they compact everything using 0.92 or 0.94
2) create a standalone script that people can run on their 0.92 or 0.94 cluster that iterates the filesystem and prints out any v1 files with a message that the user should run a major compaction
3) add functionality to 0.96.0 (first release, maybe in hbck) that proactively kills v1 files, so that we can be sure there are none when upgrading from 0.96 to 0.98
4) punt to 0.98 and probably do one of the above options in a year

I would vote for #1 or #2, which will allow us to have a v1-free 0.96.0. HFileV1 has already survived 2 major release upgrades, which I think many would agree is more than enough for a pre-1.0, free product. If we can remove it in 0.96.0, it will be out of the way to introduce some nice performance improvements in subsequent 0.96.x releases.
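Option 2 in the description above (a standalone scan that reports v1 files) could look roughly like the sketch below. It rests on several assumptions that are not stated in this thread: that an HFile ends with a 4-byte big-endian version field whose low three bytes hold the major version, that the files sit on a local filesystem (real store files live in HDFS and would need the Hadoop filesystem API instead), and that every regular file under the root is a candidate. The class `V1FileScanSketch` and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical sketch; the trailer layout used here is an assumption, see the lead-in.
public class V1FileScanSketch {

    /** Returns the assumed major version stored in the last 4 bytes, or -1 if the file is too short. */
    public static int majorVersion(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            if (raf.length() < 4) {
                return -1;
            }
            raf.seek(raf.length() - 4);
            // readInt is big-endian; ASSUMPTION: low 3 bytes are the major version.
            return raf.readInt() & 0x00ffffff;
        }
    }

    /** Walks the tree under root and collects every regular file whose assumed major version is 1. */
    public static List<Path> findV1Files(Path root) throws IOException {
        List<Path> hits = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(root)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (Files.isRegularFile(p) && majorVersion(p) == 1) {
                    hits.add(p);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: V1FileScanSketch <root-directory>");
            return;
        }
        for (Path p : findV1Files(Paths.get(args[0]))) {
            System.out.println(p + " looks like a v1 file; run a major compaction before upgrading");
        }
    }
}
```

A real tool would also need to skip non-HFile files such as logs and region metadata, which this sketch does not attempt.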
[jira] [Commented] (HBASE-7681) Address some recent random test failures
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563698#comment-13563698 ] Lars Hofhansl commented on HBASE-7681: [~v.himanshu]? :)
[jira] [Updated] (HBASE-7660) Remove HFileV1 code
[ https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-7660: Attachment: 7660-v1.txt
[jira] [Assigned] (HBASE-7660) Remove HFileV1 code
[ https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu reassigned HBASE-7660: Assignee: Ted Yu
[jira] [Updated] (HBASE-7660) Remove HFileV1 code
[ https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-7660: Status: Patch Available (was: Open)
[jira] [Commented] (HBASE-7660) Remove HFileV1 code
[ https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563705#comment-13563705 ] Hadoop QA commented on HBASE-7660:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12566651/7660-v1.txt against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 6 new or modified tests.
{color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:red}-1 core tests{color}. The patch failed these unit tests:
org.apache.hadoop.hbase.regionserver.TestStoreFile
org.apache.hadoop.hbase.regionserver.TestSplitTransaction
org.apache.hadoop.hbase.io.hfile.TestHFileReaderV1
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4202//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4202//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4202//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4202//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4202//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4202//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4202//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4202//console
This message is automatically generated.
[jira] [Commented] (HBASE-7681) Address some recent random test failures
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563706#comment-13563706 ] Hudson commented on HBASE-7681:
Integrated in HBase-0.94 #783 (See [https://builds.apache.org/job/HBase-0.94/783/])
HBASE-7681 Address some recent random test failures (Revision 1439004)
Result = FAILURE
larsh :
Files :
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/TestNodeHealthCheckChore.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/client/TestAdmin.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/coprocessor/SimpleRegionObserver.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/TestLruBlockCache.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitLogWorker.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/thrift/TestThriftServerCmdLine.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/util/TestMiniClusterLoadSequential.java
[jira] [Updated] (HBASE-7660) Remove HFileV1 code
[ https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-7660: Attachment: 7660-v2.txt [~stack]: Can you provide your opinion?
[jira] [Commented] (HBASE-7681) Address some recent random test failures
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563710#comment-13563710 ] Hudson commented on HBASE-7681:
Integrated in HBase-TRUNK #3807 (See [https://builds.apache.org/job/HBase-TRUNK/3807/])
HBASE-7681 Address some recent random test failures (Revision 1439003)
Result = FAILURE
larsh :
Files :
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/TestNodeHealthCheckChore.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestAdmin.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/SimpleRegionObserver.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestLruBlockCache.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitLogWorker.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/thrift/TestThriftServerCmdLine.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestMiniClusterLoadSequential.java
[jira] [Commented] (HBASE-7660) Remove HFileV1 code
[ https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563718#comment-13563718 ] Hadoop QA commented on HBASE-7660:
{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12566653/7660-v2.txt against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 21 new or modified tests.
{color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 core tests{color}. The patch passed unit tests in .
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4203//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4203//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4203//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4203//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4203//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4203//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4203//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4203//console
This message is automatically generated.
[jira] [Commented] (HBASE-7681) Address some recent random test failures
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563719#comment-13563719 ] Himanshu Vashishtha commented on HBASE-7681: [~lhofhansl] Yea. Sorry for the delay, I was away.
[jira] [Commented] (HBASE-7660) Remove HFileV1 code
[ https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563724#comment-13563724 ] Matt Corgan commented on HBASE-7660: Looks good to me Ted. Thanks for spending time on it. Only comment is that maybe you should preserve the constant HFileWriterV1.BLOOM_FILTER_DATA_KEY. Unless you had a specific reason, might be better to move it to HFileWriterV2 instead of the following: {code} - bloom = reader.getMetaBlock(HFileWriterV1.BLOOM_FILTER_DATA_KEY, + bloom = reader.getMetaBlock(BLOOM_FILTER_DATA, true); {code} Remove HFileV1 code --- Key: HBASE-7660 URL: https://issues.apache.org/jira/browse/HBASE-7660 Project: HBase Issue Type: Improvement Components: hbck, HFile, migration Reporter: Matt Corgan Assignee: Ted Yu Fix For: 0.96.0 Attachments: 7660-v1.txt, 7660-v2.txt HFileV1 should be removed from the regionserver because it is somewhat of a drag on development for working on the lower level read paths. It's an impediment to cleaning up the Store code. V1 HFiles ceased to be written in 0.92, but the V1 reader was left in place so users could upgrade from 0.90 to 0.92. Once all HFiles are compacted in 0.92, then the V1 code is no longer needed. We then decided to leave the V1 code in place in 0.94 so users could upgrade directly from 0.90 to 0.94. The code is still there in trunk but should probably be shown the door. 
I see a few options:
1) just delete the code and tell people to make sure they compact everything using 0.92 or 0.94
2) create a standalone script that people can run on their 0.92 or 0.94 cluster that iterates the filesystem and prints out any v1 files with a message that the user should run a major compaction
3) add functionality to 0.96.0 (first release, maybe in hbck) that proactively kills v1 files, so that we can be sure there are none when upgrading from 0.96 to 0.98
4) punt to 0.98 and probably do one of the above options in a year
I would vote for #1 or #2, which will allow us to have a v1-free 0.96.0. HFileV1 has already survived 2 major release upgrades, which I think many would agree is more than enough for a pre-1.0, free product. If we can remove it in 0.96.0, it will be out of the way to introduce some nice performance improvements in subsequent 0.96.x releases.
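Option 2 above could be sketched roughly as follows. This is only a minimal illustration, not HBase code: it walks a local directory with java.nio instead of HDFS, and it assumes that the HFile major version is stored in the low three bytes of the final 4-byte big-endian integer of the file (an assumption about the trailer layout that should be checked against FixedFileTrailer before relying on it). The names HFileV1ScanSketch, majorVersion and findV1Files are hypothetical, and no magic-number check is done, so any file ending in that byte pattern would be reported.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class HFileV1ScanSketch {

    /**
     * Reads the last four bytes of the file as a big-endian int and returns
     * its low three bytes. ASSUMPTION: that is where the major version lives.
     * Returns -1 for files too short to hold a version field.
     */
    static int majorVersion(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long length = raf.length();
            if (length < 4) {
                return -1;
            }
            raf.seek(length - 4);
            return raf.readInt() & 0x00ffffff;
        }
    }

    /** Walks the tree under root and collects every regular file that looks like v1. */
    static List<Path> findV1Files(Path root) throws IOException {
        List<Path> candidates;
        try (Stream<Path> paths = Files.walk(root)) {
            candidates = paths.filter(p -> Files.isRegularFile(p))
                              .collect(Collectors.toList());
        }
        List<Path> v1Files = new ArrayList<>();
        for (Path p : candidates) {
            if (majorVersion(p) == 1) {
                v1Files.add(p);
            }
        }
        return v1Files;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : findV1Files(root)) {
            System.out.println("v1 file found, please run a major compaction: " + p);
        }
    }
}
```

A real tool would go through the Hadoop FileSystem API and restrict the walk to store directories; the sketch only shows the iterate-and-report shape of the option.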
[jira] [Updated] (HBASE-7660) Remove HFileV1 code
[ https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-7660: -- Attachment: 7660-v3.txt Remove HFileV1 code --- Key: HBASE-7660 URL: https://issues.apache.org/jira/browse/HBASE-7660 Project: HBase Issue Type: Improvement Components: hbck, HFile, migration Reporter: Matt Corgan Assignee: Ted Yu Fix For: 0.96.0 Attachments: 7660-v1.txt, 7660-v2.txt, 7660-v3.txt HFileV1 should be removed from the regionserver because it is somewhat of a drag on development for working on the lower level read paths. It's an impediment to cleaning up the Store code. V1 HFiles ceased to be written in 0.92, but the V1 reader was left in place so users could upgrade from 0.90 to 0.92. Once all HFiles are compacted in 0.92, then the V1 code is no longer needed. We then decided to leave the V1 code in place in 0.94 so users could upgrade directly from 0.90 to 0.94. The code is still there in trunk but should probably be shown the door.
I see a few options:
1) just delete the code and tell people to make sure they compact everything using 0.92 or 0.94
2) create a standalone script that people can run on their 0.92 or 0.94 cluster that iterates the filesystem and prints out any v1 files with a message that the user should run a major compaction
3) add functionality to 0.96.0 (first release, maybe in hbck) that proactively kills v1 files, so that we can be sure there are none when upgrading from 0.96 to 0.98
4) punt to 0.98 and probably do one of the above options in a year
I would vote for #1 or #2, which will allow us to have a v1-free 0.96.0. HFileV1 has already survived 2 major release upgrades, which I think many would agree is more than enough for a pre-1.0, free product. If we can remove it in 0.96.0, it will be out of the way to introduce some nice performance improvements in subsequent 0.96.x releases.
[jira] [Commented] (HBASE-7681) Address some recent random test failures
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563731#comment-13563731 ] Hudson commented on HBASE-7681: --- Integrated in HBase-0.94-security #99 (See [https://builds.apache.org/job/HBase-0.94-security/99/]) HBASE-7681 Address some recent random test failures (Revision 1439004) Result = FAILURE larsh : Files : * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/TestNodeHealthCheckChore.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/client/TestAdmin.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/coprocessor/SimpleRegionObserver.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/TestLruBlockCache.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitLogWorker.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/thrift/TestThriftServerCmdLine.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/util/TestMiniClusterLoadSequential.java Address some recent random test failures Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 0.96.0, 0.94.5 Attachments: 7681-0.94-addendum.txt, 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-addendum.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt I've seen many unspecific test failures recently that cannot be reproduced locally even when running these tests in a loop for a very long time. Many of these tests, one way or the other, make assumptions w.r.t. wall clock time. While I cannot fix that, one option is to increase some of these timeouts a bit. This issue is to remind me to do that.
[jira] [Commented] (HBASE-7669) ROOT region wouldn't be handled by PRI-IPC-Handler
[ https://issues.apache.org/jira/browse/HBASE-7669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563732#comment-13563732 ] Hudson commented on HBASE-7669: --- Integrated in HBase-0.94-security #99 (See [https://builds.apache.org/job/HBase-0.94-security/99/]) HBASE-7669 ROOT region wouldn't be handled by PRI-IPC-Handler(Chunhui) (Revision 1438832) Result = FAILURE zjushch : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java ROOT region wouldn't be handled by PRI-IPC-Handler --- Key: HBASE-7669 URL: https://issues.apache.org/jira/browse/HBASE-7669 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.94.4 Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.96.0, 0.94.5 Attachments: 7669-94.patch, 7669.addendum, HBASE-7669.patch RPC requests about the ROOT region should be handled by PRI-IPC-Handler, just the same as for the META region.
[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss
[ https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563733#comment-13563733 ] Hudson commented on HBASE-7643: --- Integrated in HBase-0.94-security #99 (See [https://builds.apache.org/job/HBase-0.94-security/99/]) HBASE-7643 HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss (Revision 1438973) Result = FAILURE mbertozzi : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/backup/HFileArchiver.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/backup/TestHFileArchiving.java HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss --- Key: HBASE-7643 URL: https://issues.apache.org/jira/browse/HBASE-7643 Project: HBase Issue Type: Bug Affects Versions: hbase-6055, 0.96.0 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Blocker Fix For: 0.96.0, 0.94.5 Attachments: HBASE-7653-p4-v0.patch, HBASE-7653-p4-v1.patch, HBASE-7653-p4-v2.patch, HBASE-7653-p4-v3.patch, HBASE-7653-p4-v4.patch, HBASE-7653-p4-v5.patch, HBASE-7653-p4-v6.patch, HBASE-7653-p4-v7.patch * The master have an hfile cleaner thread (that is responsible for cleaning the /hbase/.archive dir) ** /hbase/.archive/table/region/family/hfile ** if the family/region/family directory is empty the cleaner removes it * The master can archive files (from another thread, e.g. DeleteTableHandler) * The region can archive files (from another server/process, e.g. compaction) The simplified file archiving code looks like this: {code} HFileArchiver.resolveAndArchive(...) 
{
  // ensure that the archive dir exists
  fs.mkdir(archiveDir);

  // move the file to the archive
  success = fs.rename(originalPath/fileName, archiveDir/fileName);

  // if the rename fails, delete the file without archiving
  if (!success) fs.delete(originalPath/fileName);
}
{code}
Since there is no synchronization between HFileArchiver.resolveAndArchive() and the cleaner run (different process, thread, ...), you can end up moving a file into a directory that no longer exists.
{code}
fs.mkdir(archiveDir);

// HFileCleaner chore starts at this point
// and the archive directory that we just ensured to be present gets removed.

// The rename at this point will fail since the parent directory is missing.
success = fs.rename(originalPath/fileName, archiveDir/fileName);
{code}
The problem with deleting the file without archiving it is that you lose data if a snapshot or a cloned table relies on that file being present.

Possible solutions:
* Create a ZooKeeper lock to notify the master ("Hey, I'm archiving something, wait a bit")
* Add an RS -> Master call to let the master remove the files and avoid this kind of situation
* Avoid removing empty directories from the archive if the table exists or is not disabled
* Add a try/catch around the fs.rename

The last one, the easiest one, looks like:
{code}
for (int i = 0; i < retries; ++i) {
  // ensure the archive directory is present
  fs.mkdir(archiveDir);

  // possible race here

  // try to archive the file
  success = fs.rename(originalPath/fileName, archiveDir/fileName);
  if (success) break;
}
{code}
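The retry idea from the description can be tried out locally with a small sketch. This is not the HBASE-7643 patch: it uses java.nio on the local filesystem as a stand-in for the Hadoop FileSystem calls, and the class name ArchiveRetrySketch, the method archiveWithRetry and the retry count are all hypothetical. Unlike the original pseudocode, it never deletes the source file; when every attempt fails it returns false and leaves the decision to the caller.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

final class ArchiveRetrySketch {

    /**
     * Moves file into archiveDir, recreating the directory before every
     * attempt so that a concurrent cleaner removing the empty directory
     * only costs one retry instead of losing the file.
     */
    static boolean archiveWithRetry(Path file, Path archiveDir, int retries)
            throws IOException {
        for (int i = 0; i < retries; ++i) {
            // ensure the archive directory is present
            Files.createDirectories(archiveDir);
            try {
                // possible race: the directory may vanish before the move
                Files.move(file, archiveDir.resolve(file.getFileName()));
                return true;
            } catch (NoSuchFileException e) {
                // If the source itself is gone, retrying cannot help.
                if (!Files.exists(file)) {
                    throw e;
                }
                // Otherwise assume the target directory was removed; retry.
            }
        }
        return false; // the file is still in its original location
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("archive-sketch");
        Path file = Files.createFile(tmp.resolve("hfile-0001"));
        Path archiveDir = tmp.resolve("archive").resolve("table")
                             .resolve("region").resolve("family");
        System.out.println("archived=" + archiveWithRetry(file, archiveDir, 3));
    }
}
```

Retrying narrows the race window but does not close it; the lock-based and master-side options listed in the description are the ones that would remove the race entirely.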
[jira] [Commented] (HBASE-7654) Add List<String> getCoprocessors() to HTableDescriptor
[ https://issues.apache.org/jira/browse/HBASE-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563734#comment-13563734 ] Hudson commented on HBASE-7654: --- Integrated in HBase-0.94-security #99 (See [https://builds.apache.org/job/HBase-0.94-security/99/]) HBASE-7654. Add List<String> getCoprocessors() to HTableDescriptor (Jean-Marc Spaggiari) (Revision 1438790) Result = FAILURE apurtell : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HTableDescriptor.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/TestHTableDescriptor.java Add List<String> getCoprocessors() to HTableDescriptor -- Key: HBASE-7654 URL: https://issues.apache.org/jira/browse/HBASE-7654 Project: HBase Issue Type: Bug Components: Client, Coprocessors Affects Versions: 0.96.0, 0.94.5 Reporter: Jean-Marc Spaggiari Assignee: Jean-Marc Spaggiari Priority: Critical Fix For: 0.96.0, 0.94.5 Attachments: HBASE-7654-v0-0.94.patch, HBASE-7654-v0-trunk.patch Add List<String> getCoprocessors() to HTableInterface to retrieve the list of coprocessors loaded into this table.
[jira] [Commented] (HBASE-7681) Address some recent random test failures
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563737#comment-13563737 ] Hudson commented on HBASE-7681: --- Integrated in HBase-0.94-security #100 (See [https://builds.apache.org/job/HBase-0.94-security/100/]) HBASE-7681 Addendum, close tables in TestRegionServerMetrics (Revision 1439025) Result = SUCCESS larsh : Files : * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerMetrics.java Address some recent random test failures Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 0.96.0, 0.94.5 Attachments: 7681-0.94-addendum.txt, 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-addendum.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt I've seen many unspecific test failures recently that cannot be reproduced locally even when running these tests in a loop for a very long time. Many of these tests, one way or the other, make assumptions w.r.t. wall clock time. While I cannot fix that, one option is to increase some of these timeouts a bit. This issue is to remind me to do that.
[jira] [Commented] (HBASE-7681) Address some recent random test failures
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563740#comment-13563740 ] Hudson commented on HBASE-7681: --- Integrated in HBase-0.94 #786 (See [https://builds.apache.org/job/HBase-0.94/786/]) HBASE-7681 Addendum, close tables in TestRegionServerMetrics (Revision 1439025) Result = SUCCESS larsh : Files : * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerMetrics.java Address some recent random test failures Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 0.96.0, 0.94.5 Attachments: 7681-0.94-addendum.txt, 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-addendum.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt I've seen many unspecific test failures recently that cannot be reproduced locally even when running these tests in a loop for a very long time. Many of these tests, one way or the other, make assumptions w.r.t. wall clock time. While I cannot fix that, one option is to increase some of these timeouts a bit. This issue is to remind me to do that.
[jira] [Updated] (HBASE-7683) TestMasterObserver fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-7683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars Hofhansl updated HBASE-7683: - Description: {code} org.apache.hadoop.hbase.InvalidFamilyOperationException: Column family 'fam2' does not exist at org.apache.hadoop.hbase.master.handler.TableEventHandler.hasColumnFamily(TableEventHandler.java:182) at org.apache.hadoop.hbase.master.handler.TableDeleteFamilyHandler.init(TableDeleteFamilyHandler.java:43) at org.apache.hadoop.hbase.master.HMaster.deleteColumn(HMaster.java:1303) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.SecureRpcEngine$Server.call(SecureRpcEngine.java:372) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426) {code} was: {code} org.apache.hadoop.hbase.InvalidFamilyOperationException: Column family 'fam2' does not exist at org.apache.hadoop.hbase.master.handler.TableEventHandler.hasColumnFamily(TableEventHandler.java:182) at org.apache.hadoop.hbase.master.handler.TableDeleteFamilyHandler.init(TableDeleteFamilyHandler.java:43) at org.apache.hadoop.hbase.master.HMaster.deleteColumn(HMaster.java:1303) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.SecureRpcEngine$Server.call(SecureRpcEngine.java:372) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426) {code} TestMasterObserver fails occasionally - Key: HBASE-7683 URL: https://issues.apache.org/jira/browse/HBASE-7683 Project: HBase Issue Type: Bug 
Reporter: Lars Hofhansl {code} org.apache.hadoop.hbase.InvalidFamilyOperationException: Column family 'fam2' does not exist at org.apache.hadoop.hbase.master.handler.TableEventHandler.hasColumnFamily(TableEventHandler.java:182) at org.apache.hadoop.hbase.master.handler.TableDeleteFamilyHandler.init(TableDeleteFamilyHandler.java:43) at org.apache.hadoop.hbase.master.HMaster.deleteColumn(HMaster.java:1303) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.SecureRpcEngine$Server.call(SecureRpcEngine.java:372) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426) {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-7683) TestMasterObserver fails occasionally
Lars Hofhansl created HBASE-7683: Summary: TestMasterObserver fails occasionally Key: HBASE-7683 URL: https://issues.apache.org/jira/browse/HBASE-7683 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl {code} org.apache.hadoop.hbase.InvalidFamilyOperationException: Column family 'fam2' does not exist at org.apache.hadoop.hbase.master.handler.TableEventHandler.hasColumnFamily(TableEventHandler.java:182) at org.apache.hadoop.hbase.master.handler.TableDeleteFamilyHandler.init(TableDeleteFamilyHandler.java:43) at org.apache.hadoop.hbase.master.HMaster.deleteColumn(HMaster.java:1303) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.SecureRpcEngine$Server.call(SecureRpcEngine.java:372) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426) {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7683) TestMasterObserver fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-7683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563744#comment-13563744 ] Lars Hofhansl commented on HBASE-7683: -- I think this log snippet is interesting. The sequence is this: # updating /observed_table/.tableinfo.03 # updating /observed_table/.tableinfo.04 # updating /observed_table/.tableinfo.05 # updating /observed_table/.tableinfo.06 # failed to delete /observed_table/.tableinfo.06 # updating /observed_table/.tableinfo.02 # cleaned up /observed_table/.tableinfo.02 So it seems version 2 was somehow updated out of sync {code} 2013-01-27 05:22:57,737 DEBUG [IPC Server handler 1 on 41253] client.ClientScanner(185): Finished with scanning at {NAME = '.META.,,1', STARTKEY = '', ENDKEY = '', ENCODED = 1028785192,} 2013-01-27 05:22:57,737 INFO [IPC Server handler 1 on 41253] master.MasterFileSystem(541): AddColumn. Table = observed_table HCD = {NAME = 'fam2', DATA_BLOCK_ENCODING = 'NONE', BLOOMFILTER = 'NONE', REPLICATION_SCOPE = '0', VERSIONS = '3', COMPRESSION = 'NONE', MIN_VERSIONS = '0', TTL = '2147483647', KEEP_DELETED_CELLS = 'false', BLOCKSIZE = '65536', IN_MEMORY = 'false', ENCODE_ON_DISK = 'true', BLOCKCACHE = 'true'} 2013-01-27 05:22:57,740 DEBUG [IPC Server handler 1 on 41253] util.FSTableDescriptors(501): hdfs://localhost:40948/user/jenkins/hbase/observed_table/.tmp/.tableinfo.02 exists; retrying up to 10 times 2013-01-27 05:22:57,750 INFO [IPC Server handler 1 on 41253] util.FSTableDescriptors(452): Updated tableinfo=hdfs://localhost:40948/user/jenkins/hbase/observed_table/.tableinfo.03 2013-01-27 05:22:57,761 INFO [IPC Server handler 1 on 41253] util.FSTableDescriptors(452): Updated tableinfo=hdfs://localhost:40948/user/jenkins/hbase/observed_table/.tableinfo.04 2013-01-27 05:22:57,764 DEBUG [IPC Server handler 0 on 41253] client.ClientScanner(90): Creating scanner over .META. 
starting at key 'observed_table,,' 2013-01-27 05:22:57,764 DEBUG [IPC Server handler 0 on 41253] client.ClientScanner(198): Advancing internal scanner to startKey at 'observed_table,,' 2013-01-27 05:22:57,767 INFO [IPC Server handler 0 on 41253] handler.TableEventHandler(91): Handling table operation C_M_MODIFY_FAMILY on table observed_table 2013-01-27 05:22:57,768 DEBUG [IPC Server handler 0 on 41253] client.ClientScanner(90): Creating scanner over .META. starting at key 'observed_table,,' 2013-01-27 05:22:57,768 DEBUG [IPC Server handler 0 on 41253] client.ClientScanner(198): Advancing internal scanner to startKey at 'observed_table,,' 2013-01-27 05:22:57,771 DEBUG [IPC Server handler 0 on 41253] client.ClientScanner(185): Finished with scanning at {NAME = '.META.,,1', STARTKEY = '', ENDKEY = '', ENCODED = 1028785192,} 2013-01-27 05:22:57,771 INFO [IPC Server handler 0 on 41253] master.MasterFileSystem(518): AddModifyColumn. Table = observed_table HCD = {NAME = 'fam2', DATA_BLOCK_ENCODING = 'NONE', BLOOMFILTER = 'NONE', REPLICATION_SCOPE = '0', VERSIONS = '25', COMPRESSION = 'NONE', MIN_VERSIONS = '0', TTL = '2147483647', KEEP_DELETED_CELLS = 'false', BLOCKSIZE = '65536', IN_MEMORY = 'false', ENCODE_ON_DISK = 'true', BLOCKCACHE = 'true'} 2013-01-27 05:22:57,787 INFO [IPC Server handler 0 on 41253] util.FSTableDescriptors(452): Updated tableinfo=hdfs://localhost:40948/user/jenkins/hbase/observed_table/.tableinfo.05 2013-01-27 05:22:57,799 INFO [IPC Server handler 0 on 41253] util.FSTableDescriptors(452): Updated tableinfo=hdfs://localhost:40948/user/jenkins/hbase/observed_table/.tableinfo.06 2013-01-27 05:22:57,801 DEBUG [IPC Server handler 1 on 41253] client.ClientScanner(90): Creating scanner over .META. 
starting at key 'observed_table,,' 2013-01-27 05:22:57,801 DEBUG [IPC Server handler 1 on 41253] client.ClientScanner(198): Advancing internal scanner to startKey at 'observed_table,,' 2013-01-27 05:22:57,808 INFO [pool-1-thread-1] client.HBaseAdmin(727): Started enable of observed_table 2013-01-27 05:22:57,809 DEBUG [pool-1-thread-1] zookeeper.ZKUtil(1571): hconnection-0x13c7a752b7c0006 Retrieved 8 byte(s) of data from znode /hbase/table/observed_table; data=ENABLING 2013-01-27 05:22:57,809 DEBUG [pool-1-thread-1] client.HBaseAdmin(684): Sleeping= 1000ms, waiting for all regions to be enabled in observed_table 2013-01-27 05:22:58,136 WARN [MASTER_TABLE_OPERATIONS-hemera.apache.org,41253,1359264165304-0] util.FSTableDescriptors(522): Failed delete of hdfs://localhost:40948/user/jenkins/hbase/observed_table/.tableinfo.01; continuing 2013-01-27 05:22:58,136 INFO [MASTER_TABLE_OPERATIONS-hemera.apache.org,41253,1359264165304-0] util.FSTableDescriptors(452): Updated tableinfo=hdfs://localhost:40948/user/jenkins/hbase/observed_table/.tableinfo.02 2013-01-27 05:22:58,138 DEBUG
[jira] [Commented] (HBASE-7683) TestMasterObserver fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-7683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563745#comment-13563745 ] Lars Hofhansl commented on HBASE-7683: -- If someone feels like looking at this that would be greatly appreciated, as I have exhausted my goodwill time to work on tests for a bit. TestMasterObserver fails occasionally - Key: HBASE-7683 URL: https://issues.apache.org/jira/browse/HBASE-7683 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl {code} org.apache.hadoop.hbase.InvalidFamilyOperationException: Column family 'fam2' does not exist at org.apache.hadoop.hbase.master.handler.TableEventHandler.hasColumnFamily(TableEventHandler.java:182) at org.apache.hadoop.hbase.master.handler.TableDeleteFamilyHandler.init(TableDeleteFamilyHandler.java:43) at org.apache.hadoop.hbase.master.HMaster.deleteColumn(HMaster.java:1303) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.SecureRpcEngine$Server.call(SecureRpcEngine.java:372) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426) {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (HBASE-7681) Address some recent random test failures
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars Hofhansl resolved HBASE-7681. -- Resolution: Fixed Committed to 0.94 and 0.96 Address some recent random test failures Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 0.96.0, 0.94.5 Attachments: 7681-0.94-addendum.txt, 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-addendum.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt I've seen many unspecific test failures recently that cannot be reproduced locally even when running these tests in a loop for a very long time. Many of these tests, one way or the other, make assumptions w.r.t. wall clock time. While I cannot fix that, one option is to increase some of these timeouts a bit. This issue is to remind me to do that.
[jira] [Commented] (HBASE-7681) Address some recent random test failures
[ https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563748#comment-13563748 ] Hudson commented on HBASE-7681: --- Integrated in HBase-TRUNK #3809 (See [https://builds.apache.org/job/HBase-TRUNK/3809/]) HBASE-7681 Addendum, close tables in TestRegionServerMetrics (Revision 1439026) Result = FAILURE larsh : Files : * /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerMetrics.java Address some recent random test failures Key: HBASE-7681 URL: https://issues.apache.org/jira/browse/HBASE-7681 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Assignee: Lars Hofhansl Fix For: 0.96.0, 0.94.5 Attachments: 7681-0.94-addendum.txt, 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-addendum.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt I've seen many unspecific test failures recently that cannot be reproduced locally even when running these tests in a loop for a very long time. Many of these tests, one way or the other, make assumptions w.r.t. wall clock time. While I cannot fix that, one option is to increase some of these timeouts a bit. This issue is to remind me to do that.