[jira] [Commented] (HBASE-7669) ROOT region wouldn't be handled by PRI-IPC-Handler

2013-01-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563404#comment-13563404
 ] 

Hudson commented on HBASE-7669:
---

Integrated in HBase-TRUNK #3802 (See 
[https://builds.apache.org/job/HBase-TRUNK/3802/])
HBASE-7669 Addendum fixes TestPriorityRpc (Revision 1438853)

 Result = FAILURE
tedyu : 
Files : 
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestPriorityRpc.java


 ROOT region wouldn't  be handled by PRI-IPC-Handler
 ---

 Key: HBASE-7669
 URL: https://issues.apache.org/jira/browse/HBASE-7669
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.4
Reporter: chunhui shen
Assignee: chunhui shen
 Fix For: 0.96.0, 0.94.5

 Attachments: 7669-94.patch, 7669.addendum, HBASE-7669.patch


 RPC requests for the ROOT region should be handled by PRI-IPC-Handler, just 
 as they are for the META region.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7671) Flushing memstore again after last failure could cause data loss

2013-01-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563405#comment-13563405
 ] 

Hadoop QA commented on HBASE-7671:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12566619/HBASE-7671v3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.security.access.TestTablePermissions

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4198//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4198//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4198//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4198//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4198//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4198//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4198//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4198//console

This message is automatically generated.

 Flushing memstore again after last failure could cause data loss
 

 Key: HBASE-7671
 URL: https://issues.apache.org/jira/browse/HBASE-7671
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.94.4
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.96.0, 0.94.5

 Attachments: HBASE-7671.patch, HBASE-7671v2.patch, HBASE-7671v3.patch


 See the following logs first:
 {code}
 2013-01-23 18:58:38,801 INFO org.apache.hadoop.hbase.regionserver.Store: 
 Flushed , sequenceid=9746535080, memsize=101.8m, into tmp file 
 hdfs://dw77.kgb.sqa.cm4:9900/hbase-test3/writetest1/8dc14e35b4d7c0e481e0bb30849cff7d/.tmp/bebeeecc56364b6c8126cf1dc6782a25
 2013-01-23 18:58:41,982 WARN org.apache.hadoop.hbase.regionserver.MemStore: 
 Snapshot called again without clearing previous. Doing nothing. Another 
 ongoing flush or did we fail last attempt?
 2013-01-23 18:58:43,274 INFO org.apache.hadoop.hbase.regionserver.Store: 
 Flushed , sequenceid=9746599334, memsize=101.8m, into tmp file 
 hdfs://dw77.kgb.sqa.cm4:9900/hbase-test3/writetest1/8dc14e35b4d7c0e481e0bb30849cff7d/.tmp/4eede32dc469480bb3d469aaff332313
 {code}
 The first memstore flush failed in commitFile() (the first log entry above), 
 which triggered a server abort. Another flush arrived immediately afterwards 
 (possibly caused by a move or split; the third log entry above) and 
 succeeded.
 The same memstore snapshot was therefore flushed with two different 
 sequence ids, which causes data loss when log edits are replayed.
 See the unit test case in the patch for details.
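
 One way to picture why a second sequence id for the same snapshot is dangerous is the toy model below. It is only a sketch under stated assumptions: it assumes that log replay skips every edit whose sequence id is at or below the id recorded for the flushed file, and all class and method names are hypothetical rather than HBase APIs. The small ids stand in for the real ones in the log excerpt.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of flush/replay bookkeeping; all names here are hypothetical, not HBase APIs. */
final class FlushReplayModel {

    /**
     * Returns the WAL sequence ids that would be lost: edits that are not in the
     * flushed file (id > snapshotMaxSeqId) but are skipped on replay because the
     * file claims to cover everything up to flushedSeqId.
     */
    static List<Long> lostEdits(List<Long> walSeqIds, long snapshotMaxSeqId, long flushedSeqId) {
        List<Long> lost = new ArrayList<>();
        for (long id : walSeqIds) {
            boolean inFlushedFile = id <= snapshotMaxSeqId;
            boolean replayedFromLog = id > flushedSeqId;
            if (!inFlushedFile && !replayedFromLog) {
                lost.add(id);
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        List<Long> wal = List.of(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L);
        // Flush recorded with the id taken at snapshot time: nothing is lost.
        System.out.println(lostEdits(wal, 4L, 4L));
        // Same snapshot re-flushed under a later id: edits 5 and 6 fall in the gap.
        System.out.println(lostEdits(wal, 4L, 6L));
    }
}
```

 In this model the loss is exactly the set of edits written between the two sequence ids, which is why the snapshot and its sequence id need to stay paired across a retried flush.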



[jira] [Commented] (HBASE-7521) fix HBASE-6060 (regions stuck in opening state) in 0.94

2013-01-26 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563406#comment-13563406
 ] 

ramkrishna.s.vasudevan commented on HBASE-7521:
---

@Sergey
I went through the patch. It seems fine with respect to the changes carried 
over from the old patch.
Having looked at the patch, I feel we could rename some variables, such as 
regionsFromRIT and regionsIntersection. Is it possible to rename them so that 
the names suggest what they are?
Overall it looks fine.
One question: did you look at the impact of now making the RS transition the 
node to OPENING? Thanks, Sergey.

 fix HBASE-6060 (regions stuck in opening state) in 0.94
 ---

 Key: HBASE-7521
 URL: https://issues.apache.org/jira/browse/HBASE-7521
 Project: HBase
  Issue Type: Bug
Reporter: Sergey Shelukhin
Assignee: Sergey Shelukhin
 Attachments: HBASE-7521-original-patch-ported-v0.patch, 
 HBASE-7521-v0.patch, HBASE-7521-v1.patch


 Discussion in HBASE-6060 implies that the fix there does not work on 0.94. 
 Still, we may want to fix the issue in 0.94 (via some different fix) because 
 regions stuck in opening for ridiculous amounts of time are not a good 
 thing to have.



[jira] [Commented] (HBASE-7671) Flushing memstore again after last failure could cause data loss

2013-01-26 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563421#comment-13563421
 ] 

ramkrishna.s.vasudevan commented on HBASE-7671:
---

I will review this today and get back, at least to understand the problem.



[jira] [Commented] (HBASE-7669) ROOT region wouldn't be handled by PRI-IPC-Handler

2013-01-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563470#comment-13563470
 ] 

Hudson commented on HBASE-7669:
---

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #376 (See 
[https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/376/])
HBASE-7669 Addendum fixes TestPriorityRpc (Revision 1438853)
HBASE-7669 ROOT region wouldn't be handled by PRI-IPC-Handler(Chunhui) 
(Revision 1438834)

 Result = FAILURE
tedyu : 
Files : 
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestPriorityRpc.java

zjushch : 
Files : 
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java




[jira] [Commented] (HBASE-7654) Add List<String> getCoprocessors() to HTableDescriptor

2013-01-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563471#comment-13563471
 ] 

Hudson commented on HBASE-7654:
---

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #376 (See 
[https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/376/])
HBASE-7654. Add List<String> getCoprocessors() to HTableDescriptor 
(Jean-Marc Spaggiari) (Revision 1438789)

 Result = FAILURE
apurtell : 
Files : 
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/HTableDescriptor.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/TestHTableDescriptor.java


 Add List<String> getCoprocessors() to HTableDescriptor
 --

 Key: HBASE-7654
 URL: https://issues.apache.org/jira/browse/HBASE-7654
 Project: HBase
  Issue Type: Bug
  Components: Client, Coprocessors
Affects Versions: 0.96.0, 0.94.5
Reporter: Jean-Marc Spaggiari
Assignee: Jean-Marc Spaggiari
Priority: Critical
 Fix For: 0.96.0, 0.94.5

 Attachments: HBASE-7654-v0-0.94.patch, HBASE-7654-v0-trunk.patch


 Add List<String> getCoprocessors() to HTableInterface to retrieve the list of 
 coprocessors loaded into this table.



[jira] [Commented] (HBASE-7516) Make compaction policy pluggable

2013-01-26 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563507#comment-13563507
 ] 

Jimmy Xiang commented on HBASE-7516:


Since HBASE-7571 is in, I will rebase and post another patch.

 Make compaction policy pluggable
 

 Key: HBASE-7516
 URL: https://issues.apache.org/jira/browse/HBASE-7516
 Project: HBase
  Issue Type: Sub-task
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
 Attachments: HBASE-7516-v0.patch, HBASE-7516-v1.patch, 
 HBASE-7516-v2.patch, trunk-7516.patch, trunk-7516_v2.patch


 Currently, the compaction selection is pluggable. It would be great to make 
 the compaction algorithm pluggable too, so that we can implement and 
 experiment with other compaction algorithms.



[jira] [Commented] (HBASE-7669) ROOT region wouldn't be handled by PRI-IPC-Handler

2013-01-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563518#comment-13563518
 ] 

Ted Yu commented on HBASE-7669:
---

Since this change went in, TestSplitTransactionOnCluster has failed in trunk 
from #3801 to #3803.

I wasn't able to reproduce the test failure locally using java 1.6.

From 
https://builds.apache.org/job/HBase-TRUNK/3801/testReport/junit/org.apache.hadoop.hbase.regionserver/TestSplitTransactionOnCluster/testShutdownSimpleFixup/:
{code}
2013-01-26 05:28:56,063 ERROR 
[RS_OPEN_REGION-vesta.apache.org,40191,1359177837124-2] 
handler.OpenRegionHandler(376): Failed transitioning node 
testShutdownSimpleFixup,mnk,1359178057447.f4b0d2459f3a827fb4b950777b219e03. 
from OPENING to FAILED_OPEN
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
BadVersion for /hbase/region-in-transition/f4b0d2459f3a827fb4b950777b219e03
at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:357)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:849)
at 
org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:773)
at 
org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:711)
at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.tryTransitionFromOpeningToFailedOpen(OpenRegionHandler.java:363)
at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:144)
at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:202)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
2013-01-26 05:28:56,065 INFO  [hbase-am-pool-79-thread-4] 
master.AssignmentManager(1596): Assigning region 
testMasterRestartAtRegionSplitPendingCatalogJanitor,mnk,1359177884826.317c20e7b762e72f3423ebc44fdbd0b5.
 to vesta.apache.org,40191,1359177837124
2013-01-26 05:28:56,064 DEBUG [hbase-am-zkevent-worker-pool-80-thread-9] 
master.AssignmentManager(641): Handling transition=M_ZK_REGION_OFFLINE, 
server=vesta.apache.org,40191,1359177837124, 
region=f4b0d2459f3a827fb4b950777b219e03, current state from region state map 
={testShutdownSimpleFixup,mnk,1359178057447.f4b0d2459f3a827fb4b950777b219e03. 
state=OFFLINE, ts=1359178134450, server=null}
{code}
Not sure if the above was the cause of the test failure.



[jira] [Commented] (HBASE-7669) ROOT region wouldn't be handled by PRI-IPC-Handler

2013-01-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563520#comment-13563520
 ] 

Ted Yu commented on HBASE-7669:
---

Well, build #3804 almost passed: only two tests in the example module failed.



[jira] [Commented] (HBASE-7681) Increase timeouts in some more test...?

2013-01-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563528#comment-13563528
 ] 

Ted Yu commented on HBASE-7681:
---

TestCatalogTrackerOnCluster, which failed in build #778, wasn't touched by 
your patch.

TestRegionServerCoprocessorExceptionWithAbort failed in build #776.

TestReplicationWithCompression.queueFailover and TestReplication.queueFailover 
failed in build #770.

Shall we focus on tests that recently failed and try to find the root cause?

I think TestNodeHealthCheckChore can be modified so that it passes reliably; 
it is a small test whose failure would stop the build early.

 Increase timeouts in some more test...?
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally, even when running these tests in a loop for a very long time.
 Many of these tests make assumptions, one way or another, about wall clock 
 time. While I cannot fix that, one option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Commented] (HBASE-7681) Increase timeouts in some more test...?

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563545#comment-13563545
 ] 

Lars Hofhansl commented on HBASE-7681:
--

Personally, I have spent more time than I should have fixing tests during the 
past month or so; I won't have much more time to spend on this.
You think this patch is not commit-worthy, Ted. I'd be fine closing this as 
Won't Fix if you don't think it'll help.



[jira] [Commented] (HBASE-7681) Increase timeouts in some more test...?

2013-01-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563546#comment-13563546
 ] 

Ted Yu commented on HBASE-7681:
---

For 0.94, on the one hand we want to make it more stable. Yet more features 
are coming in.

I think we can enhance test-patch.sh so that it recognizes a designated 
pattern in the patch filename and runs the 0.94 test suite.
E.g., when the filename contains [^0-9]0.94[^0-9], the 0.94 code would be 
checked out.
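
The filename check proposed above could be sketched as follows. test-patch.sh is a shell script and is not shown in this thread, so this is only an illustration in Java: the class and method names are hypothetical, and the dot is escaped here as an assumption, since in the pattern as written an unescaped `.` would match any character.

```java
import java.util.regex.Pattern;

/** Hypothetical helper; test-patch.sh itself is a shell script and is not shown here. */
final class PatchBranchDetector {
    // The pattern from the comment, with the dot escaped so it matches a literal '.'.
    private static final Pattern BRANCH_094 = Pattern.compile("[^0-9]0\\.94[^0-9]");

    /** True if the attachment name carries the 0.94 marker somewhere inside it. */
    static boolean targets094(String patchFileName) {
        return BRANCH_094.matcher(patchFileName).find();
    }

    public static void main(String[] args) {
        for (String name : new String[] {
                "7681-0.94.txt", "HBASE-7654-v0-0.94.patch", "7669-94.patch", "HBASE-7671v3.patch"}) {
            System.out.println(name + " -> " + targets094(name));
        }
    }
}
```

With this pattern, attachment names such as 7681-0.94.txt and HBASE-7654-v0-0.94.patch would be routed to 0.94, while 7669-94.patch would not. A name that starts or ends with 0.94 would also be missed, because the pattern requires a non-digit character on both sides.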

I am running 0.94 test suite locally to see if I can reproduce hanging test(s).

I will also dig into the above mentioned test failures.



[jira] [Commented] (HBASE-7681) Increase timeouts in some more test...?

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563549#comment-13563549
 ] 

Lars Hofhansl commented on HBASE-7681:
--

Thanks Ted.
As I said on the mailing list, these seem to be qualitatively different from 
what we've seen previously. In the previous round of test fixes, I was able to 
reproduce the failures locally by running the tests in a loop long enough. 
With this new round of failures that does not work: I ran the tests in a loop 
hundreds of times and they did not fail.

I wonder if something changed with the Jenkins VMs (such as them being 
suspended for long periods of time, which would hose any test relying on wall 
clock time directly or indirectly).




[jira] [Commented] (HBASE-7681) Increase timeouts in some more test...?

2013-01-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563564#comment-13563564
 ] 

Ted Yu commented on HBASE-7681:
---

From 
https://builds.apache.org/job/HBase-0.94/776/testReport/junit/org.apache.hadoop.hbase.coprocessor/TestRegionServerCoprocessorExceptionWithAbort/testExceptionFromCoprocessorDuringPut/
 :
{code}
2013-01-26 01:03:43,509 ERROR 
[RS_OPEN_REGION-juno.apache.org,41700,1359162211501-1] 
coprocessor.CoprocessorHost$Environment(657): Error starting coprocessor 
org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort$BuggyRegionObserver
org.apache.hadoop.hbase.regionserver.WrongRegionException: Requested row out of 
range for row lock on HRegion 
observed_table,,1359162223098.d367ba98501279646d44c0c1530fb6c4., startKey='', 
getEndKey()='bbb', row='some row'
at 
org.apache.hadoop.hbase.regionserver.HRegion.checkRow(HRegion.java:3191)
at 
org.apache.hadoop.hbase.regionserver.HRegion.internalObtainRowLock(HRegion.java:3240)
at 
org.apache.hadoop.hbase.regionserver.HRegion.getLock(HRegion.java:3336)
at 
org.apache.hadoop.hbase.coprocessor.SimpleRegionObserver.start(SimpleRegionObserver.java:110)
at 
org.apache.hadoop.hbase.coprocessor.CoprocessorHost$Environment.startup(CoprocessorHost.java:654)
at 
org.apache.hadoop.hbase.coprocessor.CoprocessorHost.loadInstance(CoprocessorHost.java:312)
at 
org.apache.hadoop.hbase.coprocessor.CoprocessorHost.loadSystemCoprocessors(CoprocessorHost.java:149)
at 
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.init(RegionCoprocessorHost.java:144)
at org.apache.hadoop.hbase.regionserver.HRegion.init(HRegion.java:464)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at 
org.apache.hadoop.hbase.regionserver.HRegion.newHRegion(HRegion.java:3937)
at 
org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4118)
at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:332)
at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
{code}
BuggyRegionObserver extends SimpleRegionObserver. If the coprocessor cannot 
start, this test cannot succeed.
Here is SimpleRegionObserver#start():
{code}
  public void start(CoprocessorEnvironment e) throws IOException {
    // this only makes sure that leases and locks are available to coprocessors
    ...
    Integer lid = re.getRegion().getLock(null, Bytes.toBytes("some row"), true);
    re.getRegion().releaseRowLock(lid);
{code}
We should at least wrap the call to getLock() with try/catch.
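
A minimal sketch of such a guard is below. It does not use the real HBase classes: RowLockingRegion is a hypothetical stand-in for the two HRegion calls in the excerpt, and the sketch assumes the out-of-range failure surfaces as an IOException, which is what the `throws IOException` on start() suggests.

```java
import java.io.IOException;

/** Hypothetical stand-in for the two HRegion calls used above; not the real class. */
interface RowLockingRegion {
    Integer getLock(Integer lockId, byte[] row, boolean waitForLock) throws IOException;
    void releaseRowLock(Integer lockId);
}

final class RowLockProbe {
    /**
     * Tries to take and release a row lock. Returns false instead of propagating
     * the exception when the region rejects the row (e.g. it is out of range).
     */
    static boolean probe(RowLockingRegion region, byte[] row) {
        Integer lid = null;
        try {
            lid = region.getLock(null, row, true);
            return lid != null;
        } catch (IOException e) {
            // Assumption: the out-of-range failure surfaces as an IOException.
            return false;
        } finally {
            // Only release a lock that was actually obtained.
            if (lid != null) {
                region.releaseRowLock(lid);
            }
        }
    }
}
```

With a guard like this the coprocessor could still start on a region that does not contain the probe row, at the cost of silently skipping the lock check for that region.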



[jira] [Commented] (HBASE-4709) Hadoop metrics2 setup in test MiniDFSClusters spewing JMX errors

2013-01-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563565#comment-13563565
 ] 

Ted Yu commented on HBASE-4709:
---

InstanceAlreadyExistsException appears in 0.94 test output as well
e.g. 
https://builds.apache.org/job/HBase-0.94/776/testReport/junit/org.apache.hadoop.hbase.coprocessor/TestRegionServerCoprocessorExceptionWithAbort/testExceptionFromCoprocessorDuringPut/

Shall we port the fix to 0.94 ?

 Hadoop metrics2 setup in test MiniDFSClusters spewing JMX errors
 

 Key: HBASE-4709
 URL: https://issues.apache.org/jira/browse/HBASE-4709
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.90.4, 0.92.0, 0.94.0, 0.94.1, 0.96.0
Reporter: Gary Helmling
Priority: Minor
 Fix For: 0.96.0

 Attachments: 4709.v2.patch, 4709.v2.patch, 4709_workaround.v1.patch


 Since switching over HBase to build with Hadoop 0.20.205.0, we've been 
 getting a lot of metrics related errors in the log files for tests:
 {noformat}
 2011-10-30 22:00:22,858 INFO  [main] log.Slf4jLog(67): jetty-6.1.26
 2011-10-30 22:00:22,871 INFO  [main] log.Slf4jLog(67): Extract 
 jar:file:/home/jenkins/.m2/repository/org/apache/hadoop/hadoop-core/0.20.205.0/hadoop-core-0.20.205.0.jar!/webapps/datanode
  to /tmp/Jetty_localhost_55751_datanode.kw16hy/webapp
 2011-10-30 22:00:23,048 INFO  [main] log.Slf4jLog(67): Started 
 SelectChannelConnector@localhost:55751
 Starting DataNode 1 with dfs.data.dir: 
 /home/jenkins/jenkins-slave/workspace/HBase-TRUNK/trunk/target/test-data/7ba65a16-03ad-4624-b769-57405945ef58/dfscluster_3775fc23-1b51-4966-8133-205564bae762/dfs/data/data3,/home/jenkins/jenkins-slave/workspace/HBase-TRUNK/trunk/target/test-data/7ba65a16-03ad-4624-b769-57405945ef58/dfscluster_3775fc23-1b51-4966-8133-205564bae762/dfs/data/data4
 2011-10-30 22:00:23,237 WARN  [main] impl.MetricsSystemImpl(137): Metrics 
 system not started: Cannot locate configuration: tried 
 hadoop-metrics2-datanode.properties, hadoop-metrics2.properties
 2011-10-30 22:00:23,237 WARN  [main] util.MBeans(59): Hadoop:service=DataNode,name=MetricsSystem,sub=Control
 javax.management.InstanceAlreadyExistsException: MXBean already registered with name Hadoop:service=NameNode,name=MetricsSystem,sub=Control
   at com.sun.jmx.mbeanserver.MXBeanLookup.addReference(MXBeanLookup.java:120)
   at com.sun.jmx.mbeanserver.MXBeanSupport.register(MXBeanSupport.java:143)
   at com.sun.jmx.mbeanserver.MBeanSupport.preRegister2(MBeanSupport.java:183)
   at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:941)
   at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:917)
   at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:312)
   at com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:482)
   at org.apache.hadoop.metrics2.util.MBeans.register(MBeans.java:56)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.initSystemMBean(MetricsSystemImpl.java:500)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.init(MetricsSystemImpl.java:140)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.init(DefaultMetricsSystem.java:40)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.initialize(DefaultMetricsSystem.java:50)
   at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1483)
   at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1459)
   at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
   at org.apache.hadoop.hdfs.MiniDFSCluster.init(MiniDFSCluster.java:280)
   at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniDFSCluster(HBaseTestingUtility.java:349)
   at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:518)
   at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:474)
   at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:461)
 {noformat}
 This seems to be due to errors initializing the new hadoop metrics2 code by 
 default, when running in a mini cluster.  The errors themselves seem to be 
 harmless -- they're not breaking any tests -- but we should figure out what 
 configuration we need to eliminate them.
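For reference, the exception type in the trace is what any MBeanServer raises when a second bean is registered under a name that is already taken. The following is only a minimal sketch of that condition using plain JDK JMX classes; it does not touch the Hadoop metrics2 code, and the class name DuplicateMBeanDemo and the "Demo:" object name are made up for illustration.

```java
import javax.management.InstanceAlreadyExistsException;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;
import javax.management.timer.Timer;

// Hypothetical demo class; not part of Hadoop or HBase.
class DuplicateMBeanDemo {
    /**
     * Registers two MBeans under the same name on a private MBeanServer.
     * Returns {firstSucceeded, secondSucceeded}.
     */
    static boolean[] registerTwice(String objectName) {
        MBeanServer server = MBeanServerFactory.newMBeanServer();
        try {
            ObjectName name = new ObjectName(objectName);
            return new boolean[] { tryRegister(server, name), tryRegister(server, name) };
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    private static boolean tryRegister(MBeanServer server, ObjectName name) throws JMException {
        try {
            // javax.management.timer.Timer is a standard MBean that ships with the JDK.
            server.registerMBean(new Timer(), name);
            return true;
        } catch (InstanceAlreadyExistsException e) {
            // Same exception type as in the WARN above: the name is already in use.
            return false;
        }
    }

    public static void main(String[] args) {
        boolean[] result = registerTwice("Demo:service=DataNode,name=MetricsSystem,sub=Control");
        System.out.println("first registration succeeded: " + result[0]);
        System.out.println("second registration succeeded: " + result[1]);
    }
}
```

In a mini cluster several daemons share one JVM and therefore one platform MBeanServer, which is presumably why a name chosen once per process collides there.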

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7681) Increase timeouts in some more test...?

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563580#comment-13563580
 ] 

Lars Hofhansl commented on HBASE-7681:
--

Interesting. So that is only failing when we load that coprocessor into a 
region which has a non-empty start key  some row or a non-empty end key  
some row.

I added that part to SimpleRegionObserver. We should just remove it. It is just 
making sure that nobody makes HRegion.getLock(...) private again.


 Increase timeouts in some more test...?
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, an option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.
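As an illustration of the kind of timeout handling discussed here, a test can poll for a condition up to a generous deadline instead of sleeping for a fixed time. This is only a sketch under that assumption; WaitUtil and waitFor are hypothetical names and are not taken from the attached patch.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper; not the code in 7681-0.94.txt.
class WaitUtil {
    /**
     * Polls the condition until it returns true or timeoutMs has elapsed.
     * Returns true if the condition was met in time, false otherwise.
     */
    static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition) {
        // nanoTime is monotonic, so the deadline is unaffected by clock adjustments.
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // The condition becomes true after roughly 50 ms; the timeout is far larger.
        boolean met = waitFor(5_000, 10, () -> System.currentTimeMillis() - start >= 50);
        System.out.println("condition met: " + met);
    }
}
```

A larger timeout in such a loop only costs time when the test is going to fail anyway; a passing run returns as soon as the condition holds.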



[jira] [Created] (HBASE-7682) Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()

2013-01-26 Thread Matteo Bertozzi (JIRA)
Matteo Bertozzi created HBASE-7682:
--

 Summary: Replace HRegion custom File-System debug, with 
FSUtils.logFileSystemState()
 Key: HBASE-7682
 URL: https://issues.apache.org/jira/browse/HBASE-7682
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.96.0
Reporter: Matteo Bertozzi
Assignee: Matteo Bertozzi
Priority: Trivial
 Fix For: 0.96.0
 Attachments: HBASE-7682-v0.patch

We can remove some code from HRegion by using the new 
FSUtils.logFileSystemState() instead of the custom HRegion.listPaths()
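For readers unfamiliar with this kind of debug helper, the idea is to dump a directory tree so that the on-disk state shows up in the test log. The sketch below uses java.nio.file as a stand-in for the Hadoop FileSystem API and is not the FSUtils implementation; FsStateDump, describe and demo are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical local-file-system analogue; not HBase code.
class FsStateDump {
    /** Returns one line per entry below root, indented by depth; directories end with '/'. */
    static List<String> describe(Path root) {
        List<String> lines = new ArrayList<>();
        try {
            walk(root, "", lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    private static void walk(Path dir, String indent, List<String> out) throws IOException {
        List<Path> children = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path child : stream) {
                children.add(child);
            }
        }
        Collections.sort(children); // stable output order for logs
        for (Path child : children) {
            boolean isDir = Files.isDirectory(child);
            out.add(indent + child.getFileName() + (isDir ? "/" : ""));
            if (isDir) {
                walk(child, indent + "  ", out);
            }
        }
    }

    /** Builds a tiny region/family/hfile layout in a temp dir and describes it. */
    static List<String> demo() {
        try {
            Path root = Files.createTempDirectory("fs-state");
            Path family = Files.createDirectories(root.resolve("region").resolve("family"));
            Files.createFile(family.resolve("hfile"));
            return describe(root);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```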



[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Lars Hofhansl (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-7681:
-

Summary: Investigate recent random test failures in 0.94  (was: Increase 
timeouts in some more test...?)

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, an option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Updated] (HBASE-7682) Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()

2013-01-26 Thread Matteo Bertozzi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matteo Bertozzi updated HBASE-7682:
---

Attachment: HBASE-7682-v0.patch

 Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()
 ---

 Key: HBASE-7682
 URL: https://issues.apache.org/jira/browse/HBASE-7682
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.96.0
Reporter: Matteo Bertozzi
Assignee: Matteo Bertozzi
Priority: Trivial
 Fix For: 0.96.0

 Attachments: HBASE-7682-v0.patch


 We can remove some code from HRegion by using the new 
 FSUtils.logFileSystemState() instead of the custom HRegion.listPaths()



[jira] [Commented] (HBASE-7681) Increase timeouts in some more test...?

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563581#comment-13563581
 ] 

Lars Hofhansl commented on HBASE-7681:
--

In the latest run I see a failure in TestImportTsv caused by this:
{code}
2013-01-26 18:05:13,972 ERROR [pool-30-thread-1] mapred.JobTracker(4228): Job initialization failed:
java.io.IOException: Number of maps in JobConf doesn't match number of recieved splits for job job_20130126180500315_0001! numMapTasks=2, #splits=1
    at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:703)
    at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:4207)
    at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
{code}


 Increase timeouts in some more test...?
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, an option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Updated] (HBASE-7682) Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()

2013-01-26 Thread Matteo Bertozzi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matteo Bertozzi updated HBASE-7682:
---

Status: Patch Available  (was: Open)

 Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()
 ---

 Key: HBASE-7682
 URL: https://issues.apache.org/jira/browse/HBASE-7682
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.96.0
Reporter: Matteo Bertozzi
Assignee: Matteo Bertozzi
Priority: Trivial
 Fix For: 0.96.0

 Attachments: HBASE-7682-v0.patch


 We can remove some code from HRegion by using the new 
 FSUtils.logFileSystemState() instead of the custom HRegion.listPaths()



[jira] [Commented] (HBASE-7507) Make memstore flush be able to retry after exception

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563583#comment-13563583
 ] 

Lars Hofhansl commented on HBASE-7507:
--

I think this could be a candidate for revert in light of the recent unspecific 
test failures. I'll double check before I do that.

 Make memstore flush be able to retry after exception
 

 Key: HBASE-7507
 URL: https://issues.apache.org/jira/browse/HBASE-7507
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.94.3
Reporter: chunhui shen
Assignee: chunhui shen
 Fix For: 0.96.0, 0.94.5

 Attachments: 7507-94.patch, 7507-trunk v1.patch, 7507-trunk v2.patch, 
 7507-trunkv3.patch


 We abort the regionserver if a memstore flush throws an exception.
 I think we could retry to make the regionserver more stable, because the file 
 system may be unavailable for a transient period, e.g. when switching 
 namenodes in a NameNode HA environment.
 {code}
 HRegion#internalFlushcache(){
 ...
 try {
 ...
 }catch(Throwable t){
 DroppedSnapshotException dse = new DroppedSnapshotException("region: " +
   Bytes.toStringBinary(getRegionName()));
 dse.initCause(t);
 throw dse;
 }
 ...
 }
 MemStoreFlusher#flushRegion(){
 ...
 try {
 region.flushcache();
 ...
 }catch(DroppedSnapshotException ex){
 server.abort("Replay of HLog required. Forcing server shutdown", ex);
 }
 ...
 }
 {code}
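The retry idea in the description can be sketched in isolation as a bounded loop around an operation that may fail with a transient IOException. This is not the attached patch and does not use HBase classes; RetryingFlush, IoAction and withRetries are hypothetical names, and the attempt count and pause are arbitrary.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch; not the code in 7507-94.patch.
class RetryingFlush {
    /** Stand-in for an operation that may hit a transient file system error. */
    interface IoAction<T> {
        T run() throws IOException;
    }

    /**
     * Runs the action up to maxAttempts times, pausing between failed attempts.
     * Throws UncheckedIOException wrapping the last failure if all attempts fail.
     */
    static <T> T withRetries(int maxAttempts, long pauseMs, IoAction<T> action) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.run();
            } catch (IOException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break;
                }
                try {
                    Thread.sleep(pauseMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break; // stop retrying when interrupted
                }
            }
        }
        throw new UncheckedIOException("giving up after retries", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulates a file system that recovers on the third attempt.
        String result = withRetries(3, 10, () -> {
            if (++calls[0] < 3) {
                throw new IOException("namenode switching");
            }
            return "flushed after " + calls[0] + " attempts";
        });
        System.out.println(result);
    }
}
```

Only when the attempts are exhausted does the caller see a failure, which is the point at which aborting the server would still apply.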



[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-7681:
--

Attachment: 7681-94-v1.txt

Work in progress.

Please let me know which test you're working on so that we can make faster progress.

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94.txt, 7681-94-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, an option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563590#comment-13563590
 ] 

Lars Hofhansl commented on HBASE-7681:
--

Did you mean to use org.mortbay.log.Log?

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94.txt, 7681-94-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, an option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-7681:
--

Attachment: 7681-94-v2.txt

Patch v2 corrects logging.

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, an option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563593#comment-13563593
 ] 

Ted Yu commented on HBASE-7681:
---

w.r.t. the mismatch between the number of maps and splits, here is how JobConf 
retrieves the number of maps:
{code}
  public int getNumMapTasks() { return getInt("mapred.map.tasks", 1); }
{code}
I searched the 0.94 code base and didn't see the above config param.
Looks like this error may be out of our control.

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, an option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-7681:
--

Attachment: 7681-94-v3.txt

For TestNodeHealthCheckChore, I moved the call to checker.init() earlier so 
that there is enough time for the checker to initialize (hopefully).

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt, 
 7681-94-v3.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, an option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Commented] (HBASE-7682) Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()

2013-01-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563599#comment-13563599
 ] 

Ted Yu commented on HBASE-7682:
---

+1 on patch.

 Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()
 ---

 Key: HBASE-7682
 URL: https://issues.apache.org/jira/browse/HBASE-7682
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.96.0
Reporter: Matteo Bertozzi
Assignee: Matteo Bertozzi
Priority: Trivial
 Fix For: 0.96.0

 Attachments: HBASE-7682-v0.patch


 We can remove some code from HRegion by using the new 
 FSUtils.logFileSystemState() instead of the custom HRegion.listPaths()



[jira] [Commented] (HBASE-7682) Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()

2013-01-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563607#comment-13563607
 ] 

Hadoop QA commented on HBASE-7682:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12566635/HBASE-7682-v0.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4199//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4199//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4199//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4199//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4199//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4199//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4199//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4199//console

This message is automatically generated.

 Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()
 ---

 Key: HBASE-7682
 URL: https://issues.apache.org/jira/browse/HBASE-7682
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.96.0
Reporter: Matteo Bertozzi
Assignee: Matteo Bertozzi
Priority: Trivial
 Fix For: 0.96.0

 Attachments: HBASE-7682-v0.patch


 We can remove some code from HRegion by using the new 
 FSUtils.logFileSystemState() instead of the custom HRegion.listPaths()



[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-7681:
--

Attachment: 7681-trunk-v1.txt

Trunk patch v1 mirrors patch v3 for 0.94

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt, 
 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, an option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-7681:
--

Status: Patch Available  (was: Open)

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt, 
 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, an option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563627#comment-13563627
 ] 

Hadoop QA commented on HBASE-7681:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12566640/7681-trunk-v1.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 9 new 
or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4200//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4200//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4200//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4200//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4200//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4200//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4200//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4200//console

This message is automatically generated.

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt, 
 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, an option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563633#comment-13563633
 ] 

Lars Hofhansl commented on HBASE-7681:
--

Cool.

I think we should include my change to TestAdmin. I've seen this problem in 
TestSplitTransactionOnCluster, where we play with splitting, etc.: many failed 
assertions would be masked by the test hanging in the finally clause, because 
the table could not be deleted.

Similarly, I think the timeout increases I had there are also valid: they 
keep the test from looping forever, but are less likely to interfere with a 
successful run.

Lemme make a combined patch.
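To illustrate the masking problem in general terms: if cleanup in a finally clause can hang, the original assertion failure is never reported. One possible way to keep it visible is to bound the cleanup with a timeout. The sketch below is an assumption about how that could look, not the change made to TestAdmin; BoundedCleanup and runWithTimeout are hypothetical names.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; not the TestAdmin change from the patch.
class BoundedCleanup {
    /** Runs cleanup on a helper thread; returns true only if it completed normally in time. */
    static boolean runWithTimeout(Runnable cleanup, long timeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "cleanup");
            t.setDaemon(true); // a hung cleanup must not keep the JVM alive
            return t;
        });
        Future<?> future = pool.submit(cleanup);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the hung cleanup
            return false;
        } catch (ExecutionException e) {
            return false; // the cleanup itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    // Simulates a table delete that never returns.
    private static void hang() {
        try {
            Thread.sleep(60_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        try {
            try {
                throw new AssertionError("the real test failure");
            } finally {
                boolean cleaned = runWithTimeout(BoundedCleanup::hang, 100);
                System.out.println("cleanup finished in time: " + cleaned);
            }
        } catch (AssertionError e) {
            System.out.println("failure still visible: " + e.getMessage());
        }
    }
}
```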

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt, 
 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, an option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563634#comment-13563634
 ] 

Lars Hofhansl commented on HBASE-7681:
--

bq. Looks like this error may be out of our control.

Yeah, I came to the same conclusion.


 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt, 
 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, an option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Commented] (HBASE-7507) Make memstore flush be able to retry after exception

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563639#comment-13563639
 ] 

Lars Hofhansl commented on HBASE-7507:
--

I would like to revert this from 0.94. Unless we wrap every write to HDFS 
inside a retry loop we have not gained anything anyway.
Please speak up if you disagree.

 Make memstore flush be able to retry after exception
 

 Key: HBASE-7507
 URL: https://issues.apache.org/jira/browse/HBASE-7507
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.94.3
Reporter: chunhui shen
Assignee: chunhui shen
 Fix For: 0.96.0, 0.94.5

 Attachments: 7507-94.patch, 7507-trunk v1.patch, 7507-trunk v2.patch, 
 7507-trunkv3.patch


 We abort the regionserver if a memstore flush throws an exception.
 I think we could retry to make the regionserver more stable, because the file 
 system may be unavailable for a transient period, e.g. when switching 
 namenodes in a NameNode HA environment.
 {code}
 HRegion#internalFlushcache(){
 ...
 try {
 ...
 }catch(Throwable t){
 DroppedSnapshotException dse = new DroppedSnapshotException("region: " +
   Bytes.toStringBinary(getRegionName()));
 dse.initCause(t);
 throw dse;
 }
 ...
 }
 MemStoreFlusher#flushRegion(){
 ...
 try {
 region.flushcache();
 ...
 }catch(DroppedSnapshotException ex){
 server.abort("Replay of HLog required. Forcing server shutdown", ex);
 }
 ...
 }
 {code}



[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563643#comment-13563643
 ] 

Lars Hofhansl commented on HBASE-7643:
--

You going to commit [~mbertozzi]?

 HFileArchiver.resolveAndArchive() race condition may lead to snapshot data 
 loss
 ---

 Key: HBASE-7643
 URL: https://issues.apache.org/jira/browse/HBASE-7643
 Project: HBase
  Issue Type: Bug
Affects Versions: hbase-6055, 0.96.0
Reporter: Matteo Bertozzi
Assignee: Matteo Bertozzi
Priority: Blocker
 Fix For: 0.96.0, 0.94.5

 Attachments: HBASE-7653-p4-v0.patch, HBASE-7653-p4-v1.patch, 
 HBASE-7653-p4-v2.patch, HBASE-7653-p4-v3.patch, HBASE-7653-p4-v4.patch, 
 HBASE-7653-p4-v5.patch, HBASE-7653-p4-v6.patch, HBASE-7653-p4-v7.patch


  * The master has an hfile cleaner thread (which is responsible for cleaning 
 the /hbase/.archive dir)
  ** /hbase/.archive/table/region/family/hfile
  ** if the family/region/family directory is empty the cleaner removes it
  * The master can archive files (from another thread, e.g. DeleteTableHandler)
  * The region can archive files (from another server/process, e.g. compaction)
 The simplified file archiving code looks like this:
 {code}
 HFileArchiver.resolveAndArchive(...) {
   // ensure that the archive dir exists
   fs.mkdir(archiveDir);
   // move the file to the archive
   success = fs.rename(originalPath/fileName, archiveDir/fileName);
   // if the rename failed, delete the file without archiving
   if (!success) fs.delete(originalPath/fileName);
 }
 {code}
 Since there's no synchronization between HFileArchiver.resolveAndArchive() 
 and the cleaner run (different process, thread, ...), you can end up 
 moving a file into a directory that no longer exists.
 {code}
 fs.mkdir(archiveDir);
 // HFileCleaner chore starts at this point
 // and the archiveDirectory that we just ensured to be present gets removed.
 // The rename at this point will fail since the parent directory is missing.
 success = fs.rename(originalPath/fileName, archiveDir/fileName);
 {code}
 The problem with deleting the file without archiving it is that you lose data 
 if a snapshot or a cloned table relies on that file being present.
 Possible solutions:
  * Create a ZooKeeper lock to notify the master ("Hey, I'm archiving 
 something, wait a bit")
  * Add an RS -> Master call to let the master remove files and avoid this 
 kind of situation
  * Avoid removing empty directories from the archive if the table exists or 
 is not disabled
  * Add a try/catch around the fs.rename
 The last one, the easiest, looks like:
 {code}
 for (int i = 0; i < retries; ++i) {
   // ensure the archive directory is present
   fs.mkdir(archiveDir);
   // (a race with the cleaner is still possible here)
   // try to archive the file
   success = fs.rename(originalPath/fileName, archiveDir/fileName);
   if (success) break;
 }
 {code}
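
 The retry loop above can be sketched as runnable Java. This is only an illustration under stated assumptions: it uses java.nio.file on a local file system rather than the Hadoop FileSystem API, and ArchiveRetrySketch and archiveWithRetries are hypothetical names, not the committed HBASE-7643 patch. The point it shows is that the source file is never deleted when the move fails.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

/** Hypothetical local-filesystem illustration of "mkdir + rename, with retries". */
final class ArchiveRetrySketch {

    /**
     * Tries to move file into archiveDir, recreating the directory before
     * each attempt. Returns true once the move succeeds, false if every
     * attempt failed. The source file is never deleted on failure.
     */
    static boolean archiveWithRetries(Path file, Path archiveDir, int retries)
            throws IOException {
        for (int i = 0; i < retries; ++i) {
            // ensure the archive directory is present
            Files.createDirectories(archiveDir);
            // a concurrent cleaner could still remove archiveDir right here
            try {
                Files.move(file, archiveDir.resolve(file.getFileName()));
                return true;
            } catch (NoSuchFileException e) {
                if (!Files.exists(file)) {
                    throw e; // the source itself is gone: retrying cannot help
                }
                // otherwise the target directory vanished: loop and recreate it
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("archive-sketch");
        Path hfile = Files.write(root.resolve("hfile"), new byte[] {1, 2, 3});
        Path archiveDir = root.resolve("archive").resolve("table")
                .resolve("region").resolve("family");
        boolean archived = archiveWithRetries(hfile, archiveDir, 3);
        System.out.println("archived=" + archived);
    }
}
```

 Retrying narrows the race window but does not close it, which is why the other options listed above (a lock, or letting the master do the removal) remain relevant.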



[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563644#comment-13563644
 ] 

Ted Yu commented on HBASE-7681:
---

A combined patch would be nice.

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94.txt, 7681-94-v1.txt, 7681-94-v2.txt, 
 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many nonspecific test failures recently that cannot be reproduced 
 locally, even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions about wall clock 
 time. While I cannot fix that, one option is to increase some of these timeouts a 
 bit.
 This issue is to remind me to do that.



[jira] [Commented] (HBASE-3996) Support multiple tables and scanners as input to the mapper in map/reduce jobs

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563645#comment-13563645
 ] 

Lars Hofhansl commented on HBASE-3996:
--

Seems this is good to go. [~saint@gmail.com], you had some concerns 
initially; have they all been addressed?

 Support multiple tables and scanners as input to the mapper in map/reduce jobs
 --

 Key: HBASE-3996
 URL: https://issues.apache.org/jira/browse/HBASE-3996
 Project: HBase
  Issue Type: Improvement
  Components: mapreduce
Reporter: Eran Kutner
Assignee: Bryan Baugher
Priority: Critical
 Fix For: 0.96.0, 0.94.5

 Attachments: 3996-v10.txt, 3996-v11.txt, 3996-v12.txt, 3996-v13.txt, 
 3996-v14.txt, 3996-v2.txt, 3996-v3.txt, 3996-v4.txt, 3996-v5.txt, 
 3996-v6.txt, 3996-v7.txt, 3996-v8.txt, 3996-v9.txt, HBase-3996.patch


 It seems that in many cases feeding data from multiple tables or multiple 
 scanners on a single table can save a lot of time when running map/reduce 
 jobs.
 I propose a new MultiTableInputFormat class that would allow doing this.



[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Lars Hofhansl (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-7681:
-

Attachment: 7681-0.94-combined.txt

Combined 0.94 patch
(also replace assertTrue with assertEquals, which tells us the expected value 
upon failure)
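
To illustrate why that replacement helps, here is a small sketch with hand-rolled stand-ins for the two assertions. The class AssertMessageSketch and its methods are hypothetical and only mimic the failure messages JUnit produces; they are not the JUnit API itself.

```java
/** Hypothetical stand-ins that mimic how the two assertion styles report failure. */
final class AssertMessageSketch {

    /** Like assertTrue(condition): a failure carries no detail message. */
    static void assertTrue(boolean condition) {
        if (!condition) {
            throw new AssertionError();
        }
    }

    /** Like assertEquals(expected, actual): a failure reports both values. */
    static void assertEquals(long expected, long actual) {
        if (expected != actual) {
            throw new AssertionError("expected:<" + expected + "> but was:<" + actual + ">");
        }
    }

    /** Runs a check and describes its outcome as text. */
    static String messageOf(Runnable check) {
        try {
            check.run();
            return "passed";
        } catch (AssertionError e) {
            String m = e.getMessage();
            return m == null ? "failed (no detail)" : "failed: " + m;
        }
    }

    public static void main(String[] args) {
        long counter = 2;
        // prints "failed (no detail)"
        System.out.println(messageOf(() -> assertTrue(counter == 3)));
        // prints "failed: expected:<3> but was:<2>"
        System.out.println(messageOf(() -> assertEquals(3, counter)));
    }
}
```

For a failure that only shows up on the build machines, the second form records the actual value in the test report, which helps when the run cannot be reproduced locally.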

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94.txt, 7681-94-v1.txt, 
 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt




[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Lars Hofhansl (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-7681:
-

Status: Open  (was: Patch Available)

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94.txt, 7681-94-v1.txt, 
 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt




[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563647#comment-13563647
 ] 

Ted Yu commented on HBASE-7681:
---

From 
https://builds.apache.org/job/HBase-0.94/781/testReport/org.apache.hadoop.hbase.regionserver/TestAtomicOperation/testRowMutationMultiThreads/:
{code}
2013-01-26 19:53:55,720 INFO  [Thread-114] regionserver.Store(1129): Completed major compaction of 4 file(s) in colfamily11 of testtable,,1359230031766.0e4657859f8ff751b545ee6dfd92446e. into 29f808fd896d4b78ae69c40e36ab2f1e, size=731.0; total size for store is 731.0
2013-01-26 19:53:55,721 DEBUG [Thread-109] regionserver.TestAtomicOperation$1(307): keyvalues={rowA/colfamily11:qual1/3830/Put/vlen=6/ts=4062, rowA/colfamily11:qual2/3838/Put/vlen=6/ts=0}
Exception in thread Thread-109 junit.framework.AssertionFailedError
    at junit.framework.Assert.fail(Assert.java:48)
    at junit.framework.Assert.fail(Assert.java:56)
    at org.apache.hadoop.hbase.regionserver.TestAtomicOperation$1.run(TestAtomicOperation.java:309)
{code}
This was due to more than one column being visible:
{code}
  // check: should always see exactly one column
  Get g = new Get(row);
  Result r = region.get(g, null);
  if (r.size() != 1) {
    LOG.debug(r);
    failures.incrementAndGet();
    fail();
  }
{code}

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94.txt, 7681-94-v1.txt, 
 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt




[jira] [Updated] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss

2013-01-26 Thread Matteo Bertozzi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matteo Bertozzi updated HBASE-7643:
---

  Resolution: Fixed
Release Note: committed to trunk and 0.94, thanks guys for the review!
  Status: Resolved  (was: Patch Available)

 HFileArchiver.resolveAndArchive() race condition may lead to snapshot data 
 loss
 ---

 Key: HBASE-7643
 URL: https://issues.apache.org/jira/browse/HBASE-7643
 Project: HBase
  Issue Type: Bug
Affects Versions: hbase-6055, 0.96.0
Reporter: Matteo Bertozzi
Assignee: Matteo Bertozzi
Priority: Blocker
 Fix For: 0.96.0, 0.94.5

 Attachments: HBASE-7653-p4-v0.patch, HBASE-7653-p4-v1.patch, 
 HBASE-7653-p4-v2.patch, HBASE-7653-p4-v3.patch, HBASE-7653-p4-v4.patch, 
 HBASE-7653-p4-v5.patch, HBASE-7653-p4-v6.patch, HBASE-7653-p4-v7.patch




[jira] [Updated] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss

2013-01-26 Thread Matteo Bertozzi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matteo Bertozzi updated HBASE-7643:
---

Release Note:   (was: committed to trunk and 0.94, thanks guys for the 
review!)

 HFileArchiver.resolveAndArchive() race condition may lead to snapshot data 
 loss
 ---

 Key: HBASE-7643
 URL: https://issues.apache.org/jira/browse/HBASE-7643
 Project: HBase
  Issue Type: Bug
Affects Versions: hbase-6055, 0.96.0
Reporter: Matteo Bertozzi
Assignee: Matteo Bertozzi
Priority: Blocker
 Fix For: 0.96.0, 0.94.5

 Attachments: HBASE-7653-p4-v0.patch, HBASE-7653-p4-v1.patch, 
 HBASE-7653-p4-v2.patch, HBASE-7653-p4-v3.patch, HBASE-7653-p4-v4.patch, 
 HBASE-7653-p4-v5.patch, HBASE-7653-p4-v6.patch, HBASE-7653-p4-v7.patch




[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss

2013-01-26 Thread Matteo Bertozzi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563648#comment-13563648
 ] 

Matteo Bertozzi commented on HBASE-7643:


committed to trunk and 0.94, thanks guys for the review!

 HFileArchiver.resolveAndArchive() race condition may lead to snapshot data 
 loss
 ---

 Key: HBASE-7643
 URL: https://issues.apache.org/jira/browse/HBASE-7643
 Project: HBase
  Issue Type: Bug
Affects Versions: hbase-6055, 0.96.0
Reporter: Matteo Bertozzi
Assignee: Matteo Bertozzi
Priority: Blocker
 Fix For: 0.96.0, 0.94.5

 Attachments: HBASE-7653-p4-v0.patch, HBASE-7653-p4-v1.patch, 
 HBASE-7653-p4-v2.patch, HBASE-7653-p4-v3.patch, HBASE-7653-p4-v4.patch, 
 HBASE-7653-p4-v5.patch, HBASE-7653-p4-v6.patch, HBASE-7653-p4-v7.patch




[jira] [Updated] (HBASE-7682) Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()

2013-01-26 Thread Matteo Bertozzi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matteo Bertozzi updated HBASE-7682:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

committed to trunk since it's a straightforward patch

 Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()
 ---

 Key: HBASE-7682
 URL: https://issues.apache.org/jira/browse/HBASE-7682
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.96.0
Reporter: Matteo Bertozzi
Assignee: Matteo Bertozzi
Priority: Trivial
 Fix For: 0.96.0

 Attachments: HBASE-7682-v0.patch


 We can remove some code from HRegion by using the new 
 FSUtils.logFileSystemState() instead of the custom HRegion.listPaths()



[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Lars Hofhansl (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-7681:
-

Attachment: 7681-0.94-combined_v2.txt

Slightly better 0.94 patch (fixes a blank line, and changes a timeout I had 
forgotten from 1000 to 1500)

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 7681-94-v2.txt, 
 7681-94-v3.txt, 7681-trunk-v1.txt




[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Lars Hofhansl (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-7681:
-

Attachment: 7681-0.96-combined.txt

And the 0.96 version

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 7681-94-v2.txt, 
 7681-94-v3.txt, 7681-trunk-v1.txt




[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Lars Hofhansl (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-7681:
-

Status: Patch Available  (was: Open)

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 7681-94-v2.txt, 
 7681-94-v3.txt, 7681-trunk-v1.txt




[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563655#comment-13563655
 ] 

Lars Hofhansl commented on HBASE-7681:
--

Yeah, that one (TestAtomicOperation) has me worried a bit. That one has not 
failed in a *very* long time. And now we've had two failures within a few days.

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 7681-94-v2.txt, 
 7681-94-v3.txt, 7681-trunk-v1.txt




[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563656#comment-13563656
 ] 

Lars Hofhansl commented on HBASE-7681:
--

The only recent change that could affect that is HBASE-7507.

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 7681-94-v2.txt, 
 7681-94-v3.txt, 7681-trunk-v1.txt




[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Lars Hofhansl (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-7681:
-

Attachment: 7681-0.96-combined.txt

Hadoop QA is not starting. Attaching again

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 
 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt




[jira] [Commented] (HBASE-7660) Remove HFileV1 code

2013-01-26 Thread Matt Corgan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563659#comment-13563659
 ] 

Matt Corgan commented on HBASE-7660:


I think for this jira we'd only want to remove HFileReaderV1, HFileWriterV1, 
and FSReaderV1.  The main goal being to forcibly drop support for them before 
branching 0.96.  It would leave only one implementation of the interfaces and 
abstract classes.

Later, we can go back and look at the purpose of the interfaces and abstract 
classes.  I'm guessing they are partly designed specifically as a shim layer 
between v1/v2.  A shim layer between v2 and future unknown v3 may have 
completely different needs.

Getting out of the scope of this jira - we may want to flatten the interface, 
abstractClass, and implementation to make it easier to clean up the Store code, 
and then pull out a different interface in the future when we actually have a 
use case for it.

 Remove HFileV1 code
 ---

 Key: HBASE-7660
 URL: https://issues.apache.org/jira/browse/HBASE-7660
 Project: HBase
  Issue Type: Improvement
  Components: hbck, HFile, migration
Reporter: Matt Corgan
 Fix For: 0.96.0


 HFileV1 should be removed from the regionserver because it is somewhat of a 
 drag on development for working on the lower level read paths.  It's an 
 impediment to cleaning up the Store code.
 V1 HFiles ceased to be written in 0.92, but the V1 reader was left in place 
 so users could upgrade from 0.90 to 0.92.  Once all HFiles are compacted in 
 0.92, then the V1 code is no longer needed.  We then decided to leave the V1 
 code in place in 0.94 so users could upgrade directly from 0.90 to 0.94.  The 
 code is still there in trunk but should probably be shown the door.  I see a 
 few options:
 1) just delete the code and tell people to make sure they compact everything 
 using 0.92 or 0.94
 2) create a standalone script that people can run on their 0.92 or 0.94 
 cluster that iterates the filesystem and prints out any v1 files with a 
 message that the user should run a major compaction
 3) add functionality to 0.96.0 (first release, maybe in hbck) that 
 proactively kills v1 files, so that we can be sure there are none when 
 upgrading from 0.96 to 0.98
 4) punt to 0.98 and probably do one of the above options in a year
 I would vote for #1 or #2 which will allow us to have a v1-free 0.96.0.  
 HFileV1 has already survived 2 major release upgrades, which I think many 
 would agree is more than enough for a pre-1.0, free product.  If we can 
 remove it in 0.96.0 it will be out of the way to introduce some nice 
 performance improvements in subsequent 0.96.x releases.



[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563660#comment-13563660
 ] 

Ted Yu commented on HBASE-7681:
---

Combined patch looks good.
{code}
-  waitForCounter(SplitLogCounters.tot_wkr_preempt_task, 0, 1, 1000);
+  waitForCounter(SplitLogCounters.tot_wkr_preempt_task, 0, 1, 1500);

-  waitForCounter(SplitLogCounters.tot_wkr_task_acquired, 1, 2, 1000);
+  waitForCounter(SplitLogCounters.tot_wkr_task_acquired, 1, 2, 1500);
{code}
Next time we touch TestSplitLogWorker.java, maybe introduce a constant for the 
timeout so that we don't need to change it in 12 places.
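
Something along these lines is what I have in mind. This is a self-contained 
sketch, not the actual TestSplitLogWorker code: WAIT_TIME_MS is a hypothetical 
name, and waitForCounter here is a simplified stand-in that only mirrors the 
call shape from the diff above.

```java
import java.util.concurrent.atomic.AtomicLong;

public class WaitForCounterSketch {

  // The single place to tune the timeout; every call site refers to it.
  static final long WAIT_TIME_MS = 1500;

  // Polls ctr until it moves from oldVal to newVal or timeoutMs elapses.
  // Returns false on timeout, on interruption, or if an unexpected value shows up.
  static boolean waitForCounter(AtomicLong ctr, long oldVal, long newVal, long timeoutMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (System.nanoTime() < deadline) {
      long current = ctr.get();
      if (current == newVal) {
        return true;
      }
      if (current != oldVal) {
        return false;
      }
      try {
        Thread.sleep(10);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return ctr.get() == newVal;
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicLong preempted = new AtomicLong(0);
    AtomicLong acquired = new AtomicLong(1);
    // Simulated worker that bumps both counters once.
    Thread worker = new Thread(() -> {
      preempted.incrementAndGet();
      acquired.incrementAndGet();
    });
    worker.start();
    System.out.println(waitForCounter(preempted, 0, 1, WAIT_TIME_MS));
    System.out.println(waitForCounter(acquired, 1, 2, WAIT_TIME_MS));
    worker.join();
  }
}
```

With that in place, the next timeout bump is a one-line change instead of an 
edit at every call site.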

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 
 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally, even when running these tests in a loop for a very long time.
 Many of these tests make assumptions, one way or another, about wall-clock 
 time. While I cannot fix that, one option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Updated] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-7681:
--

Fix Version/s: 0.96.0
 Hadoop Flags: Reviewed

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 
 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt




[jira] [Commented] (HBASE-7660) Remove HFileV1 code

2013-01-26 Thread Matteo Bertozzi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563663#comment-13563663
 ] 

Matteo Bertozzi commented on HBASE-7660:


I agree that the old code limits how the code base can evolve.
Is there a way to move the code out without too much work, 
and put it in something like hbase-support/hfile-v1-to-v2-tool that can be used 
as a migration tool?

 Remove HFileV1 code
 ---

 Key: HBASE-7660
 URL: https://issues.apache.org/jira/browse/HBASE-7660
 Project: HBase
  Issue Type: Improvement
  Components: hbck, HFile, migration
Reporter: Matt Corgan
 Fix For: 0.96.0




[jira] [Commented] (HBASE-7660) Remove HFileV1 code

2013-01-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563666#comment-13563666
 ] 

Ted Yu commented on HBASE-7660:
---

Looking at HFile.Reader, most methods don't have javadoc, leaving 
interpretation of the interface to each developer.

In the process of creating the Cell interface, we have avoided that mistake.

bq. put it in something like hbase-support/hfile-v1-to-v2-tool that can be used 
as migration tool?
Some users are using 0.94.x and, to my knowledge, there has been no request 
for a migration tool.

 Remove HFileV1 code
 ---

 Key: HBASE-7660
 URL: https://issues.apache.org/jira/browse/HBASE-7660
 Project: HBase
  Issue Type: Improvement
  Components: hbck, HFile, migration
Reporter: Matt Corgan
 Fix For: 0.96.0




[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563668#comment-13563668
 ] 

Ted Yu commented on HBASE-7681:
---

Looking at 
https://issues.apache.org/jira/secure/attachment/12565724/7507-94.patch :
{code}
+} catch (Exception e) {
+  LOG.warn("Failed validating store file " + pathName
+  + ", retring num=" + i, e);
+  if (e instanceof IOException) {
+lastException = (IOException) e;
+  } else {
+lastException = new IOException(e);
+  }
+}
+  } catch (IOException e) {
+LOG.warn("Failed flushing store file, retring num=" + i, e);
{code}
I searched for the two strings which start with 'Failed ' in 
https://builds.apache.org/job/HBase-0.94/781/testReport/org.apache.hadoop.hbase.regionserver/TestAtomicOperation/testRowMutationMultiThreads/
 but didn't find any occurrence.

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 
 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt




[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563669#comment-13563669
 ] 

Ted Yu commented on HBASE-7681:
---

TestAtomicOperation failed in trunk build #3806

Strangely TestAtomicOperation didn't seem to fail on QA machines.

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 
 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt




[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563671#comment-13563671
 ] 

Lars Hofhansl commented on HBASE-7681:
--

I'm looping TestAtomicOperation on my machine right now. After 45 minutes, no 
failure yet.
I too looked for these messages and didn't see them, but that patch changes the 
validation around (and I've seen extremely subtle failure conditions in this 
code before).


 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 
 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt




[jira] [Commented] (HBASE-7682) Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()

2013-01-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563674#comment-13563674
 ] 

Hudson commented on HBASE-7682:
---

Integrated in HBase-TRUNK #3806 (See 
[https://builds.apache.org/job/HBase-TRUNK/3806/])
HBASE-7682 Replace HRegion custom File-System debug, with 
FSUtils.logFileSystemState() (Revision 1438974)

 Result = FAILURE
mbertozzi : 
Files : 
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java


 Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()
 ---

 Key: HBASE-7682
 URL: https://issues.apache.org/jira/browse/HBASE-7682
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.96.0
Reporter: Matteo Bertozzi
Assignee: Matteo Bertozzi
Priority: Trivial
 Fix For: 0.96.0

 Attachments: HBASE-7682-v0.patch


 We can remove some code from HRegion by using the new 
 FSUtils.logFileSystemState() instead of the custom HRegion.listPaths()



[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss

2013-01-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563675#comment-13563675
 ] 

Hudson commented on HBASE-7643:
---

Integrated in HBase-TRUNK #3806 (See 
[https://builds.apache.org/job/HBase-TRUNK/3806/])
HBASE-7643 HFileArchiver.resolveAndArchive() race condition may lead to 
snapshot data loss (Revision 1438972)

 Result = FAILURE
mbertozzi : 
Files : 
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/backup/HFileArchiver.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/backup/TestHFileArchiving.java


 HFileArchiver.resolveAndArchive() race condition may lead to snapshot data 
 loss
 ---

 Key: HBASE-7643
 URL: https://issues.apache.org/jira/browse/HBASE-7643
 Project: HBase
  Issue Type: Bug
Affects Versions: hbase-6055, 0.96.0
Reporter: Matteo Bertozzi
Assignee: Matteo Bertozzi
Priority: Blocker
 Fix For: 0.96.0, 0.94.5

 Attachments: HBASE-7653-p4-v0.patch, HBASE-7653-p4-v1.patch, 
 HBASE-7653-p4-v2.patch, HBASE-7653-p4-v3.patch, HBASE-7653-p4-v4.patch, 
 HBASE-7653-p4-v5.patch, HBASE-7653-p4-v6.patch, HBASE-7653-p4-v7.patch


  * The master has an hfile cleaner thread (responsible for cleaning 
 the /hbase/.archive dir)
  ** /hbase/.archive/table/region/family/hfile
  ** if the family/region/family directory is empty the cleaner removes it
  * The master can archive files (from another thread, e.g. DeleteTableHandler)
  * The region can archive files (from another server/process, e.g. compaction)
 The simplified file archiving code looks like this:
 {code}
 HFileArchiver.resolveAndArchive(...) {
   // ensure that the archive dir exists
   fs.mkdir(archiveDir);
   // move the file to the archiver
   success = fs.rename(originalPath/fileName, archiveDir/fileName)
   // if the rename is failed, delete the file without archiving
   if (!success) fs.delete(originalPath/fileName);
 }
 {code}
 Since there's no synchronization between HFileArchiver.resolveAndArchive() 
 and the cleaner run (different process, thread, ...) you can end up in a 
 situation where you are moving something into a directory that doesn't exist.
 {code}
 fs.mkdir(archiveDir);
 // HFileCleaner chore starts at this point
 // and the archiveDirectory that we just ensured to be present gets removed.
 // The rename at this point will fail since the parent directory is missing.
 success = fs.rename(originalPath/fileName, archiveDir/fileName)
 {code}
 The bad thing about deleting the file without archiving it is that you lose 
 data if a snapshot or a cloned table relies on that file being present.
 Possible solutions
  * Create a ZooKeeper lock, to notify the master ("Hey, I'm archiving 
 something, wait a bit")
  * Add a RS -> Master call to let the master remove files and avoid this 
 kind of situation
  * Avoid removing empty directories from the archive if the table exists or 
 is not disabled
  * Add a try catch around the fs.rename
 The last one, the easiest one, looks like:
 {code}
 for (int i = 0; i < retries; ++i) {
   // ensure archive directory to be present
   fs.mkdir(archiveDir);
   //  possible race -
   // try to archive file
   success = fs.rename(originalPath/fileName, archiveDir/fileName);
   if (success) break;
 }
 {code}
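
 For illustration, here is a runnable version of that retry loop. It is a 
 sketch under assumptions, not the patch itself: it uses java.nio.file on a 
 local filesystem instead of the Hadoop FileSystem API, and the names 
 ArchiveRetrySketch and archive are invented for the example. The point it 
 shows is that a failed move is retried and the original file is never 
 deleted as a fallback.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

public class ArchiveRetrySketch {

  // Tries to move source into archiveDir, recreating archiveDir on each attempt.
  // Returns true once the move succeeds; never deletes the source on failure.
  static boolean archive(Path source, Path archiveDir, int retries) {
    for (int i = 0; i < retries; ++i) {
      try {
        // Ensure the archive directory is present.
        Files.createDirectories(archiveDir);
        // Possible race: a cleaner may remove archiveDir right here.
        Files.move(source, archiveDir.resolve(source.getFileName()));
        return true;
      } catch (NoSuchFileException e) {
        // Either the archive directory vanished (retry) or the source is gone (give up).
        if (!Files.exists(source)) {
          return false;
        }
      } catch (IOException e) {
        // Any other failure: leave the source in place and report it.
        return false;
      }
    }
    return false;
  }
}
```

 A caller that gets false back still has the file at its original location, 
 so a snapshot or cloned table that references it keeps working.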



[jira] [Commented] (HBASE-7660) Remove HFileV1 code

2013-01-26 Thread Matt Corgan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563676#comment-13563676
 ] 

Matt Corgan commented on HBASE-7660:


Should have mentioned in the description: One could probably reason that if 
someone is using 0.90.x right now, they are looking for max stability and will 
not jump straight to 0.96, but rather go to 0.94.5+ first.

Regarding distributing the script: We could bundle the HFileV1-scanner script 
in future 0.94 versions as well as make it downloadable to hbase/bin from the 
hbase website for people on 0.92 and earlier 0.94 versions.

{quote}Is there a way to move the code out without too much work, and put it in 
something like hbase-support/hfile-v1-to-v2-tool that can be used as migration 
tool?{quote}Sounds like a good idea, though I wonder if anyone should spend 
time creating it before anyone actually requests it.

 Remove HFileV1 code
 ---

 Key: HBASE-7660
 URL: https://issues.apache.org/jira/browse/HBASE-7660
 Project: HBase
  Issue Type: Improvement
  Components: hbck, HFile, migration
Reporter: Matt Corgan
 Fix For: 0.96.0




[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563678#comment-13563678
 ] 

Hadoop QA commented on HBASE-7681:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12566648/7681-0.96-combined.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 27 new 
or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces lines longer than 
100

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.master.TestZKBasedOpenCloseRegion

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4201//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4201//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4201//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4201//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4201//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4201//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4201//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4201//console

This message is automatically generated.

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 
 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt




[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563681#comment-13563681
 ] 

Lars Hofhansl commented on HBASE-7681:
--

Test failure's unrelated. Good to commit, Ted?

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 
 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt




[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563682#comment-13563682
 ] 

Lars Hofhansl commented on HBASE-7681:
--

TestAtomicOperation still looping, btw.

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 
 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt




[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss

2013-01-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563683#comment-13563683
 ] 

Hudson commented on HBASE-7643:
---

Integrated in HBase-0.94 #782 (See 
[https://builds.apache.org/job/HBase-0.94/782/])
HBASE-7643 HFileArchiver.resolveAndArchive() race condition may lead to 
snapshot data loss (Revision 1438973)

 Result = FAILURE
mbertozzi : 
Files : 
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/backup/HFileArchiver.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/backup/TestHFileArchiving.java


 HFileArchiver.resolveAndArchive() race condition may lead to snapshot data 
 loss
 ---

 Key: HBASE-7643
 URL: https://issues.apache.org/jira/browse/HBASE-7643
 Project: HBase
  Issue Type: Bug
Affects Versions: hbase-6055, 0.96.0
Reporter: Matteo Bertozzi
Assignee: Matteo Bertozzi
Priority: Blocker
 Fix For: 0.96.0, 0.94.5

 Attachments: HBASE-7653-p4-v0.patch, HBASE-7653-p4-v1.patch, 
 HBASE-7653-p4-v2.patch, HBASE-7653-p4-v3.patch, HBASE-7653-p4-v4.patch, 
 HBASE-7653-p4-v5.patch, HBASE-7653-p4-v6.patch, HBASE-7653-p4-v7.patch




[jira] [Commented] (HBASE-7681) Investigate recent random test failures in 0.94

2013-01-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563686#comment-13563686
 ] 

Ted Yu commented on HBASE-7681:
---

Go ahead, Lars. 

 Investigate recent random test failures in 0.94
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 
 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt




[jira] [Commented] (HBASE-7682) Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()

2013-01-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563690#comment-13563690
 ] 

Hudson commented on HBASE-7682:
---

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #377 (See 
[https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/377/])
HBASE-7682 Replace HRegion custom File-System debug, with 
FSUtils.logFileSystemState() (Revision 1438974)

 Result = FAILURE
mbertozzi : 
Files : 
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java


 Replace HRegion custom File-System debug, with FSUtils.logFileSystemState()
 ---

 Key: HBASE-7682
 URL: https://issues.apache.org/jira/browse/HBASE-7682
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.96.0
Reporter: Matteo Bertozzi
Assignee: Matteo Bertozzi
Priority: Trivial
 Fix For: 0.96.0

 Attachments: HBASE-7682-v0.patch


 We can remove some code from HRegion by using the new 
 FSUtils.logFileSystemState() instead of the custom HRegion.listPaths()



[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss

2013-01-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563691#comment-13563691
 ] 

Hudson commented on HBASE-7643:
---

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #377 (See 
[https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/377/])
HBASE-7643 HFileArchiver.resolveAndArchive() race condition may lead to 
snapshot data loss (Revision 1438972)

 Result = FAILURE
mbertozzi : 
Files : 
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/backup/HFileArchiver.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/backup/TestHFileArchiving.java


 HFileArchiver.resolveAndArchive() race condition may lead to snapshot data 
 loss
 ---

 Key: HBASE-7643
 URL: https://issues.apache.org/jira/browse/HBASE-7643
 Project: HBase
  Issue Type: Bug
Affects Versions: hbase-6055, 0.96.0
Reporter: Matteo Bertozzi
Assignee: Matteo Bertozzi
Priority: Blocker
 Fix For: 0.96.0, 0.94.5

 Attachments: HBASE-7653-p4-v0.patch, HBASE-7653-p4-v1.patch, 
 HBASE-7653-p4-v2.patch, HBASE-7653-p4-v3.patch, HBASE-7653-p4-v4.patch, 
 HBASE-7653-p4-v5.patch, HBASE-7653-p4-v6.patch, HBASE-7653-p4-v7.patch


  * The master has an hfile cleaner thread (responsible for cleaning 
 the /hbase/.archive dir)
  ** /hbase/.archive/table/region/family/hfile
  ** if the table/region/family directory is empty, the cleaner removes it
  * The master can archive files (from another thread, e.g. DeleteTableHandler)
  * The region can archive files (from another server/process, e.g. compaction)
 The simplified file archiving code looks like this:
 {code}
 HFileArchiver.resolveAndArchive(...) {
   // ensure that the archive dir exists
   fs.mkdir(archiveDir);
   // move the file to the archive
   success = fs.rename(originalPath/fileName, archiveDir/fileName);
   // if the rename failed, delete the file without archiving
   if (!success) fs.delete(originalPath/fileName);
 }
 {code}
 Since there's no synchronization between HFileArchiver.resolveAndArchive() 
 and the cleaner run (different process, thread, ...), you can end up 
 moving a file into a directory that no longer exists.
 {code}
 fs.mkdir(archiveDir);
 // The HFileCleaner chore starts at this point
 // and the archive directory that we just ensured to be present gets removed.
 // The rename at this point will fail since the parent directory is missing.
 success = fs.rename(originalPath/fileName, archiveDir/fileName);
 {code}
 The problem with deleting the file without archiving it is that if a 
 snapshot or a cloned table relies on that file being present, you lose 
 data.
 Possible solutions
  * Create a ZooKeeper lock, to notify the master (Hey I'm archiving 
 something, wait a bit)
  * Add an RS -> Master call to let the master remove the files and avoid this 
 kind of situation
  * Avoid removing empty directories from the archive if the table exists or 
 is not disabled
  * Add a try/catch with retries around the fs.rename
 The last one, the easiest, looks like:
 {code}
 for (int i = 0; i < retries; ++i) {
   // ensure the archive directory is present
   fs.mkdir(archiveDir);
   // <- possible race ->
   // try to archive the file
   success = fs.rename(originalPath/fileName, archiveDir/fileName);
   if (success) break;
 }
 {code}
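
 The retry idea above can be sketched as self-contained Java. This is an illustration only: it uses java.nio.file on a local disk rather than the Hadoop FileSystem API, and the class name ArchiveRetrySketch and the method archive are made up for the example. The point it shows is that a failed move is retried after recreating the directory, and that the source file is never deleted when all attempts fail.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustration only: java.nio.file on a local disk, not the Hadoop FileSystem API. */
final class ArchiveRetrySketch {
  private ArchiveRetrySketch() {}

  /**
   * Tries to move {@code source} into {@code archiveDir}, recreating the directory
   * before each attempt. Returns false if every attempt failed; the source file is
   * left in place in that case rather than deleted.
   */
  static boolean archive(Path source, Path archiveDir, int retries) {
    for (int i = 0; i < retries; ++i) {
      try {
        // Ensure the archive directory exists (a cleaner may have removed it).
        Files.createDirectories(archiveDir);
        // A concurrent cleaner can still delete archiveDir right here.
        Files.move(source, archiveDir.resolve(source.getFileName()));
        return true;
      } catch (IOException e) {
        // Possibly the parent vanished between the two calls; try again.
      }
    }
    return false;
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempDirectory("archive-sketch");
    Path hfile = Files.createFile(tmp.resolve("hfile"));
    Path archiveDir = tmp.resolve("archive").resolve("table").resolve("region").resolve("family");
    System.out.println("archived: " + archive(hfile, archiveDir, 3));
  }
}
```

 Retrying narrows the race window but does not close it; the caller still has to decide what to do when archive() returns false, and falling back to a delete would bring the data loss back.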



[jira] [Updated] (HBASE-7681) Investigate recent random test failures

2013-01-26 Thread Lars Hofhansl (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-7681:
-

Summary: Investigate recent random test failures  (was: Investigate recent 
random test failures in 0.94)

 Investigate recent random test failures
 ---

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 
 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally, even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, one option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Updated] (HBASE-7681) Address some recent random test failures

2013-01-26 Thread Lars Hofhansl (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-7681:
-

Summary: Address some recent random test failures  (was: Investigate recent 
random test failures)

 Address some recent random test failures
 

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 
 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally, even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, one option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Commented] (HBASE-7681) Address some recent random test failures

2013-01-26 Thread Himanshu Vashishtha (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563693#comment-13563693
 ] 

Himanshu Vashishtha commented on HBASE-7681:


Since we are talking about test case health here: I was looking at 
TestRegionServerMetrics a while ago. It opens a bunch of HTable instances in 
a number of test methods and doesn't close them. Is it possible to fold that 
in here too, or should it be a separate jira? It would be a trivial fix.
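
A minimal sketch of the kind of fix meant here, assuming Java 7+ try-with-resources. FakeTable is a hypothetical stand-in that only counts open handles; it is not HTable, and the real test would close its actual table instances in the same way (or in a finally block).

```java
import java.util.concurrent.atomic.AtomicInteger;

final class CloseTablesSketch {
  private CloseTablesSketch() {}

  /** Mimics a test method body: the handle is closed even if the body throws. */
  static void testMethodBody() {
    try (FakeTable table = new FakeTable()) {
      table.put("row1");
    }
  }

  public static void main(String[] args) {
    for (int i = 0; i < 5; i++) {
      testMethodBody();
    }
    System.out.println("open handles after tests: " + FakeTable.OPEN.get());
  }
}

/** Hypothetical stand-in for a table handle; it only tracks how many are open. */
final class FakeTable implements AutoCloseable {
  static final AtomicInteger OPEN = new AtomicInteger();

  FakeTable() {
    OPEN.incrementAndGet();
  }

  void put(String row) {
    // A real table would write the row; nothing to do in this sketch.
  }

  @Override
  public void close() {
    OPEN.decrementAndGet();
  }
}
```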

 Address some recent random test failures
 

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 
 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally, even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, one option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Commented] (HBASE-7681) Address some recent random test failures

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563694#comment-13563694
 ] 

Lars Hofhansl commented on HBASE-7681:
--

Hrmm... Just committed. Let's do an addendum here.


 Address some recent random test failures
 

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 
 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally, even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, one option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Updated] (HBASE-7681) Address some recent random test failures

2013-01-26 Thread Lars Hofhansl (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-7681:
-

Status: Open  (was: Patch Available)

 Address some recent random test failures
 

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-combined.txt, 7681-0.94-combined_v2.txt, 
 7681-0.94.txt, 7681-0.96-combined.txt, 7681-0.96-combined.txt, 
 7681-94-v1.txt, 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally, even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, one option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Updated] (HBASE-7681) Address some recent random test failures

2013-01-26 Thread Lars Hofhansl (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-7681:
-

Attachment: 7681-0.96-addendum.txt
7681-0.94-addendum.txt

Something like this Himanshu?

 Address some recent random test failures
 

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-addendum.txt, 7681-0.94-combined.txt, 
 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-addendum.txt, 
 7681-0.96-combined.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 
 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally, even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, one option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Commented] (HBASE-7660) Remove HFileV1 code

2013-01-26 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563697#comment-13563697
 ] 

Andrew Purtell commented on HBASE-7660:
---

bq. I think for this jira we'd only want to remove HFileReaderV1, 
HFileWriterV1, and FSReaderV1. The main goal being to forcibly drop support for 
them before branching 0.96.

This sounds like a good idea. I was worried about more substantial changes 
because I am currently working in HFile V2. That said, with V1 gone it would be 
a good thing that we can simplify and remove code in this area over time. 

bq. hbase-support/hfile-v1-to-v2-tool that can be used as migration tool

This seems to work against the goal of removing the old code. I think 
anyone upgrading to 0.96 would be doing it from 0.92 or 0.94, which 
auto-upgrades HFiles. We could document in release notes that before upgrading 
the user should run a major compaction.

 Remove HFileV1 code
 ---

 Key: HBASE-7660
 URL: https://issues.apache.org/jira/browse/HBASE-7660
 Project: HBase
  Issue Type: Improvement
  Components: hbck, HFile, migration
Reporter: Matt Corgan
 Fix For: 0.96.0


 HFileV1 should be removed from the regionserver because it is somewhat of a 
 drag on development for working on the lower level read paths.  It's an 
 impediment to cleaning up the Store code.
 V1 HFiles ceased to be written in 0.92, but the V1 reader was left in place 
 so users could upgrade from 0.90 to 0.92.  Once all HFiles are compacted in 
 0.92, then the V1 code is no longer needed.  We then decided to leave the V1 
 code in place in 0.94 so users could upgrade directly from 0.90 to 0.94.  The 
 code is still there in trunk but should probably be shown the door.  I see a 
 few options:
 1) just delete the code and tell people to make sure they compact everything 
 using 0.92 or 0.94
 2) create a standalone script that people can run on their 0.92 or 0.94 
 cluster that iterates the filesystem and prints out any v1 files with a 
 message that the user should run a major compaction
 3) add functionality to 0.96.0 (first release, maybe in hbck) that 
 proactively kills v1 files, so that we can be sure there are none when 
 upgrading from 0.96 to 0.98
 4) punt to 0.98 and probably do one of the above options in a year
 I would vote for #1 or #2 which will allow us to have a v1-free 0.96.0.  
 HFileV1 has already survived 2 major release upgrades, which I think many 
 would agree is more than enough for a pre-1.0, free product.  If we can 
 remove it in 0.96.0 it will be out of the way to introduce some nice 
 performance improvements in subsequent 0.96.x releases.
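
 Option 2 could look roughly like the sketch below. It rests on assumptions that should be checked against FixedFileTrailer before relying on it: that an HFile ends with a 4-byte big-endian version field, and that the low three bytes of that field carry the major version. It also walks a local directory with java.nio.file instead of HDFS and treats every regular file as an HFile, which a real script would not do. The names HFileV1ScanSketch, majorVersion, isV1 and findV1Files are made up for the example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical scanner: lists files whose trailing version field says "major version 1". */
final class HFileV1ScanSketch {
  private HFileV1ScanSketch() {}

  /** Assumption: the file's last 4 bytes are a big-endian int; low 3 bytes = major version. */
  static int majorVersion(Path file) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
      if (raf.length() < 4) {
        return -1; // too short to carry a version field
      }
      raf.seek(raf.length() - 4);
      return raf.readInt() & 0x00ffffff; // readInt() reads big-endian
    }
  }

  static boolean isV1(Path file) {
    try {
      return majorVersion(file) == 1;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Walks {@code root} recursively and returns the files that look like v1 files. */
  static List<Path> findV1Files(Path root) throws IOException {
    try (Stream<Path> paths = Files.walk(root)) {
      return paths.filter(Files::isRegularFile)
          .filter(HFileV1ScanSketch::isV1)
          .sorted()
          .collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    Path root = Paths.get(args.length > 0 ? args[0] : ".");
    for (Path p : findV1Files(root)) {
      System.out.println("v1 file found, run a major compaction: " + p);
    }
  }
}
```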



[jira] [Commented] (HBASE-7681) Address some recent random test failures

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563698#comment-13563698
 ] 

Lars Hofhansl commented on HBASE-7681:
--

[~v.himanshu]? :)

 Address some recent random test failures
 

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-addendum.txt, 7681-0.94-combined.txt, 
 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-addendum.txt, 
 7681-0.96-combined.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 
 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally, even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, one option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Updated] (HBASE-7660) Remove HFileV1 code

2013-01-26 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-7660:
--

Attachment: 7660-v1.txt

 Remove HFileV1 code
 ---

 Key: HBASE-7660
 URL: https://issues.apache.org/jira/browse/HBASE-7660
 Project: HBase
  Issue Type: Improvement
  Components: hbck, HFile, migration
Reporter: Matt Corgan
 Fix For: 0.96.0

 Attachments: 7660-v1.txt


 HFileV1 should be removed from the regionserver because it is somewhat of a 
 drag on development for working on the lower level read paths.  It's an 
 impediment to cleaning up the Store code.
 V1 HFiles ceased to be written in 0.92, but the V1 reader was left in place 
 so users could upgrade from 0.90 to 0.92.  Once all HFiles are compacted in 
 0.92, then the V1 code is no longer needed.  We then decided to leave the V1 
 code in place in 0.94 so users could upgrade directly from 0.90 to 0.94.  The 
 code is still there in trunk but should probably be shown the door.  I see a 
 few options:
 1) just delete the code and tell people to make sure they compact everything 
 using 0.92 or 0.94
 2) create a standalone script that people can run on their 0.92 or 0.94 
 cluster that iterates the filesystem and prints out any v1 files with a 
 message that the user should run a major compaction
 3) add functionality to 0.96.0 (first release, maybe in hbck) that 
 proactively kills v1 files, so that we can be sure there are none when 
 upgrading from 0.96 to 0.98
 4) punt to 0.98 and probably do one of the above options in a year
 I would vote for #1 or #2 which will allow us to have a v1-free 0.96.0.  
 HFileV1 has already survived 2 major release upgrades, which I think many 
 would agree is more than enough for a pre-1.0, free product.  If we can 
 remove it in 0.96.0 it will be out of the way to introduce some nice 
 performance improvements in subsequent 0.96.x releases.



[jira] [Assigned] (HBASE-7660) Remove HFileV1 code

2013-01-26 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu reassigned HBASE-7660:
-

Assignee: Ted Yu

 Remove HFileV1 code
 ---

 Key: HBASE-7660
 URL: https://issues.apache.org/jira/browse/HBASE-7660
 Project: HBase
  Issue Type: Improvement
  Components: hbck, HFile, migration
Reporter: Matt Corgan
Assignee: Ted Yu
 Fix For: 0.96.0

 Attachments: 7660-v1.txt


 HFileV1 should be removed from the regionserver because it is somewhat of a 
 drag on development for working on the lower level read paths.  It's an 
 impediment to cleaning up the Store code.
 V1 HFiles ceased to be written in 0.92, but the V1 reader was left in place 
 so users could upgrade from 0.90 to 0.92.  Once all HFiles are compacted in 
 0.92, then the V1 code is no longer needed.  We then decided to leave the V1 
 code in place in 0.94 so users could upgrade directly from 0.90 to 0.94.  The 
 code is still there in trunk but should probably be shown the door.  I see a 
 few options:
 1) just delete the code and tell people to make sure they compact everything 
 using 0.92 or 0.94
 2) create a standalone script that people can run on their 0.92 or 0.94 
 cluster that iterates the filesystem and prints out any v1 files with a 
 message that the user should run a major compaction
 3) add functionality to 0.96.0 (first release, maybe in hbck) that 
 proactively kills v1 files, so that we can be sure there are none when 
 upgrading from 0.96 to 0.98
 4) punt to 0.98 and probably do one of the above options in a year
 I would vote for #1 or #2 which will allow us to have a v1-free 0.96.0.  
 HFileV1 has already survived 2 major release upgrades, which I think many 
 would agree is more than enough for a pre-1.0, free product.  If we can 
 remove it in 0.96.0 it will be out of the way to introduce some nice 
 performance improvements in subsequent 0.96.x releases.



[jira] [Updated] (HBASE-7660) Remove HFileV1 code

2013-01-26 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-7660:
--

Status: Patch Available  (was: Open)

 Remove HFileV1 code
 ---

 Key: HBASE-7660
 URL: https://issues.apache.org/jira/browse/HBASE-7660
 Project: HBase
  Issue Type: Improvement
  Components: hbck, HFile, migration
Reporter: Matt Corgan
Assignee: Ted Yu
 Fix For: 0.96.0

 Attachments: 7660-v1.txt


 HFileV1 should be removed from the regionserver because it is somewhat of a 
 drag on development for working on the lower level read paths.  It's an 
 impediment to cleaning up the Store code.
 V1 HFiles ceased to be written in 0.92, but the V1 reader was left in place 
 so users could upgrade from 0.90 to 0.92.  Once all HFiles are compacted in 
 0.92, then the V1 code is no longer needed.  We then decided to leave the V1 
 code in place in 0.94 so users could upgrade directly from 0.90 to 0.94.  The 
 code is still there in trunk but should probably be shown the door.  I see a 
 few options:
 1) just delete the code and tell people to make sure they compact everything 
 using 0.92 or 0.94
 2) create a standalone script that people can run on their 0.92 or 0.94 
 cluster that iterates the filesystem and prints out any v1 files with a 
 message that the user should run a major compaction
 3) add functionality to 0.96.0 (first release, maybe in hbck) that 
 proactively kills v1 files, so that we can be sure there are none when 
 upgrading from 0.96 to 0.98
 4) punt to 0.98 and probably do one of the above options in a year
 I would vote for #1 or #2 which will allow us to have a v1-free 0.96.0.  
 HFileV1 has already survived 2 major release upgrades, which I think many 
 would agree is more than enough for a pre-1.0, free product.  If we can 
 remove it in 0.96.0 it will be out of the way to introduce some nice 
 performance improvements in subsequent 0.96.x releases.



[jira] [Commented] (HBASE-7660) Remove HFileV1 code

2013-01-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563705#comment-13563705
 ] 

Hadoop QA commented on HBASE-7660:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12566651/7660-v1.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

{color:red}-1 core tests{color}.  The patch failed these unit tests:
  org.apache.hadoop.hbase.regionserver.TestStoreFile
  org.apache.hadoop.hbase.regionserver.TestSplitTransaction
  org.apache.hadoop.hbase.io.hfile.TestHFileReaderV1

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4202//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4202//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4202//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4202//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4202//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4202//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4202//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4202//console

This message is automatically generated.

 Remove HFileV1 code
 ---

 Key: HBASE-7660
 URL: https://issues.apache.org/jira/browse/HBASE-7660
 Project: HBase
  Issue Type: Improvement
  Components: hbck, HFile, migration
Reporter: Matt Corgan
Assignee: Ted Yu
 Fix For: 0.96.0

 Attachments: 7660-v1.txt


 HFileV1 should be removed from the regionserver because it is somewhat of a 
 drag on development for working on the lower level read paths.  It's an 
 impediment to cleaning up the Store code.
 V1 HFiles ceased to be written in 0.92, but the V1 reader was left in place 
 so users could upgrade from 0.90 to 0.92.  Once all HFiles are compacted in 
 0.92, then the V1 code is no longer needed.  We then decided to leave the V1 
 code in place in 0.94 so users could upgrade directly from 0.90 to 0.94.  The 
 code is still there in trunk but should probably be shown the door.  I see a 
 few options:
 1) just delete the code and tell people to make sure they compact everything 
 using 0.92 or 0.94
 2) create a standalone script that people can run on their 0.92 or 0.94 
 cluster that iterates the filesystem and prints out any v1 files with a 
 message that the user should run a major compaction
 3) add functionality to 0.96.0 (first release, maybe in hbck) that 
 proactively kills v1 files, so that we can be sure there are none when 
 upgrading from 0.96 to 0.98
 4) punt to 0.98 and probably do one of the above options in a year
 I would vote for #1 or #2 which will allow us to have a v1-free 0.96.0.  
 HFileV1 has already survived 2 major release upgrades, which I think many 
 would agree is more than enough for a pre-1.0, free product.  If we can 
 remove it in 0.96.0 it will be out of the way to introduce some nice 
 performance improvements in subsequent 0.96.x releases.



[jira] [Commented] (HBASE-7681) Address some recent random test failures

2013-01-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563706#comment-13563706
 ] 

Hudson commented on HBASE-7681:
---

Integrated in HBase-0.94 #783 (See 
[https://builds.apache.org/job/HBase-0.94/783/])
HBASE-7681 Address some recent random test failures (Revision 1439004)

 Result = FAILURE
larsh : 
Files : 
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/TestNodeHealthCheckChore.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/client/TestAdmin.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/coprocessor/SimpleRegionObserver.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/TestLruBlockCache.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitLogWorker.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/thrift/TestThriftServerCmdLine.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/util/TestMiniClusterLoadSequential.java


 Address some recent random test failures
 

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-addendum.txt, 7681-0.94-combined.txt, 
 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-addendum.txt, 
 7681-0.96-combined.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 
 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally, even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, one option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.



[jira] [Updated] (HBASE-7660) Remove HFileV1 code

2013-01-26 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-7660:
--

Attachment: 7660-v2.txt

[~stack]:
Can you provide your opinion ?

 Remove HFileV1 code
 ---

 Key: HBASE-7660
 URL: https://issues.apache.org/jira/browse/HBASE-7660
 Project: HBase
  Issue Type: Improvement
  Components: hbck, HFile, migration
Reporter: Matt Corgan
Assignee: Ted Yu
 Fix For: 0.96.0

 Attachments: 7660-v1.txt, 7660-v2.txt


 HFileV1 should be removed from the regionserver because it is somewhat of a 
 drag on development for working on the lower level read paths.  It's an 
 impediment to cleaning up the Store code.
 V1 HFiles ceased to be written in 0.92, but the V1 reader was left in place 
 so users could upgrade from 0.90 to 0.92.  Once all HFiles are compacted in 
 0.92, then the V1 code is no longer needed.  We then decided to leave the V1 
 code in place in 0.94 so users could upgrade directly from 0.90 to 0.94.  The 
 code is still there in trunk but should probably be shown the door.  I see a 
 few options:
 1) just delete the code and tell people to make sure they compact everything 
 using 0.92 or 0.94
 2) create a standalone script that people can run on their 0.92 or 0.94 
 cluster that iterates the filesystem and prints out any v1 files with a 
 message that the user should run a major compaction
 3) add functionality to 0.96.0 (first release, maybe in hbck) that 
 proactively kills v1 files, so that we can be sure there are none when 
 upgrading from 0.96 to 0.98
 4) punt to 0.98 and probably do one of the above options in a year
 I would vote for #1 or #2 which will allow us to have a v1-free 0.96.0.  
 HFileV1 has already survived 2 major release upgrades which I think many 
 would agree is more than enough for a pre-1.0, free product.  If we can 
 remove it in 0.96.0 it will be out of the way to introduce some nice 
 performance improvements in subsequent 0.96.x releases.



[jira] [Commented] (HBASE-7681) Address some recent random test failures

2013-01-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563710#comment-13563710
 ] 

Hudson commented on HBASE-7681:
---

Integrated in HBase-TRUNK #3807 (See 
[https://builds.apache.org/job/HBase-TRUNK/3807/])
HBASE-7681 Address some recent random test failures (Revision 1439003)

 Result = FAILURE
larsh : 
Files : 
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/TestNodeHealthCheckChore.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestAdmin.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/coprocessor/SimpleRegionObserver.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestLruBlockCache.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitLogWorker.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/thrift/TestThriftServerCmdLine.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/util/TestMiniClusterLoadSequential.java


 Address some recent random test failures
 

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-addendum.txt, 7681-0.94-combined.txt, 
 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-addendum.txt, 
 7681-0.96-combined.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 
 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, one option is to increase some of these timeouts 
 a bit.
 This issue is to remind me to do that.



[jira] [Commented] (HBASE-7660) Remove HFileV1 code

2013-01-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563718#comment-13563718
 ] 

Hadoop QA commented on HBASE-7660:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12566653/7660-v2.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 21 new 
or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4203//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4203//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4203//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4203//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4203//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4203//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4203//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4203//console

This message is automatically generated.

 Remove HFileV1 code
 ---

 Key: HBASE-7660
 URL: https://issues.apache.org/jira/browse/HBASE-7660
 Project: HBase
  Issue Type: Improvement
  Components: hbck, HFile, migration
Reporter: Matt Corgan
Assignee: Ted Yu
 Fix For: 0.96.0

 Attachments: 7660-v1.txt, 7660-v2.txt


 HFileV1 should be removed from the regionserver because it is somewhat of a 
 drag on development for working on the lower level read paths.  It's an 
 impediment to cleaning up the Store code.
 V1 HFiles ceased to be written in 0.92, but the V1 reader was left in place 
 so users could upgrade from 0.90 to 0.92.  Once all HFiles are compacted in 
 0.92, then the V1 code is no longer needed.  We then decided to leave the V1 
 code in place in 0.94 so users could upgrade directly from 0.90 to 0.94.  The 
 code is still there in trunk but should probably be shown the door.  I see a 
 few options:
 1) just delete the code and tell people to make sure they compact everything 
 using 0.92 or 0.94
 2) create a standalone script that people can run on their 0.92 or 0.94 
 cluster that iterates the filesystem and prints out any v1 files with a 
 message that the user should run a major compaction
 3) add functionality to 0.96.0 (first release, maybe in hbck) that 
 proactively kills v1 files, so that we can be sure there are none when 
 upgrading from 0.96 to 0.98
 4) punt to 0.98 and probably do one of the above options in a year
 I would vote for #1 or #2 which will allow us to have a v1-free 0.96.0.  
 HFileV1 has already survived 2 major release upgrades which I think many 
 would agree is more than enough for a pre-1.0, free product.  If we can 
 remove it in 0.96.0 it will be out of the way to introduce some nice 
 performance improvements in subsequent 0.96.x releases.



[jira] [Commented] (HBASE-7681) Address some recent random test failures

2013-01-26 Thread Himanshu Vashishtha (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563719#comment-13563719
 ] 

Himanshu Vashishtha commented on HBASE-7681:


[~lhofhansl] Yea 
Sorry for the delay, was away.

 Address some recent random test failures
 

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-addendum.txt, 7681-0.94-combined.txt, 
 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-addendum.txt, 
 7681-0.96-combined.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 
 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, one option is to increase some of these timeouts 
 a bit.
 This issue is to remind me to do that.



[jira] [Commented] (HBASE-7660) Remove HFileV1 code

2013-01-26 Thread Matt Corgan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563724#comment-13563724
 ] 

Matt Corgan commented on HBASE-7660:


Looks good to me Ted.  Thanks for spending time on it.

Only comment is that maybe you should preserve the constant 
HFileWriterV1.BLOOM_FILTER_DATA_KEY.  Unless you had a specific reason, might 
be better to move it to HFileWriterV2 instead of the following:
{code}
-  bloom = reader.getMetaBlock(HFileWriterV1.BLOOM_FILTER_DATA_KEY,
+  bloom = reader.getMetaBlock(BLOOM_FILTER_DATA,
   true);
{code}

 Remove HFileV1 code
 ---

 Key: HBASE-7660
 URL: https://issues.apache.org/jira/browse/HBASE-7660
 Project: HBase
  Issue Type: Improvement
  Components: hbck, HFile, migration
Reporter: Matt Corgan
Assignee: Ted Yu
 Fix For: 0.96.0

 Attachments: 7660-v1.txt, 7660-v2.txt


 HFileV1 should be removed from the regionserver because it is somewhat of a 
 drag on development for working on the lower level read paths.  It's an 
 impediment to cleaning up the Store code.
 V1 HFiles ceased to be written in 0.92, but the V1 reader was left in place 
 so users could upgrade from 0.90 to 0.92.  Once all HFiles are compacted in 
 0.92, then the V1 code is no longer needed.  We then decided to leave the V1 
 code in place in 0.94 so users could upgrade directly from 0.90 to 0.94.  The 
 code is still there in trunk but should probably be shown the door.  I see a 
 few options:
 1) just delete the code and tell people to make sure they compact everything 
 using 0.92 or 0.94
 2) create a standalone script that people can run on their 0.92 or 0.94 
 cluster that iterates the filesystem and prints out any v1 files with a 
 message that the user should run a major compaction
 3) add functionality to 0.96.0 (first release, maybe in hbck) that 
 proactively kills v1 files, so that we can be sure there are none when 
 upgrading from 0.96 to 0.98
 4) punt to 0.98 and probably do one of the above options in a year
 I would vote for #1 or #2 which will allow us to have a v1-free 0.96.0.  
 HFileV1 has already survived 2 major release upgrades which I think many 
 would agree is more than enough for a pre-1.0, free product.  If we can 
 remove it in 0.96.0 it will be out of the way to introduce some nice 
 performance improvements in subsequent 0.96.x releases.



[jira] [Updated] (HBASE-7660) Remove HFileV1 code

2013-01-26 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-7660:
--

Attachment: 7660-v3.txt

 Remove HFileV1 code
 ---

 Key: HBASE-7660
 URL: https://issues.apache.org/jira/browse/HBASE-7660
 Project: HBase
  Issue Type: Improvement
  Components: hbck, HFile, migration
Reporter: Matt Corgan
Assignee: Ted Yu
 Fix For: 0.96.0

 Attachments: 7660-v1.txt, 7660-v2.txt, 7660-v3.txt


 HFileV1 should be removed from the regionserver because it is somewhat of a 
 drag on development for working on the lower level read paths.  It's an 
 impediment to cleaning up the Store code.
 V1 HFiles ceased to be written in 0.92, but the V1 reader was left in place 
 so users could upgrade from 0.90 to 0.92.  Once all HFiles are compacted in 
 0.92, then the V1 code is no longer needed.  We then decided to leave the V1 
 code in place in 0.94 so users could upgrade directly from 0.90 to 0.94.  The 
 code is still there in trunk but should probably be shown the door.  I see a 
 few options:
 1) just delete the code and tell people to make sure they compact everything 
 using 0.92 or 0.94
 2) create a standalone script that people can run on their 0.92 or 0.94 
 cluster that iterates the filesystem and prints out any v1 files with a 
 message that the user should run a major compaction
 3) add functionality to 0.96.0 (first release, maybe in hbck) that 
 proactively kills v1 files, so that we can be sure there are none when 
 upgrading from 0.96 to 0.98
 4) punt to 0.98 and probably do one of the above options in a year
 I would vote for #1 or #2 which will allow us to have a v1-free 0.96.0.  
 HFileV1 has already survived 2 major release upgrades which I think many 
 would agree is more than enough for a pre-1.0, free product.  If we can 
 remove it in 0.96.0 it will be out of the way to introduce some nice 
 performance improvements in subsequent 0.96.x releases.



[jira] [Commented] (HBASE-7681) Address some recent random test failures

2013-01-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563731#comment-13563731
 ] 

Hudson commented on HBASE-7681:
---

Integrated in HBase-0.94-security #99 (See 
[https://builds.apache.org/job/HBase-0.94-security/99/])
HBASE-7681 Address some recent random test failures (Revision 1439004)

 Result = FAILURE
larsh : 
Files : 
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/TestNodeHealthCheckChore.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/client/TestAdmin.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/coprocessor/SimpleRegionObserver.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/TestLruBlockCache.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitLogWorker.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/thrift/TestThriftServerCmdLine.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/util/TestMiniClusterLoadSequential.java


 Address some recent random test failures
 

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-addendum.txt, 7681-0.94-combined.txt, 
 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-addendum.txt, 
 7681-0.96-combined.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 
 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, one option is to increase some of these timeouts 
 a bit.
 This issue is to remind me to do that.



[jira] [Commented] (HBASE-7669) ROOT region wouldn't be handled by PRI-IPC-Handler

2013-01-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563732#comment-13563732
 ] 

Hudson commented on HBASE-7669:
---

Integrated in HBase-0.94-security #99 (See 
[https://builds.apache.org/job/HBase-0.94-security/99/])
HBASE-7669 ROOT region wouldn't be handled by PRI-IPC-Handler(Chunhui) 
(Revision 1438832)

 Result = FAILURE
zjushch : 
Files : 
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java


 ROOT region wouldn't be handled by PRI-IPC-Handler
 ---

 Key: HBASE-7669
 URL: https://issues.apache.org/jira/browse/HBASE-7669
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.4
Reporter: chunhui shen
Assignee: chunhui shen
 Fix For: 0.96.0, 0.94.5

 Attachments: 7669-94.patch, 7669.addendum, HBASE-7669.patch


 RPC requests about the ROOT region should be handled by PRI-IPC-Handler, just the 
 same as for the META region.



[jira] [Commented] (HBASE-7643) HFileArchiver.resolveAndArchive() race condition may lead to snapshot data loss

2013-01-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563733#comment-13563733
 ] 

Hudson commented on HBASE-7643:
---

Integrated in HBase-0.94-security #99 (See 
[https://builds.apache.org/job/HBase-0.94-security/99/])
HBASE-7643 HFileArchiver.resolveAndArchive() race condition may lead to 
snapshot data loss (Revision 1438973)

 Result = FAILURE
mbertozzi : 
Files : 
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/backup/HFileArchiver.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/backup/TestHFileArchiving.java


 HFileArchiver.resolveAndArchive() race condition may lead to snapshot data 
 loss
 ---

 Key: HBASE-7643
 URL: https://issues.apache.org/jira/browse/HBASE-7643
 Project: HBase
  Issue Type: Bug
Affects Versions: hbase-6055, 0.96.0
Reporter: Matteo Bertozzi
Assignee: Matteo Bertozzi
Priority: Blocker
 Fix For: 0.96.0, 0.94.5

 Attachments: HBASE-7653-p4-v0.patch, HBASE-7653-p4-v1.patch, 
 HBASE-7653-p4-v2.patch, HBASE-7653-p4-v3.patch, HBASE-7653-p4-v4.patch, 
 HBASE-7653-p4-v5.patch, HBASE-7653-p4-v6.patch, HBASE-7653-p4-v7.patch


  * The master has an hfile cleaner thread (responsible for cleaning 
 the /hbase/.archive dir)
  ** /hbase/.archive/table/region/family/hfile
  ** if the table/region/family directory is empty, the cleaner removes it
  * The master can archive files (from another thread, e.g. DeleteTableHandler)
  * The region can archive files (from another server/process, e.g. compaction)
 The simplified file archiving code looks like this:
 {code}
 HFileArchiver.resolveAndArchive(...) {
   // ensure that the archive dir exists
   fs.mkdir(archiveDir);
   // move the file to the archive directory
   success = fs.rename(originalPath/fileName, archiveDir/fileName);
   // if the rename failed, delete the file without archiving
   if (!success) fs.delete(originalPath/fileName);
 }
 {code}
 Since there's no synchronization between HFileArchiver.resolveAndArchive() 
 and the cleaner run (different process, thread, ...), you can end up in a 
 situation where you are moving something into a directory that doesn't exist.
 {code}
 fs.mkdir(archiveDir);
 // HFileCleaner chore starts at this point
 // and the archiveDirectory that we just ensured to be present gets removed.
 // The rename at this point will fail since the parent directory is missing.
 success = fs.rename(originalPath/fileName, archiveDir/fileName);
 {code}
 The problem with deleting the file without archiving it is that if you have a 
 snapshot or a cloned table that relies on that file being present, you lose 
 data.
 Possible solutions:
  * Create a ZooKeeper lock to notify the master ("Hey, I'm archiving 
 something, wait a bit")
  * Add an RS -> Master call that lets the master remove the files, avoiding this 
 kind of situation
  * Avoid removing empty directories from the archive if the table exists or 
 is not disabled
  * Add a try/catch around the fs.rename
 The last one, the easiest, looks like:
 {code}
 for (int i = 0; i < retries; ++i) {
   // ensure the archive directory is present
   fs.mkdir(archiveDir);
   // possible race here
   // try to archive file
   success = fs.rename(originalPath/fileName, archiveDir/fileName);
   if (success) break;
 }
 {code}
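
The retry idea above can be sketched as a small self-contained program. This is only an illustration under stated assumptions: it uses java.nio.file on a local filesystem rather than the Hadoop FileSystem API that HFileArchiver works with, and the class name ArchiveRetrySketch and the method archiveWithRetry are hypothetical, not part of HBase. The point it shows is that the directory is recreated before every rename attempt and that the source file is never deleted when archiving fails.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical helper, not HBase code: retries "mkdir + rename" on a local filesystem.
final class ArchiveRetrySketch {

    /**
     * Moves source into archiveDir, recreating archiveDir before every attempt.
     * Returns true once a move succeeds and false if every attempt fails.
     * The source file is never deleted on failure.
     */
    static boolean archiveWithRetry(Path source, Path archiveDir, int retries) throws IOException {
        Path target = archiveDir.resolve(source.getFileName());
        for (int i = 0; i < retries; ++i) {
            // Ensure the archive directory is present (no-op if it already exists).
            Files.createDirectories(archiveDir);
            try {
                // A concurrent cleaner may remove archiveDir between these two calls.
                Files.move(source, target);
                return true;
            } catch (NoSuchFileException e) {
                if (!Files.exists(source)) {
                    return false; // the source itself is gone, nothing left to archive
                }
                // Otherwise the parent directory vanished; loop and try again.
            }
        }
        return false;
    }
}
```

A caller would treat a false return as "not archived, file still in place" and could log it or retry later, instead of falling back to deleting the file.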



[jira] [Commented] (HBASE-7654) Add List<String> getCoprocessors() to HTableDescriptor

2013-01-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563734#comment-13563734
 ] 

Hudson commented on HBASE-7654:
---

Integrated in HBase-0.94-security #99 (See 
[https://builds.apache.org/job/HBase-0.94-security/99/])
HBASE-7654. Add List<String> getCoprocessors() to HTableDescriptor 
(Jean-Marc Spaggiari) (Revision 1438790)

 Result = FAILURE
apurtell : 
Files : 
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HTableDescriptor.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/TestHTableDescriptor.java


 Add List<String> getCoprocessors() to HTableDescriptor
 --

 Key: HBASE-7654
 URL: https://issues.apache.org/jira/browse/HBASE-7654
 Project: HBase
  Issue Type: Bug
  Components: Client, Coprocessors
Affects Versions: 0.96.0, 0.94.5
Reporter: Jean-Marc Spaggiari
Assignee: Jean-Marc Spaggiari
Priority: Critical
 Fix For: 0.96.0, 0.94.5

 Attachments: HBASE-7654-v0-0.94.patch, HBASE-7654-v0-trunk.patch


 Add List<String> getCoprocessors() to HTableInterface to retrieve the list of 
 coprocessors loaded into this table.



[jira] [Commented] (HBASE-7681) Address some recent random test failures

2013-01-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563737#comment-13563737
 ] 

Hudson commented on HBASE-7681:
---

Integrated in HBase-0.94-security #100 (See 
[https://builds.apache.org/job/HBase-0.94-security/100/])
HBASE-7681 Addendum, close tables in TestRegionServerMetrics (Revision 
1439025)

 Result = SUCCESS
larsh : 
Files : 
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerMetrics.java


 Address some recent random test failures
 

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-addendum.txt, 7681-0.94-combined.txt, 
 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-addendum.txt, 
 7681-0.96-combined.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 
 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, one option is to increase some of these timeouts 
 a bit.
 This issue is to remind me to do that.



[jira] [Commented] (HBASE-7681) Address some recent random test failures

2013-01-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563740#comment-13563740
 ] 

Hudson commented on HBASE-7681:
---

Integrated in HBase-0.94 #786 (See 
[https://builds.apache.org/job/HBase-0.94/786/])
HBASE-7681 Addendum, close tables in TestRegionServerMetrics (Revision 
1439025)

 Result = SUCCESS
larsh : 
Files : 
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerMetrics.java


 Address some recent random test failures
 

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-addendum.txt, 7681-0.94-combined.txt, 
 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-addendum.txt, 
 7681-0.96-combined.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 
 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, one option is to increase some of these timeouts 
 a bit.
 This issue is to remind me to do that.



[jira] [Updated] (HBASE-7683) TestMasterObserver fails occasionally

2013-01-26 Thread Lars Hofhansl (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-7683:
-

Description: 
{code}
org.apache.hadoop.hbase.InvalidFamilyOperationException: Column family 'fam2' 
does not exist  at 
org.apache.hadoop.hbase.master.handler.TableEventHandler.hasColumnFamily(TableEventHandler.java:182)
  at
org.apache.hadoop.hbase.master.handler.TableDeleteFamilyHandler.<init>(TableDeleteFamilyHandler.java:43)
  at 
org.apache.hadoop.hbase.master.HMaster.deleteColumn(HMaster.java:1303)  at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)  
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at 
java.lang.reflect.Method.invoke(Method.java:597)  at 
org.apache.hadoop.hbase.ipc.SecureRpcEngine$Server.call(SecureRpcEngine.java:372)
  at 
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426) 
{code}

  was:
{code}
org.apache.hadoop.hbase.InvalidFamilyOperationException: Column family 'fam2' 
does not exist  at 
org.apache.hadoop.hbase.master.handler.TableEventHandler.hasColumnFamily(TableEventHandler.java:182)
  at 
org.apache.hadoop.hbase.master.handler.TableDeleteFamilyHandler.<init>(TableDeleteFamilyHandler.java:43)
  at org.apache.hadoop.hbase.master.HMaster.deleteColumn(HMaster.java:1303)  at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)  
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)  at 
org.apache.hadoop.hbase.ipc.SecureRpcEngine$Server.call(SecureRpcEngine.java:372)
  at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426) 
{code}


 TestMasterObserver fails occasionally
 -

 Key: HBASE-7683
 URL: https://issues.apache.org/jira/browse/HBASE-7683
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl

 {code}
 org.apache.hadoop.hbase.InvalidFamilyOperationException: Column family 'fam2' 
 does not exist  at 
 org.apache.hadoop.hbase.master.handler.TableEventHandler.hasColumnFamily(TableEventHandler.java:182)
   at
 org.apache.hadoop.hbase.master.handler.TableDeleteFamilyHandler.<init>(TableDeleteFamilyHandler.java:43)
   at 
 org.apache.hadoop.hbase.master.HMaster.deleteColumn(HMaster.java:1303)  at 
 sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
  at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at 
 java.lang.reflect.Method.invoke(Method.java:597)  at 
 org.apache.hadoop.hbase.ipc.SecureRpcEngine$Server.call(SecureRpcEngine.java:372)
   at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426) 
 {code}



[jira] [Created] (HBASE-7683) TestMasterObserver fails occasionally

2013-01-26 Thread Lars Hofhansl (JIRA)
Lars Hofhansl created HBASE-7683:


 Summary: TestMasterObserver fails occasionally
 Key: HBASE-7683
 URL: https://issues.apache.org/jira/browse/HBASE-7683
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl


{code}
org.apache.hadoop.hbase.InvalidFamilyOperationException: Column family 'fam2' 
does not exist  at 
org.apache.hadoop.hbase.master.handler.TableEventHandler.hasColumnFamily(TableEventHandler.java:182)
  at 
org.apache.hadoop.hbase.master.handler.TableDeleteFamilyHandler.<init>(TableDeleteFamilyHandler.java:43)
  at org.apache.hadoop.hbase.master.HMaster.deleteColumn(HMaster.java:1303)  at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)  
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)  at 
org.apache.hadoop.hbase.ipc.SecureRpcEngine$Server.call(SecureRpcEngine.java:372)
  at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426) 
{code}



[jira] [Commented] (HBASE-7683) TestMasterObserver fails occasionally

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563744#comment-13563744
 ] 

Lars Hofhansl commented on HBASE-7683:
--

I think this log snippet is interesting.

The sequence is this:
# updating /observed_table/.tableinfo.03
# updating /observed_table/.tableinfo.04
# updating /observed_table/.tableinfo.05
# updating /observed_table/.tableinfo.06
# failed to delete /observed_table/.tableinfo.06
# updating /observed_table/.tableinfo.02
# cleaned up /observed_table/.tableinfo.02

So it seems version 2 was somehow updated out of sync, after versions 3 
through 6 had already been written.
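
The out-of-order update described above can be checked mechanically. The 
following is only a sketch under stated assumptions: TableInfoSequenceCheck, 
sequenceId and outOfOrder are hypothetical names, not HBase APIs, and the input 
is assumed to be the list of updated tableinfo paths in the order they appear 
in the log. It extracts the numeric suffix of each path and reports any update 
whose sequence id is lower than one already seen.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of HBase): flags tableinfo updates that arrive out of order. */
class TableInfoSequenceCheck {
  // Matches the numeric suffix of names such as ".tableinfo.03".
  private static final Pattern SUFFIX = Pattern.compile("\\.tableinfo\\.(\\d+)$");

  /** Returns the sequence id encoded in the path, or -1 if the path is not a tableinfo file. */
  static int sequenceId(String path) {
    Matcher m = SUFFIX.matcher(path);
    return m.find() ? Integer.parseInt(m.group(1)) : -1;
  }

  /** Returns every updated path whose sequence id is lower than one seen earlier in the list. */
  static List<String> outOfOrder(List<String> updatedPaths) {
    List<String> stale = new ArrayList<String>();
    int highest = -1;
    for (String path : updatedPaths) {
      int id = sequenceId(path);
      if (id < 0) {
        continue; // not a tableinfo file, ignore
      }
      if (id < highest) {
        stale.add(path); // written after a newer version already existed
      } else {
        highest = id;
      }
    }
    return stale;
  }

  public static void main(String[] args) {
    // Update order as listed in the comment above.
    List<String> updates = Arrays.asList(
        "/observed_table/.tableinfo.03",
        "/observed_table/.tableinfo.04",
        "/observed_table/.tableinfo.05",
        "/observed_table/.tableinfo.06",
        "/observed_table/.tableinfo.02");
    System.out.println(outOfOrder(updates));
  }
}
```

Run against the sequence above, only the .tableinfo.02 update is reported, 
which matches the observation that version 2 was written after the newer 
versions.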

{code}
2013-01-27 05:22:57,737 DEBUG [IPC Server handler 1 on 41253] client.ClientScanner(185): Finished with scanning at {NAME = '.META.,,1', STARTKEY = '', ENDKEY = '', ENCODED = 1028785192,}
2013-01-27 05:22:57,737 INFO  [IPC Server handler 1 on 41253] master.MasterFileSystem(541): AddColumn. Table = observed_table HCD = {NAME = 'fam2', DATA_BLOCK_ENCODING = 'NONE', BLOOMFILTER = 'NONE', REPLICATION_SCOPE = '0', VERSIONS = '3', COMPRESSION = 'NONE', MIN_VERSIONS = '0', TTL = '2147483647', KEEP_DELETED_CELLS = 'false', BLOCKSIZE = '65536', IN_MEMORY = 'false', ENCODE_ON_DISK = 'true', BLOCKCACHE = 'true'}
2013-01-27 05:22:57,740 DEBUG [IPC Server handler 1 on 41253] util.FSTableDescriptors(501): hdfs://localhost:40948/user/jenkins/hbase/observed_table/.tmp/.tableinfo.02 exists; retrying up to 10 times
2013-01-27 05:22:57,750 INFO  [IPC Server handler 1 on 41253] util.FSTableDescriptors(452): Updated tableinfo=hdfs://localhost:40948/user/jenkins/hbase/observed_table/.tableinfo.03
2013-01-27 05:22:57,761 INFO  [IPC Server handler 1 on 41253] util.FSTableDescriptors(452): Updated tableinfo=hdfs://localhost:40948/user/jenkins/hbase/observed_table/.tableinfo.04
2013-01-27 05:22:57,764 DEBUG [IPC Server handler 0 on 41253] client.ClientScanner(90): Creating scanner over .META. starting at key 'observed_table,,'
2013-01-27 05:22:57,764 DEBUG [IPC Server handler 0 on 41253] client.ClientScanner(198): Advancing internal scanner to startKey at 'observed_table,,'
2013-01-27 05:22:57,767 INFO  [IPC Server handler 0 on 41253] handler.TableEventHandler(91): Handling table operation C_M_MODIFY_FAMILY on table observed_table
2013-01-27 05:22:57,768 DEBUG [IPC Server handler 0 on 41253] client.ClientScanner(90): Creating scanner over .META. starting at key 'observed_table,,'
2013-01-27 05:22:57,768 DEBUG [IPC Server handler 0 on 41253] client.ClientScanner(198): Advancing internal scanner to startKey at 'observed_table,,'
2013-01-27 05:22:57,771 DEBUG [IPC Server handler 0 on 41253] client.ClientScanner(185): Finished with scanning at {NAME = '.META.,,1', STARTKEY = '', ENDKEY = '', ENCODED = 1028785192,}
2013-01-27 05:22:57,771 INFO  [IPC Server handler 0 on 41253] master.MasterFileSystem(518): AddModifyColumn. Table = observed_table HCD = {NAME = 'fam2', DATA_BLOCK_ENCODING = 'NONE', BLOOMFILTER = 'NONE', REPLICATION_SCOPE = '0', VERSIONS = '25', COMPRESSION = 'NONE', MIN_VERSIONS = '0', TTL = '2147483647', KEEP_DELETED_CELLS = 'false', BLOCKSIZE = '65536', IN_MEMORY = 'false', ENCODE_ON_DISK = 'true', BLOCKCACHE = 'true'}
2013-01-27 05:22:57,787 INFO  [IPC Server handler 0 on 41253] util.FSTableDescriptors(452): Updated tableinfo=hdfs://localhost:40948/user/jenkins/hbase/observed_table/.tableinfo.05
2013-01-27 05:22:57,799 INFO  [IPC Server handler 0 on 41253] util.FSTableDescriptors(452): Updated tableinfo=hdfs://localhost:40948/user/jenkins/hbase/observed_table/.tableinfo.06
2013-01-27 05:22:57,801 DEBUG [IPC Server handler 1 on 41253] client.ClientScanner(90): Creating scanner over .META. starting at key 'observed_table,,'
2013-01-27 05:22:57,801 DEBUG [IPC Server handler 1 on 41253] client.ClientScanner(198): Advancing internal scanner to startKey at 'observed_table,,'
2013-01-27 05:22:57,808 INFO  [pool-1-thread-1] client.HBaseAdmin(727): Started enable of observed_table
2013-01-27 05:22:57,809 DEBUG [pool-1-thread-1] zookeeper.ZKUtil(1571): hconnection-0x13c7a752b7c0006 Retrieved 8 byte(s) of data from znode /hbase/table/observed_table; data=ENABLING
2013-01-27 05:22:57,809 DEBUG [pool-1-thread-1] client.HBaseAdmin(684): Sleeping= 1000ms, waiting for all regions to be enabled in observed_table
2013-01-27 05:22:58,136 WARN  [MASTER_TABLE_OPERATIONS-hemera.apache.org,41253,1359264165304-0] util.FSTableDescriptors(522): Failed delete of hdfs://localhost:40948/user/jenkins/hbase/observed_table/.tableinfo.01; continuing
2013-01-27 05:22:58,136 INFO  [MASTER_TABLE_OPERATIONS-hemera.apache.org,41253,1359264165304-0] util.FSTableDescriptors(452): Updated tableinfo=hdfs://localhost:40948/user/jenkins/hbase/observed_table/.tableinfo.02
2013-01-27 05:22:58,138 DEBUG 

[jira] [Commented] (HBASE-7683) TestMasterObserver fails occasionally

2013-01-26 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563745#comment-13563745
 ] 

Lars Hofhansl commented on HBASE-7683:
--

If someone feels like looking at this that would be greatly appreciated, as I 
have exhausted my goodwill time to work on tests for a bit.


 TestMasterObserver fails occasionally
 -

 Key: HBASE-7683
 URL: https://issues.apache.org/jira/browse/HBASE-7683
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl

 {code}
 org.apache.hadoop.hbase.InvalidFamilyOperationException: Column family 'fam2' does not exist
   at org.apache.hadoop.hbase.master.handler.TableEventHandler.hasColumnFamily(TableEventHandler.java:182)
   at org.apache.hadoop.hbase.master.handler.TableDeleteFamilyHandler.init(TableDeleteFamilyHandler.java:43)
   at org.apache.hadoop.hbase.master.HMaster.deleteColumn(HMaster.java:1303)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.hbase.ipc.SecureRpcEngine$Server.call(SecureRpcEngine.java:372)
   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
 {code}



[jira] [Resolved] (HBASE-7681) Address some recent random test failures

2013-01-26 Thread Lars Hofhansl (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl resolved HBASE-7681.
--

Resolution: Fixed

Committed to 0.94 and 0.96

 Address some recent random test failures
 

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-addendum.txt, 7681-0.94-combined.txt, 
 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-addendum.txt, 
 7681-0.96-combined.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 
 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, an option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.
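
 To illustrate the wall-clock point, here is a minimal sketch of one way a test 
 can tolerate slow build machines: poll a condition up to a generous deadline 
 instead of sleeping for a fixed interval. WaitUtil and waitFor are hypothetical 
 names used only for this example; they are not HBase APIs and not necessarily 
 what the attached patches do.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical test helper (not an HBase API): poll a condition until a deadline. */
class WaitUtil {
  /**
   * Returns true as soon as the condition holds, or false once timeoutMs has
   * elapsed (or the thread is interrupted) without the condition becoming true.
   */
  static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      long remainingNanos = deadline - System.nanoTime();
      if (remainingNanos <= 0) {
        return false;
      }
      try {
        // Sleep for the polling interval, but never past the deadline.
        Thread.sleep(Math.min(intervalMs, Math.max(1, remainingNanos / 1_000_000L)));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
  }

  public static void main(String[] args) {
    final long start = System.nanoTime();
    // Stand-in for a real check such as "all regions enabled":
    // becomes true roughly 50 ms after start.
    boolean ok = waitFor(5000, 10, () -> System.nanoTime() - start > 50_000_000L);
    System.out.println("condition met: " + ok);
  }
}
```

 With this shape, raising the timeout only costs time when the condition is 
 never met, so a larger bound does not slow down passing runs.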



[jira] [Commented] (HBASE-7681) Address some recent random test failures

2013-01-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563748#comment-13563748
 ] 

Hudson commented on HBASE-7681:
---

Integrated in HBase-TRUNK #3809 (See 
[https://builds.apache.org/job/HBase-TRUNK/3809/])
HBASE-7681 Addendum, close tables in TestRegionServerMetrics (Revision 
1439026)

 Result = FAILURE
larsh : 
Files : 
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerMetrics.java


 Address some recent random test failures
 

 Key: HBASE-7681
 URL: https://issues.apache.org/jira/browse/HBASE-7681
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.96.0, 0.94.5

 Attachments: 7681-0.94-addendum.txt, 7681-0.94-combined.txt, 
 7681-0.94-combined_v2.txt, 7681-0.94.txt, 7681-0.96-addendum.txt, 
 7681-0.96-combined.txt, 7681-0.96-combined.txt, 7681-94-v1.txt, 
 7681-94-v2.txt, 7681-94-v3.txt, 7681-trunk-v1.txt


 I've seen many unspecific test failures recently that cannot be reproduced 
 locally even when running these tests in a loop for a very long time.
 Many of these tests, one way or another, make assumptions w.r.t. wall clock 
 time. While I cannot fix that, an option is to increase some of these 
 timeouts a bit.
 This issue is to remind me to do that.


