[jira] [Commented] (HBASE-6588) enable table throws npe and leaves trash in zk in competition with delete table

2012-08-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438512#comment-13438512
 ] 

Hadoop QA commented on HBASE-6588:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12541719/HBASE-6588-trunk-v6.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 5 javac compiler warnings (more than 
the trunk's current 4 warnings).

-1 findbugs.  The patch appears to introduce 6 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.master.TestDistributedLogSplitting
  
org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster
  org.apache.hadoop.hbase.master.TestAssignmentManager
  org.apache.hadoop.hbase.client.TestAdmin
  org.apache.hadoop.hbase.TestMultiVersions

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2636//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2636//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2636//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2636//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2636//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2636//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2636//console

This message is automatically generated.

> enable table throws npe and leaves trash in zk in competition with delete 
> table
> ---
>
> Key: HBASE-6588
> URL: https://issues.apache.org/jira/browse/HBASE-6588
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.94.0
>Reporter: Zhou wenjian
>Assignee: Zhou wenjian
> Fix For: 0.94.2
>
> Attachments: HBASE-6588-trunk.patch, HBASE-6588-trunk-v2.patch, 
> HBASE-6588-trunk-v3.patch, HBASE-6588-trunk-v4.patch, 
> HBASE-6588-trunk-v5.patch, HBASE-6588-trunk-v6.patch
>
>
> 2012-08-15 19:23:36,178 DEBUG org.apache.hadoop.hbase.client.ClientScanner: 
> Creating scanner over .META. starting at key 'test,,'
> 2012-08-15 19:23:36,178 DEBUG org.apache.hadoop.hbase.client.ClientScanner: 
> Advancing internal scanner to startKey at 'test,,'
> 2012-08-15 19:24:09,180 DEBUG org.apache.hadoop.hbase.client.ClientScanner: 
> Creating scanner over .META. starting at key ''
> 2012-08-15 19:24:09,180 DEBUG org.apache.hadoop.hbase.client.ClientScanner: 
> Advancing internal scanner to startKey at ''
> 2012-08-15 19:24:09,183 DEBUG org.apache.hadoop.hbase.client.ClientScanner: 
> Finished with scanning at {NAME => '.META.,,1', STARTKEY => '', ENDKEY => '', 
> ENCODED => 1028785192,}
> 2012-08-15 19:24:09,183 DEBUG org.apache.hadoop.hbase.master.CatalogJanitor: 
> Scanned 2 catalog row(s) and gc'd 0 unreferenced parent region(s)
> 2012-08-15 19:25:12,260 DEBUG 
> org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Deleting region 
> test,,1345029764571.d1e24b251ca6286c840a9a5f571b7db1. from META and FS
> 2012-08-15 19:25:12,263 INFO org.apache.hadoop.hbase.catalog.MetaEditor: 
> Deleted region test,,1345029764571.d1e24b251ca6286c840a9a5f571b7db1. from META
> 2012-08-15 19:25:12,265 INFO 
> org.apache.hadoop.hbase.master.handler.EnableTableHandler: Attemping to 
> enable the table test
> 2012-08-15 19:25:12,265 WARN org.apache.hadoop.hbase.zookeeper.ZKTable: 
> Moving table test state to enabling but was not first in disabled state: null
> 2012-08-15 19:25:12,267 DEBUG org.apache.hadoop.hbase.client.ClientScanner: 
> Creating scanner over .META. starting at key 'test,,'
> 2012-08-15 19:25:12,267 DEBUG org.apache.hadoop.hbase.client.ClientScanner: 
> Advancing internal scanner to startKey at 'test,,'
> 2012-08-15 19:25:12,270 DEBUG org.apache.hadoop.hbase.client.ClientScanner: 
> Finished with scanning at {NAME => '.META.,,1', STARTKEY => '', ENDKEY => '', 

[jira] [Created] (HBASE-6625) If we have hundreds of thousands of regions getChildren will encounter zk exception

2012-08-21 Thread Zhou wenjian (JIRA)
Zhou wenjian created HBASE-6625:
---

 Summary: If we have hundreds of thousands of regions getChildren 
will encounter zk exception
 Key: HBASE-6625
 URL: https://issues.apache.org/jira/browse/HBASE-6625
 Project: HBase
  Issue Type: Bug
Reporter: Zhou wenjian
Assignee: Zhou wenjian


2012-05-13 19:37:37,528 DEBUG 
org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback: 
rs=CreateNewTableWith10Regions,\x05\xB3\x06 
g\xE8r\xBB]\x09\xCF,1336724029944.079cb2f8a375e66fa089291b82f2a03f. 
state=OFFLINE, ts=1336909053108 
2012-05-13 19:37:37,528 DEBUG 
org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback: 
rs=CreateNewTableWith10Regions,\x08s\x84\x8 
8$7\xB1\xC4\xFCg,1336724030660.76c07780231942231013c7feb5e5eb14. state=OFFLINE, 
ts=1336909055089, server=dw76.kgb.sqa.cm4,60020,1336908983944 
2012-05-13 19:37:37,528 DEBUG 
org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback: 
rs=CreateNewTableWith10Regions,\x08s\x89\xC 
B\x9B\xF0\xE4\xCA\x97\xB0,1336724030660.fa38b9d8367387a64a327087cb43b3e0. 
state=OFFLINE, ts=1336909055089, server=dw76.kgb.sqa.cm4,60020,1336908983944 
2012-05-13 19:37:37,528 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
dw76.kgb.sqa.cm4,60020,1336908983944 unassigned znodes=58464 of total=120002 
2012-05-13 19:37:37,758 WARN org.apache.zookeeper.ClientCnxn: Session 
0x13745fc2c8d0001 for server dw51.kgb.sqa.cm4/10.232.98.51:2180, unexpected 
error, closing socket connection and attempting reconnect 
java.io.IOException: Packet len4320092 is out of range! 
at 
org.apache.zookeeper.ClientCnxn$SendThread.readLength(ClientCnxn.java:710) 
at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:869) 
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1130) 
2012-05-13 19:37:37,860 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:6-0x13745fc2c8d0001 Unable to list children of znode 
/hbase-new4/unassigned 

org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for /hbase-new4/unassigned 
at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) 
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) 
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1243) 
at 
org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:302)
 
at 
org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndGetNewChildren(ZKUtil.java:413)
 
at 
org.apache.hadoop.hbase.master.AssignmentManager.nodeChildrenChanged(AssignmentManager.java:759)
 
at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:314)
 
at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530) 
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506) 
2012-05-13 19:37:37,861 ERROR 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: 
master:6-0x13745fc2c8d0001 Received unexpected KeeperException, re-throwing 
exception 
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for /hbase-new4/unassigned 
at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) 
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) 
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1243) 
at 
org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:302)
 
at 
org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndGetNewChildren(ZKUtil.java:413)
 
at 
org.apache.hadoop.hbase.master.AssignmentManager.nodeChildrenChanged(AssignmentManager.java:759)
 
at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:314)
 
at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530) 
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506) 
2012-05-13 19:37:37,861 FATAL org.apache.hadoop.hbase.master.HMaster: 
Unexpected ZK exception reading unassigned children 
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for /hbase-new4/unassigned 
at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) 
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) 
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1243) 
at 
org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:302)
 
at 
org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndGetNewChildren(ZKUtil.java:413)
 
at 
org.apache.hadoop.hbase.master.AssignmentManager.nodeChildrenChanged(AssignmentManager.java:759)
 

[jira] [Commented] (HBASE-6625) If we have hundreds of thousands of regions getChildren will encounter zk exception

2012-08-21 Thread Zhou wenjian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438522#comment-13438522
 ] 

Zhou wenjian commented on HBASE-6625:
-

The log above is from 0.90. 
ZooKeeper version: 3.3.3; 3.4 seems to be affected too. 

When the client reads from ZooKeeper, it checks the length of the response 
against a limit that defaults to 4 MB on the client side:

if (len < 0 || len >= ClientCnxn.packetLen) {
    throw new IOException("Packet len" + len + " is out of range!");
}

I think we could increase jute.maxbuffer when we start the cluster.
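The numbers in the log are consistent with that limit. A rough sanity check (Python sketch; the only assumption is that the getChildren response size is roughly proportional to the number of child znodes):

```python
# Sanity check using the two numbers from the log above.
DEFAULT_PACKET_LEN = 4 * 1024 * 1024   # client-side default (4 MB)
observed_len = 4_320_092               # "Packet len4320092 is out of range!"
total_znodes = 120_002                 # "unassigned znodes=58464 of total=120002"

print(observed_len > DEFAULT_PACKET_LEN)   # True: just over the 4 MB limit
print(round(observed_len / total_znodes))  # 36 -- roughly 36 bytes per child znode
```

So at ~36 bytes per child name, anything past roughly 116k unassigned znodes trips the default limit.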

> If we have hundreds of thousands of regions getChildren will encounter zk 
> exception
> ---
>
> Key: HBASE-6625
> URL: https://issues.apache.org/jira/browse/HBASE-6625
> Project: HBase
>  Issue Type: Bug
>Reporter: Zhou wenjian
>Assignee: Zhou wenjian
>
> 2012-05-13 19:37:37,528 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback:
>  rs=CreateNewTableWith10Regions,\x05\xB3\x06 
> g\xE8r\xBB]\x09\xCF,1336724029944.079cb2f8a375e66fa089291b82f2a03f. 
> state=OFFLINE, ts=1336909053108 
> 2012-05-13 19:37:37,528 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback:
>  rs=CreateNewTableWith10Regions,\x08s\x84\x8 
> 8$7\xB1\xC4\xFCg,1336724030660.76c07780231942231013c7feb5e5eb14. 
> state=OFFLINE, ts=1336909055089, server=dw76.kgb.sqa.cm4,60020,1336908983944 
> 2012-05-13 19:37:37,528 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback:
>  rs=CreateNewTableWith10Regions,\x08s\x89\xC 
> B\x9B\xF0\xE4\xCA\x97\xB0,1336724030660.fa38b9d8367387a64a327087cb43b3e0. 
> state=OFFLINE, ts=1336909055089, server=dw76.kgb.sqa.cm4,60020,1336908983944 
> 2012-05-13 19:37:37,528 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: 
> dw76.kgb.sqa.cm4,60020,1336908983944 unassigned znodes=58464 of total=120002 
> 2012-05-13 19:37:37,758 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x13745fc2c8d0001 for server dw51.kgb.sqa.cm4/10.232.98.51:2180, unexpected 
> error, closing socket connection and attempting reconnect 
> java.io.IOException: Packet len4320092 is out of range! 
> at 
> org.apache.zookeeper.ClientCnxn$SendThread.readLength(ClientCnxn.java:710) 
> at 
> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:869) 
> at 
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1130) 
> 2012-05-13 19:37:37,860 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: 
> master:6-0x13745fc2c8d0001 Unable to list children of znode 
> /hbase-new4/unassigned 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /hbase-new4/unassigned 
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:90) 
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:42) 
> at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1243) 
> at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:302)
>  
> at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndGetNewChildren(ZKUtil.java:413)
>  
> at 
> org.apache.hadoop.hbase.master.AssignmentManager.nodeChildrenChanged(AssignmentManager.java:759)
>  
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:314)
>  
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530) 
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506) 
> 2012-05-13 19:37:37,861 ERROR 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: 
> master:6-0x13745fc2c8d0001 Received unexpected KeeperException, re-throwing 
> exception 
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /hbase-new4/unassigned 
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:90) 
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:42) 
> at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1243) 
> at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:302)
>  
> at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndGetNewChildren(ZKUtil.java:413)
>  
> at 
> org.apache.hadoop.hbase.master.AssignmentManager.nodeChildrenChanged(AssignmentManager.java:759)
>  
> at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:314)
>  
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530) 
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.j

[jira] [Updated] (HBASE-6490) 'dfs.client.block.write.retries' value could be increased in HBase

2012-08-21 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6490:
---

Description: 
When allocating a new node during writing, HDFS tries 
'dfs.client.block.write.retries' times (default 3) to write the block. When it 
fails, it goes back to the namenode for a new list, and raises an error if the 
number of retries is reached. In HBase, if the error occurs while we're writing 
a hlog file, it will trigger a region server abort (as HBase does not trust the 
log anymore). For the simple case (a new, and as such empty, log file) this 
seems to be ok, and we don't lose data. There could be some complex cases if 
the error occurs on a hlog file with multiple blocks already written.

Log lines are:
"Exception in createBlockOutputStream", then "Abandoning block " followed by 
"Excluding datanode " for a retry.
IOException: "Unable to create new block.", when the number of retries is 
reached.

The probability of occurrence seems quite low, (number of bad nodes / number of 
nodes)^(number of retries), and it implies that you have a region server 
without its datanode. But it applies per new block.

Increasing the default value of 'dfs.client.block.write.retries' could make 
sense, to be better covered in chaotic conditions.
Environment: all  (was: When allocating a new node during writing, HDFS 
tries 'dfs.client.block.write.retries' times (default 3) to write the block. 
When it fails, it goes back to the namenode for a new list, and raises an error 
if the number of retries is reached. In HBase, if the error occurs while we're 
writing a hlog file, it will trigger a region server abort (as HBase does not 
trust the log anymore). For the simple case (a new, and as such empty, log 
file) this seems to be ok, and we don't lose data. There could be some complex 
cases if the error occurs on a hlog file with multiple blocks already written.

Log lines are:
"Exception in createBlockOutputStream", then "Abandoning block " followed by 
"Excluding datanode " for a retry.
IOException: "Unable to create new block.", when the number of retries is 
reached.

The probability of occurrence seems quite low, (number of bad nodes / number of 
nodes)^(number of retries), and it implies that you have a region server 
without its datanode. But it applies per new block.

Increasing the default value of 'dfs.client.block.write.retries' could make 
sense, to be better covered in chaotic conditions.)
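The probability estimate in the description can be made concrete with a quick sketch (Python; the cluster size, bad-node count, and block count are illustrative assumptions, not numbers from the report):

```python
# Per-block failure probability from the description:
# (bad nodes / total nodes) ^ (number of retries).
bad_nodes, total_nodes, retries = 1, 100, 3
p_block = (bad_nodes / total_nodes) ** retries
print(f"{p_block:.1e}")   # 1.0e-06 per newly allocated block

# The risk is per new block, so it accumulates on a busy cluster.
new_blocks = 100_000
p_any = 1 - (1 - p_block) ** new_blocks
print(f"{p_any:.3f}")     # 0.095 -- no longer negligible over many blocks
```

Raising the retry count to, say, 5 drops the per-block probability to (1/100)^5, which is the motivation for increasing the default.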

> 'dfs.client.block.write.retries' value could be increased in HBase
> --
>
> Key: HBASE-6490
> URL: https://issues.apache.org/jira/browse/HBASE-6490
> Project: HBase
>  Issue Type: Improvement
>  Components: master, regionserver
>Affects Versions: 0.96.0
> Environment: all
>Reporter: nkeywal
>Priority: Minor
>
> When allocating a new node during writing, HDFS tries 
> 'dfs.client.block.write.retries' times (default 3) to write the block. When 
> it fails, it goes back to the namenode for a new list, and raises an error if 
> the number of retries is reached. In HBase, if the error occurs while we're 
> writing a hlog file, it will trigger a region server abort (as HBase does not 
> trust the log anymore). For the simple case (a new, and as such empty, log 
> file) this seems to be ok, and we don't lose data. There could be some 
> complex cases if the error occurs on a hlog file with multiple blocks already 
> written.
> Log lines are:
> "Exception in createBlockOutputStream", then "Abandoning block " followed by 
> "Excluding datanode " for a retry.
> IOException: "Unable to create new block.", when the number of retries is 
> reached.
> The probability of occurrence seems quite low, (number of bad nodes / number 
> of nodes)^(number of retries), and it implies that you have a region server 
> without its datanode. But it applies per new block.
> Increasing the default value of 'dfs.client.block.write.retries' could make 
> sense, to be better covered in chaotic conditions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6438) RegionAlreadyInTransitionException needs to give more info to avoid assignment inconsistencies

2012-08-21 Thread rajeshbabu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

rajeshbabu updated HBASE-6438:
--

Status: Patch Available  (was: Open)

> RegionAlreadyInTransitionException needs to give more info to avoid 
> assignment inconsistencies
> --
>
> Key: HBASE-6438
> URL: https://issues.apache.org/jira/browse/HBASE-6438
> Project: HBase
>  Issue Type: Bug
>Reporter: ramkrishna.s.vasudevan
>Assignee: rajeshbabu
> Attachments: HBASE-6438_trunk.patch
>
>
> Looking at some of the recent issues in region assignment, 
> RegionAlreadyInTransitionException (RAITE) is one reason after which the 
> region assignment may or may not happen (in the sense that we need to wait 
> for the TM to assign).
> In HBASE-6317 we hit one problem due to RegionAlreadyInTransitionException on 
> master restart.
> Consider the following case: due to some reason like a master restart or an 
> external assign call, we try to assign a region that is already being opened 
> on a RS.
> Now the next call to assign has already changed the state of the znode, so 
> the current assign going on in the RS is affected and fails.  The second 
> assignment that started also fails with a RAITE exception.  Finally neither 
> assignment carries on.  The idea is to find out whether any such RAITE 
> exception can be retried or not.
> Here we have the following cases:
> -> The znode is yet to be transitioned from OFFLINE to OPENING in the RS.
> -> The RS may be in the openRegion step.
> -> The RS may be trying to transition OPENING to OPENED.
> -> The region is yet to be added to the online regions on the RS side.
> On any failure in openRegion() and updateMeta() we move the znode to 
> FAILED_OPEN, so in these cases getting a RAITE should be ok.  But in the 
> other cases the assignment is stopped.
> The idea is to add the current state of the region assignment to the RIT map 
> on the RS side; using that info we can determine whether the assignment can 
> be retried or not on getting a RAITE.
> Considering the current work going on in the AM, please do share whether this 
> is needed at least in the 0.92/0.94 versions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6435) Reading WAL files after a recovery leads to time lost in HDFS timeouts when using dead datanodes

2012-08-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438551#comment-13438551
 ] 

nkeywal commented on HBASE-6435:


v14: version I'm going to commit as soon as the local tests (in progress) are 
ok.

> Reading WAL files after a recovery leads to time lost in HDFS timeouts when 
> using dead datanodes
> 
>
> Key: HBASE-6435
> URL: https://issues.apache.org/jira/browse/HBASE-6435
> Project: HBase
>  Issue Type: Improvement
>  Components: master, regionserver
>Affects Versions: 0.96.0
>Reporter: nkeywal
>Assignee: nkeywal
> Fix For: 0.96.0
>
> Attachments: 6435.unfinished.patch, 6435.v10.patch, 6435.v10.patch, 
> 6435.v12.patch, 6435.v12.patch, 6435.v12.patch, 6435-v12.txt, 6435.v13.patch, 
> 6435.v14.patch, 6435.v2.patch, 6435.v7.patch, 6435.v8.patch, 6435.v9.patch, 
> 6435.v9.patch, 6535.v11.patch
>
>
> HBase writes a Write-Ahead-Log to recover from hardware failure.
> This log is written with 'append' on HDFS.
> Through ZooKeeper, HBase gets informed, usually within 30s, that it should 
> start the recovery process, which means reading the Write-Ahead-Log to 
> replay the edits on the other servers.
> In standard deployments, HBase processes (regionservers) are deployed on the 
> same boxes as the datanodes.
> This means that when a box stops, we've actually lost one copy of the edits, 
> as we lost both the regionserver and the datanode.
> As HDFS marks a node as dead only after ~10 minutes, the dead node still 
> appears available when we try to read the blocks to recover. As such, we 
> delay the recovery process by 60 seconds, as the read will usually fail with 
> a socket timeout. If the file is still open for writing, this adds an extra 
> 20s, plus a risk of losing edits if we connect with ipc to the dead DN.
> Possible solutions are:
> - shorter dead-datanode detection by the NN. Requires a NN code change.
> - better dead-datanode management in DFSClient. Requires a DFS code change.
> - NN customisation to write the WAL files on another DN instead of the local 
> one.
> - reordering the blocks returned by the NN on the client side to put the 
> blocks on the same DN as the dead RS at the end of the priority queue. 
> Requires a DFS code change or a kind of workaround.
> The solution retained is the last one. Compared to what was discussed on the 
> mailing list, the proposed patch does not modify HDFS source code but adds a 
> proxy, for two reasons:
> - Some HDFS functions managing block order are static 
> (MD5MD5CRC32FileChecksum). Implementing the hook in the DFSClient would 
> require implementing the fix partially, changing the DFS interface to make 
> this function non-static, or making the hook static. None of these solutions 
> is very clean.
> - Adding a proxy allows us to put all the code in HBase, simplifying 
> dependency management.
> Nevertheless, it would be better to have this in HDFS. But this solution 
> allows us to target only the latest version, and could allow minimal 
> interface changes such as non-static methods.
> Moreover, writing the blocks to a non-local DN would be an even better 
> solution long term.
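The retained solution (client-side reordering of replica locations) can be sketched as follows (Python; a simplified stand-in for the actual proxy over the HDFS client, with hypothetical host names):

```python
# Simplified sketch: for each block, push replicas located on the dead
# regionserver's host to the end of the candidate list, so the reader
# tries live datanodes first and only falls back to the dead one last.
def reorder_replicas(blocks, dead_rs_host):
    """blocks: list of per-block replica host lists, in namenode order."""
    return [
        # key is False (sorts first) for live hosts, True for the dead one;
        # Python's sort is stable, so the namenode order is otherwise kept.
        sorted(hosts, key=lambda h: h == dead_rs_host)
        for hosts in blocks
    ]

located = [["dn2", "dead-rs", "dn3"], ["dead-rs", "dn1", "dn4"]]
print(reorder_replicas(located, "dead-rs"))
# [['dn2', 'dn3', 'dead-rs'], ['dn1', 'dn4', 'dead-rs']]
```

This avoids the socket-timeout wait on the dead host entirely unless every other replica fails first.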

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6435) Reading WAL files after a recovery leads to time lost in HDFS timeouts when using dead datanodes

2012-08-21 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6435:
---

Attachment: 6435.v14.patch

> Reading WAL files after a recovery leads to time lost in HDFS timeouts when 
> using dead datanodes
> 
>
> Key: HBASE-6435
> URL: https://issues.apache.org/jira/browse/HBASE-6435
> Project: HBase
>  Issue Type: Improvement
>  Components: master, regionserver
>Affects Versions: 0.96.0
>Reporter: nkeywal
>Assignee: nkeywal
> Fix For: 0.96.0
>
> Attachments: 6435.unfinished.patch, 6435.v10.patch, 6435.v10.patch, 
> 6435.v12.patch, 6435.v12.patch, 6435.v12.patch, 6435-v12.txt, 6435.v13.patch, 
> 6435.v14.patch, 6435.v2.patch, 6435.v7.patch, 6435.v8.patch, 6435.v9.patch, 
> 6435.v9.patch, 6535.v11.patch
>
>
> HBase writes a Write-Ahead-Log to recover from hardware failure.
> This log is written with 'append' on HDFS.
> Through ZooKeeper, HBase gets informed, usually within 30s, that it should 
> start the recovery process, which means reading the Write-Ahead-Log to 
> replay the edits on the other servers.
> In standard deployments, HBase processes (regionservers) are deployed on the 
> same boxes as the datanodes.
> This means that when a box stops, we've actually lost one copy of the edits, 
> as we lost both the regionserver and the datanode.
> As HDFS marks a node as dead only after ~10 minutes, the dead node still 
> appears available when we try to read the blocks to recover. As such, we 
> delay the recovery process by 60 seconds, as the read will usually fail with 
> a socket timeout. If the file is still open for writing, this adds an extra 
> 20s, plus a risk of losing edits if we connect with ipc to the dead DN.
> Possible solutions are:
> - shorter dead-datanode detection by the NN. Requires a NN code change.
> - better dead-datanode management in DFSClient. Requires a DFS code change.
> - NN customisation to write the WAL files on another DN instead of the local 
> one.
> - reordering the blocks returned by the NN on the client side to put the 
> blocks on the same DN as the dead RS at the end of the priority queue. 
> Requires a DFS code change or a kind of workaround.
> The solution retained is the last one. Compared to what was discussed on the 
> mailing list, the proposed patch does not modify HDFS source code but adds a 
> proxy, for two reasons:
> - Some HDFS functions managing block order are static 
> (MD5MD5CRC32FileChecksum). Implementing the hook in the DFSClient would 
> require implementing the fix partially, changing the DFS interface to make 
> this function non-static, or making the hook static. None of these solutions 
> is very clean.
> - Adding a proxy allows us to put all the code in HBase, simplifying 
> dependency management.
> Nevertheless, it would be better to have this in HDFS. But this solution 
> allows us to target only the latest version, and could allow minimal 
> interface changes such as non-static methods.
> Moreover, writing the blocks to a non-local DN would be an even better 
> solution long term.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6438) RegionAlreadyInTransitionException needs to give more info to avoid assignment inconsistencies

2012-08-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438556#comment-13438556
 ] 

Hadoop QA commented on HBASE-6438:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12540041/HBASE-6438_trunk.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 5 javac compiler warnings (more than 
the trunk's current 4 warnings).

-1 findbugs.  The patch appears to introduce 6 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.TestDrainingServer

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2637//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2637//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2637//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2637//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2637//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2637//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2637//console

This message is automatically generated.

> RegionAlreadyInTransitionException needs to give more info to avoid 
> assignment inconsistencies
> --
>
> Key: HBASE-6438
> URL: https://issues.apache.org/jira/browse/HBASE-6438
> Project: HBase
>  Issue Type: Bug
>Reporter: ramkrishna.s.vasudevan
>Assignee: rajeshbabu
> Attachments: HBASE-6438_trunk.patch
>
>
> Seeing some of the recent issues in region assignment, 
> RegionAlreadyInTransitionException is one reason after which the region 
> assignment may or may not happen (in the sense that we need to wait for the TM 
> to assign).
> In HBASE-6317 we hit one problem due to RegionAlreadyInTransitionException on 
> master restart.
> Consider the following case: due to some reason, like a master restart or an 
> external assign call, we try to assign a region that is already being opened 
> on an RS.
> Now the next call to assign has already changed the state of the znode, so the 
> current open in progress on the RS is affected and fails.  The second 
> assignment that started also fails with a RAITE exception.  Finally, neither 
> assignment carries on.  The idea is to find out whether such a RAITE 
> exception can be retried or not.
> Here we have the following cases:
> -> The znode is yet to be transitioned from OFFLINE to OPENING on the RS.
> -> The RS may be in the openRegion step.
> -> The RS may be trying to transition from OPENING to OPENED.
> -> The RS is yet to add the region to its online regions.
> Any failure in openRegion() and updateMeta() moves the znode to FAILED_OPEN, 
> so in those cases getting a RAITE should be ok.  In the other cases the 
> assignment is stopped.
> The idea is to add the current state of the region assignment to the RIT map 
> on the RS side; using that info we can determine whether the assignment can be 
> retried or not on getting a RAITE.
> Considering the current work going on in the AM, please do share whether this 
> is needed at least in the 0.92/0.94 versions.
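The retry decision described above can be sketched in plain Java (all names are hypothetical; the actual patch would live in the AssignmentManager/regionserver code):

```java
public class RaiteRetrySketch {
  // Stages an open can be in when the master receives a
  // RegionAlreadyInTransitionException (RAITE).
  enum OpeningStage { OFFLINE_TO_OPENING, OPEN_REGION, UPDATE_META,
                      OPENING_TO_OPENED, ADD_TO_ONLINE }

  /**
   * openRegion() and updateMeta() move the znode to FAILED_OPEN on failure,
   * so a RAITE received during those stages is safe to retry; in the other
   * stages the first assignment is still progressing and must not be redone.
   */
  static boolean canRetryOnRaite(OpeningStage stage) {
    return stage == OpeningStage.OPEN_REGION || stage == OpeningStage.UPDATE_META;
  }

  public static void main(String[] args) {
    System.out.println(canRetryOnRaite(OpeningStage.OPEN_REGION));      // true
    System.out.println(canRetryOnRaite(OpeningStage.OPENING_TO_OPENED)); // false
  }
}
```

With a stage recorded in the RS-side RIT map, the master can make this call instead of giving up on every RAITE.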

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6435) Reading WAL files after a recovery leads to time lost in HDFS timeouts when using dead datanodes

2012-08-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438563#comment-13438563
 ] 

nkeywal commented on HBASE-6435:


Ok, local tests said:
Tests in error:
  testGetRowVersions(org.apache.hadoop.hbase.TestMultiVersions): Shutting down
  testScanMultipleVersions(org.apache.hadoop.hbase.TestMultiVersions): 
org.apache.hadoop.hbase.MasterNotRunningException: Can create a proxy to 
master, but it is not running

Not reproduced (tried once).

Committed revision 1375451.

> Reading WAL files after a recovery leads to time lost in HDFS timeouts when 
> using dead datanodes
> 
>
> Key: HBASE-6435
> URL: https://issues.apache.org/jira/browse/HBASE-6435
> Project: HBase
>  Issue Type: Improvement
>  Components: master, regionserver
>Affects Versions: 0.96.0
>Reporter: nkeywal
>Assignee: nkeywal
> Fix For: 0.96.0
>
> Attachments: 6435.unfinished.patch, 6435.v10.patch, 6435.v10.patch, 
> 6435.v12.patch, 6435.v12.patch, 6435.v12.patch, 6435-v12.txt, 6435.v13.patch, 
> 6435.v14.patch, 6435.v2.patch, 6435.v7.patch, 6435.v8.patch, 6435.v9.patch, 
> 6435.v9.patch, 6535.v11.patch
>
>
> HBase writes a Write-Ahead-Log to recover from hardware failure.
> This log is written with 'append' on hdfs.
> Through ZooKeeper, HBase is informed, usually within 30s, that it should start 
> the recovery process. 
> This means reading the Write-Ahead-Log to replay the edits on the other 
> servers.
> In standard deployments, HBase processes (regionservers) are deployed on the 
> same boxes as the datanodes.
> It means that when a box stops, we have actually lost one of the replicas of 
> the edits, as we lost both the regionserver and the datanode.
> As HDFS marks a node as dead only after ~10 minutes, it still appears as 
> available when we try to read the blocks to recover. As such, we delay the 
> recovery process by 60 seconds, as the read will usually fail with a socket 
> timeout. If the file is still open for writing, it adds an extra 20s, plus a 
> risk of losing edits if we connect with ipc to the dead DN.
> Possible solutions are:
> - shorter dead-datanode detection by the NN. Requires a NN code change.
> - better dead-datanode management in DFSClient. Requires a DFS code change.
> - NN customisation to write the WAL files on another DN instead of the local 
> one.
> - reordering the blocks returned by the NN on the client side to put the 
> blocks on the same DN as the dead RS at the end of the priority queue. 
> Requires a DFS code change or a kind of workaround.
> The solution retained is the last one. Compared to what was discussed on the 
> mailing list, the proposed patch does not modify HDFS source code but adds a 
> proxy, for two reasons:
> - Some HDFS functions managing block order are static 
> (MD5MD5CRC32FileChecksum). Implementing the hook in the DFSClient would 
> require partially implementing the fix, changing the DFS interface to make 
> this function non-static, or making the hook static. None of these solutions 
> is very clean. 
> - Adding a proxy allows us to put all the code in HBase, simplifying 
> dependency management.
> Nevertheless, it would be better to have this in HDFS. But this solution 
> allows us to target only the latest version, and it could allow minimal 
> interface changes such as non-static methods.
> Moreover, writing the blocks to a non-local DN would be an even better 
> long-term solution.
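The retained solution, reordering replica locations on the client side, can be sketched as follows (hypothetical types and names; the real patch hooks the HDFS client's located-block list through a proxy):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockReorderSketch {

  /**
   * Reorder replica locations so that any replica hosted on the (likely dead)
   * datanode co-located with the crashed region server is tried last.
   * The relative order of the other replicas is preserved.
   */
  static List<String> reorder(List<String> replicaHosts, String deadHost) {
    List<String> preferred = new ArrayList<>();
    List<String> deprioritized = new ArrayList<>();
    for (String host : replicaHosts) {
      (host.equals(deadHost) ? deprioritized : preferred).add(host);
    }
    preferred.addAll(deprioritized);
    return preferred;
  }

  public static void main(String[] args) {
    List<String> hosts = Arrays.asList("dn1", "dn2-dead", "dn3");
    System.out.println(reorder(hosts, "dn2-dead")); // [dn1, dn3, dn2-dead]
  }
}
```

Because the dead replica is only deprioritized, not dropped, the client still falls back to it if all other replicas fail.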





[jira] [Updated] (HBASE-6435) Reading WAL files after a recovery leads to time lost in HDFS timeouts when using dead datanodes

2012-08-21 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6435:
---

 Description: 
HBase writes a Write-Ahead-Log to recover from hardware failure. This log is 
written on hdfs.
Through ZooKeeper, HBase is informed, usually within 30s, that it should start 
the recovery process. 
This means reading the Write-Ahead-Log to replay the edits on the other servers.

In standard deployments, HBase processes (regionservers) are deployed on the 
same boxes as the datanodes.

It means that when a box stops, we have actually lost one of the replicas of the 
edits, as we lost both the regionserver and the datanode.

As HDFS marks a node as dead only after ~10 minutes, it still appears as 
available when we try to read the blocks to recover. As such, we delay the 
recovery process by 60 seconds, as the read will usually fail with a socket 
timeout. If the file is still open for writing, it adds an extra 20s, plus a 
risk of losing edits if we connect with ipc to the dead DN.


Possible solutions are:
- shorter dead-datanode detection by the NN. Requires a NN code change.
- better dead-datanode management in DFSClient. Requires a DFS code change.
- NN customisation to write the WAL files on another DN instead of the local 
one.
- reordering the blocks returned by the NN on the client side to put the blocks 
on the same DN as the dead RS at the end of the priority queue. Requires a DFS 
code change or a kind of workaround.

The solution retained is the last one. Compared to what was discussed on the 
mailing list, the proposed patch does not modify HDFS source code but adds a 
proxy, for two reasons:
- Some HDFS functions managing block order are static 
(MD5MD5CRC32FileChecksum). Implementing the hook in the DFSClient would require 
partially implementing the fix, changing the DFS interface to make this function 
non-static, or making the hook static. None of these solutions is very clean. 
- Adding a proxy allows us to put all the code in HBase, simplifying dependency 
management.

Nevertheless, it would be better to have this in HDFS. But this solution allows 
us to target only the latest version, and it could allow minimal interface 
changes such as non-static methods.

Moreover, writing the blocks to a non-local DN would be an even better 
long-term solution.






  was:
HBase writes a Write-Ahead-Log to revover from hardware failure.
This log is written with 'append' on hdfs.
Through ZooKeeper, HBase gets informed usually in 30s that it should start the 
recovery process. 
This means reading the Write-Ahead-Log to replay the edits on the other servers.

In standards deployments, HBase process (regionserver) are deployed on the same 
box as the datanodes.

It means that when the box stops, we've actually lost one of the edits, as we 
lost both the regionserver and the datanode.

As HDFS marks a node as dead after ~10 minutes, it appears as available when we 
try to read the blocks to recover. As such, we are delaying the recovery 
process by 60 seconds as the read will usually fail with a socket timeout. If 
the file is still opened for writing, it adds an extra 20s + a risk of losing 
edits if we connect with ipc to the dead DN.


Possible solutions are:
- shorter dead datanodes detection by the NN. Requires a NN code change.
- better dead datanodes management in DFSClient. Requires a DFS code change.
- NN customisation to write the WAL files on another DN instead of the local 
one.
- reordering the blocks returned by the NN on the client side to put the blocks 
on the same DN as the dead RS at the end of the priority queue. Requires a DFS 
code change or a kind of workaround.

The solution retained is the last one. Compared to what was discussed on the 
mailing list, the proposed patch will not modify HDFS source code but adds a 
proxy. This for two reasons:
- Some HDFS functions managing block orders are static 
(MD5MD5CRC32FileChecksum). Implementing the hook in the DFSClient would require 
to implement partially the fix, change the DFS interface to make this function 
non static, or put the hook static. None of these solution is very clean. 
- Adding a proxy allows to put all the code in HBase, simplifying dependency 
management.

Nevertheless, it would be better to have this in HDFS. But this solution allows 
to target the last version only, and this could allow minimal interface changes 
such as non static methods.

Moreover, writing the blocks to the non local DN would be an even better 
solution long term.






Release Note: 
This JIRA adds a hook in the HDFS client to reorder the replica locations for 
HLog files. The default ordering in HDFS is rack-aware + random. When reading an 
HLog file, we prefer not to use the replica on the same server as the region 
server that wrote the HLog: this server is likely to be unavailable, and this 
would delay the HBase recovery by one minute. This occurs because the recovery

[jira] [Commented] (HBASE-6435) Reading WAL files after a recovery leads to time lost in HDFS timeouts when using dead datanodes

2012-08-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438567#comment-13438567
 ] 

nkeywal commented on HBASE-6435:


Also committed revision 1375454, as I forgot to add the new test to svn 
initially.

> Reading WAL files after a recovery leads to time lost in HDFS timeouts when 
> using dead datanodes
> 
>
> Key: HBASE-6435
> URL: https://issues.apache.org/jira/browse/HBASE-6435
> Project: HBase
>  Issue Type: Improvement
>  Components: master, regionserver
>Affects Versions: 0.96.0
>Reporter: nkeywal
>Assignee: nkeywal
> Fix For: 0.96.0
>
> Attachments: 6435.unfinished.patch, 6435.v10.patch, 6435.v10.patch, 
> 6435.v12.patch, 6435.v12.patch, 6435.v12.patch, 6435-v12.txt, 6435.v13.patch, 
> 6435.v14.patch, 6435.v2.patch, 6435.v7.patch, 6435.v8.patch, 6435.v9.patch, 
> 6435.v9.patch, 6535.v11.patch
>
>
> HBase writes a Write-Ahead-Log to recover from hardware failure. This log is 
> written on hdfs.
> Through ZooKeeper, HBase is informed, usually within 30s, that it should start 
> the recovery process. 
> This means reading the Write-Ahead-Log to replay the edits on the other 
> servers.
> In standard deployments, HBase processes (regionservers) are deployed on the 
> same boxes as the datanodes.
> It means that when a box stops, we have actually lost one of the replicas of 
> the edits, as we lost both the regionserver and the datanode.
> As HDFS marks a node as dead only after ~10 minutes, it still appears as 
> available when we try to read the blocks to recover. As such, we delay the 
> recovery process by 60 seconds, as the read will usually fail with a socket 
> timeout. If the file is still open for writing, it adds an extra 20s, plus a 
> risk of losing edits if we connect with ipc to the dead DN.
> Possible solutions are:
> - shorter dead-datanode detection by the NN. Requires a NN code change.
> - better dead-datanode management in DFSClient. Requires a DFS code change.
> - NN customisation to write the WAL files on another DN instead of the local 
> one.
> - reordering the blocks returned by the NN on the client side to put the 
> blocks on the same DN as the dead RS at the end of the priority queue. 
> Requires a DFS code change or a kind of workaround.
> The solution retained is the last one. Compared to what was discussed on the 
> mailing list, the proposed patch does not modify HDFS source code but adds a 
> proxy, for two reasons:
> - Some HDFS functions managing block order are static 
> (MD5MD5CRC32FileChecksum). Implementing the hook in the DFSClient would 
> require partially implementing the fix, changing the DFS interface to make 
> this function non-static, or making the hook static. None of these solutions 
> is very clean. 
> - Adding a proxy allows us to put all the code in HBase, simplifying 
> dependency management.
> Nevertheless, it would be better to have this in HDFS. But this solution 
> allows us to target only the latest version, and it could allow minimal 
> interface changes such as non-static methods.
> Moreover, writing the blocks to a non-local DN would be an even better 
> long-term solution.





[jira] [Commented] (HBASE-6414) Remove the WritableRpcEngine & associated Invocation classes

2012-08-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438573#comment-13438573
 ] 

nkeywal commented on HBASE-6414:


For this:
{noformat}
   /**
-   * Construct an IPC client whose values are of the given {@link Writable}
+   * Construct an IPC client whose values are of the {@link Message}
* class.
* @param valueClass value class
* @param conf configuration
* @param factory socket factory
*/
-  public HBaseClient(Class valueClass, Configuration conf,
-  SocketFactory factory) {
-this.valueClass = valueClass;
+  public HBaseClient(Configuration conf, SocketFactory factory) {
{noformat}

The javadoc was not updated, nor in ipc/ClientCache.java.
I'm fixing it in HBASE-6364. I didn't check whether there are other issues like this.

> Remove the WritableRpcEngine & associated Invocation classes
> 
>
> Key: HBASE-6414
> URL: https://issues.apache.org/jira/browse/HBASE-6414
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 0.96.0
>Reporter: Devaraj Das
>Assignee: Devaraj Das
> Fix For: 0.96.0
>
> Attachments: 6414-1.patch.txt, 6414-3.patch.txt, 6414-4.patch.txt, 
> 6414-4.patch.txt, 6414-5.patch.txt, 6414-5.patch.txt, 6414-5.patch.txt, 
> 6414-6.patch.txt, 6414-6.patch.txt, 6414-6.txt, 6414-initial.patch.txt, 
> 6414-initial.patch.txt, 6414-v7.txt
>
>
> Remove the WritableRpcEngine & Invocation classes once HBASE-5705 gets 
> committed and all the protocols are rebased to use PB.
> Raising this jira in advance.





[jira] [Commented] (HBASE-6435) Reading WAL files after a recovery leads to time lost in HDFS timeouts when using dead datanodes

2012-08-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438575#comment-13438575
 ] 

Hadoop QA commented on HBASE-6435:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12541734/6435.v14.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 11 new or modified tests.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 5 javac compiler warnings (more than 
the trunk's current 4 warnings).

-1 findbugs.  The patch appears to introduce 6 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.io.encoding.TestUpgradeFromHFileV1ToEncoding

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2638//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2638//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2638//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2638//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2638//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2638//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2638//console


> Reading WAL files after a recovery leads to time lost in HDFS timeouts when 
> using dead datanodes
> 
>
> Key: HBASE-6435
> URL: https://issues.apache.org/jira/browse/HBASE-6435
> Project: HBase
>  Issue Type: Improvement
>  Components: master, regionserver
>Affects Versions: 0.96.0
>Reporter: nkeywal
>Assignee: nkeywal
> Fix For: 0.96.0
>
> Attachments: 6435.unfinished.patch, 6435.v10.patch, 6435.v10.patch, 
> 6435.v12.patch, 6435.v12.patch, 6435.v12.patch, 6435-v12.txt, 6435.v13.patch, 
> 6435.v14.patch, 6435.v2.patch, 6435.v7.patch, 6435.v8.patch, 6435.v9.patch, 
> 6435.v9.patch, 6535.v11.patch
>
>
> HBase writes a Write-Ahead-Log to recover from hardware failure. This log is 
> written on hdfs.
> Through ZooKeeper, HBase is informed, usually within 30s, that it should start 
> the recovery process. 
> This means reading the Write-Ahead-Log to replay the edits on the other 
> servers.
> In standard deployments, HBase processes (regionservers) are deployed on the 
> same boxes as the datanodes.
> It means that when a box stops, we have actually lost one of the replicas of 
> the edits, as we lost both the regionserver and the datanode.
> As HDFS marks a node as dead only after ~10 minutes, it still appears as 
> available when we try to read the blocks to recover. As such, we delay the 
> recovery process by 60 seconds, as the read will usually fail with a socket 
> timeout. If the file is still open for writing, it adds an extra 20s, plus a 
> risk of losing edits if we connect with ipc to the dead DN.
> Possible solutions are:
> - shorter dead-datanode detection by the NN. Requires a NN code change.
> - better dead-datanode management in DFSClient. Requires a DFS code change.
> - NN customisation to write the WAL files on another DN instead of the local 
> one.
> - reordering the blocks returned by the NN on the client side to put the 
> blocks on the same DN as the dead RS at the end of the priority queue. 
> Requires a DFS code change or a kind of workaround.
> The solution retained is the last one. Compared to what was discussed on the 
> mailing list, the proposed patch does not modify HDFS source code but adds a 
> proxy, for two reasons:
> - Some HDFS functions managing block order are static 
> (MD5MD5CRC32FileChecksum). Implementing the hook in the DFSClient would 
> require partially implementing the fix, changing the DFS interface to make 
> this function non-static, or making the hook static. None of these solutions 
> is very clean. 
> - Adding a proxy allows us to put all the code in HBase, simplifying 
> dependency management.
> Nevertheless, it would be better to have this in HDFS. But this solution 
>

[jira] [Commented] (HBASE-5449) Support for wire-compatible security functionality

2012-08-21 Thread Matteo Bertozzi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438586#comment-13438586
 ] 

Matteo Bertozzi commented on HBASE-5449:


the generated protobuf mergeFrom() code that parses the enum looks like this:
{code}
if (message contains enum) {
  get the value of the enum
  if (the value of the enum is not in the current proto definition)
    mark the field as missing and add it to the unknown fields
  else
    set the enum value
}
{code}

so basically, if you have an enum:
{code}
message MyMessage {
  enum MyEnum {
    ENUM_VALUE_1 = 1;
    ENUM_VALUE_2 = 2;
    ENUM_VALUE_3 = 3;
  }
  required MyEnum myEnum = 1;
}
{code}

and then you switch to:
{code}
...
enum MyEnum {
  ENUM_VALUE_1 = 1;
  //ENUM_VALUE_2 = 2;
  ENUM_VALUE_3 = 3;
  ENUM_VALUE_4 = 4;
}
...
{code}
The old client can check message.hasMyEnum(). 
If it is false, the field is "missing", which in this case means the client is 
not able to find the enum value in its current proto.
So protobuf is able to handle the case without problems. 
The conversion code from protobuf to the real object is responsible for 
preserving the compatibility...

I'll fix the patch to avoid throwing the exception in case of a permission 
action enum mismatch!
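This behaviour can be seen without pulling in protobuf (all names below are hypothetical; in real generated code the check is message.hasMyEnum()): an old client maps the wire number to its own enum values, and an unknown number behaves like a missing field.

```java
import java.util.HashMap;
import java.util.Map;

public class EnumCompatSketch {
  // The old client's view of the enum: it does not know about the
  // newly added value 4, and value 2 was removed in the new proto.
  enum MyEnum {
    ENUM_VALUE_1(1), ENUM_VALUE_3(3);
    final int number;
    MyEnum(int n) { this.number = n; }
  }

  static final Map<Integer, MyEnum> BY_NUMBER = new HashMap<>();
  static {
    for (MyEnum e : MyEnum.values()) BY_NUMBER.put(e.number, e);
  }

  /** Returns null for a wire value this client does not know,
   *  i.e. the equivalent of hasMyEnum() == false. */
  static MyEnum decode(int wireNumber) {
    return BY_NUMBER.get(wireNumber);
  }

  public static void main(String[] args) {
    System.out.println(decode(1)); // ENUM_VALUE_1: known value, field is "set"
    System.out.println(decode(4)); // null: unknown value, treat field as missing
  }
}
```

The conversion layer then decides what "missing" means for the application, which is exactly where the compatibility handling described above belongs.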

> Support for wire-compatible security functionality
> --
>
> Key: HBASE-5449
> URL: https://issues.apache.org/jira/browse/HBASE-5449
> Project: HBase
>  Issue Type: Sub-task
>  Components: ipc, master, migration, regionserver
>Reporter: Todd Lipcon
>Assignee: Matteo Bertozzi
> Attachments: AccessControl_protos.patch, HBASE-5449-v0.patch, 
> HBASE-5449-v1.patch
>
>






[jira] [Updated] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6364:
---

Attachment: 6364.v11.nolargetest.patch

Version that will be committed if the local tests (in progress) are ok.

> Powering down the server host holding the .META. table causes HBase Client to 
> take excessively long to recover and connect to reassigned .META. table
> -
>
> Key: HBASE-6364
> URL: https://issues.apache.org/jira/browse/HBASE-6364
> Project: HBase
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.90.6, 0.92.1, 0.94.0
>Reporter: Suraj Varma
>Assignee: nkeywal
>  Labels: client
> Fix For: 0.96.0
>
> Attachments: 6364-host-serving-META.v1.patch, 
> 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 
> 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 
> 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 
> 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt
>
>
> When a server host with a Region Server holding the .META. table is powered 
> down on a live cluster, while the HBase cluster itself detects and reassigns 
> the .META. table, connected HBase Clients take an excessively long time to 
> detect this and re-discover the reassigned .META. 
> Workaround: Decrease ipc.socket.timeout on the HBase Client side to a low 
> value (the default is 20s, leading to a 35-minute recovery time; we were able 
> to get acceptable results with 100ms, getting a 3-minute recovery). 
> This was found during some hardware failure testing scenarios. 
> Test Case:
> 1) Apply load via client app on HBase cluster for several minutes
> 2) Power down the region server holding the .META. server (i.e. power off ... 
> and keep it off)
> 3) Measure how long it takes for cluster to reassign META table and for 
> client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
> and DN on that host).
> Observation:
> 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
> recover (i.e. for the thread count to go back to normal) - no client calls 
> are serviced - they just back up on a synchronized method (see #2 below)
> 2) All the client app threads queue up behind the 
> oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
> After taking several thread dumps we found that the thread within this 
> synchronized method was blocked on  NetUtils.connect(this.socket, 
> remoteId.getAddress(), getSocketTimeout(conf));
> The client thread that gets the synchronized lock would try to connect to the 
> dead RS (till socket times out after 20s), retries, and then the next thread 
> gets in and so forth in a serial manner.
> Workaround:
> ---
> Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
> (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
> the client threads recovered in a couple of minutes by failing fast and 
> re-discovering the .META. table on a reassigned RS.
> Assumption: This ipc.socket.timeout is only ever used during the initial 
> "HConnection" setup via the NetUtils.connect and should only ever be used 
> when connectivity to a region server is lost and needs to be re-established. 
> i.e. it does not affect the normal "RPC" activity, as this is just the connect 
> timeout.
> During RS GC periods, any _new_ clients trying to connect will fail and will 
> require .META. table re-lookups.
> This above timeout workaround is only for the HBase client side.
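The workaround above boils down to a client-side hbase-site.xml entry (the 1000 ms value is one of those tried in the report; tune it to your environment):

```xml
<!-- client-side hbase-site.xml: fail fast when connecting to a dead RS -->
<property>
  <name>ipc.socket.timeout</name>
  <!-- milliseconds; the default of 20000 led to ~35 minute client recovery -->
  <value>1000</value>
</property>
```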





[jira] [Commented] (HBASE-6435) Reading WAL files after a recovery leads to time lost in HDFS timeouts when using dead datanodes

2012-08-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438599#comment-13438599
 ] 

Hudson commented on HBASE-6435:
---

Integrated in HBase-TRUNK #3247 (See 
[https://builds.apache.org/job/HBase-TRUNK/3247/])
HBASE-6435 Reading WAL files after a recovery leads to time lost in HDFS 
timeouts when using dead datanodes - addendum TestBlockReorder.java (Revision 
1375454)
HBASE-6435 Reading WAL files after a recovery leads to time lost in HDFS 
timeouts when using dead datanodes (Revision 1375451)

 Result = FAILURE
nkeywal : 
Files : 
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/fs
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/fs/TestBlockReorder.java

nkeywal : 
Files : 
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/ServerName.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/fs/HFileSystem.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/MiniHBaseCluster.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLog.java


> Reading WAL files after a recovery leads to time lost in HDFS timeouts when 
> using dead datanodes
> 
>
> Key: HBASE-6435
> URL: https://issues.apache.org/jira/browse/HBASE-6435
> Project: HBase
>  Issue Type: Improvement
>  Components: master, regionserver
>Affects Versions: 0.96.0
>Reporter: nkeywal
>Assignee: nkeywal
> Fix For: 0.96.0
>
> Attachments: 6435.unfinished.patch, 6435.v10.patch, 6435.v10.patch, 
> 6435.v12.patch, 6435.v12.patch, 6435.v12.patch, 6435-v12.txt, 6435.v13.patch, 
> 6435.v14.patch, 6435.v2.patch, 6435.v7.patch, 6435.v8.patch, 6435.v9.patch, 
> 6435.v9.patch, 6535.v11.patch
>
>
> HBase writes a Write-Ahead-Log to recover from hardware failure. This log is 
> written on hdfs.
> Through ZooKeeper, HBase is informed, usually within 30s, that it should start 
> the recovery process. 
> This means reading the Write-Ahead-Log to replay the edits on the other 
> servers.
> In standard deployments, HBase processes (regionservers) are deployed on the 
> same boxes as the datanodes.
> It means that when a box stops, we have actually lost one of the replicas of 
> the edits, as we lost both the regionserver and the datanode.
> As HDFS marks a node as dead only after ~10 minutes, it still appears as 
> available when we try to read the blocks to recover. As such, we delay the 
> recovery process by 60 seconds, as the read will usually fail with a socket 
> timeout. If the file is still open for writing, it adds an extra 20s, plus a 
> risk of losing edits if we connect with ipc to the dead DN.
> Possible solutions are:
> - shorter dead-datanode detection by the NN. Requires a NN code change.
> - better dead-datanode management in DFSClient. Requires a DFS code change.
> - NN customisation to write the WAL files on another DN instead of the local 
> one.
> - reordering the blocks returned by the NN on the client side to put the 
> blocks on the same DN as the dead RS at the end of the priority queue. 
> Requires a DFS code change or a kind of workaround.
> The solution retained is the last one. Compared to what was discussed on the 
> mailing list, the proposed patch does not modify HDFS source code but adds a 
> proxy, for two reasons:
> - Some HDFS functions managing block order are static 
> (MD5MD5CRC32FileChecksum). Implementing the hook in the DFSClient would 
> require partially implementing the fix, changing the DFS interface to make 
> this function non-static, or making the hook static. None of these solutions 
> is very clean. 
> - Adding a proxy allows us to put all the code in HBase, simplifying 
> dependency management.
> Nevertheless, it would be better to have this in HDFS. But this solution 
> allows us to target only the latest version, and it could allow minimal 
> interface changes such as non-static methods.
> Moreover, writing the blocks to a non-local DN would be an even better 
> long-term solution.





[jira] [Commented] (HBASE-6435) Reading WAL files after a recovery leads to time lost in HDFS timeouts when using dead datanodes

2012-08-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438605#comment-13438605
 ] 

nkeywal commented on HBASE-6435:


Test results (2 failures / ±0):
org.apache.hadoop.hbase.TestMultiVersions.testGetRowVersions
org.apache.hadoop.hbase.TestMultiVersions.testScanMultipleVersions

Hum. It's the same error as the one I had in my first local test. But it's so 
unrelated, and moreover we had this error in build #3242 as well, so I think 
it's ok. Marking as resolved.

> Reading WAL files after a recovery leads to time lost in HDFS timeouts when 
> using dead datanodes
> 
>
> Key: HBASE-6435
> URL: https://issues.apache.org/jira/browse/HBASE-6435
> Project: HBase
>  Issue Type: Improvement
>  Components: master, regionserver
>Affects Versions: 0.96.0
>Reporter: nkeywal
>Assignee: nkeywal
> Fix For: 0.96.0
>
> Attachments: 6435.unfinished.patch, 6435.v10.patch, 6435.v10.patch, 
> 6435.v12.patch, 6435.v12.patch, 6435.v12.patch, 6435-v12.txt, 6435.v13.patch, 
> 6435.v14.patch, 6435.v2.patch, 6435.v7.patch, 6435.v8.patch, 6435.v9.patch, 
> 6435.v9.patch, 6535.v11.patch
>
>
> HBase writes a Write-Ahead-Log to recover from hardware failure. This log is 
> written on HDFS.
> Through ZooKeeper, HBase is informed, usually within 30s, that it should 
> start the recovery process. 
> This means reading the Write-Ahead-Log to replay the edits on the other 
> servers.
> In standard deployments, HBase processes (regionservers) are deployed on the 
> same boxes as the datanodes.
> It means that when a box stops, we've actually lost one copy of the edits, as 
> we lost both the regionserver and the datanode.
> As HDFS marks a node as dead only after ~10 minutes, it still appears 
> available when we try to read the blocks to recover. As such, we delay the 
> recovery process by 60 seconds, as the read will usually fail with a socket 
> timeout. If the file is still open for writing, it adds an extra 20s plus a 
> risk of losing edits if we connect over ipc to the dead DN.
> Possible solutions are:
> - shorter dead-datanode detection by the NN. Requires a NN code change.
> - better dead-datanode management in the DFSClient. Requires a DFS code 
> change.
> - NN customisation to write the WAL files on another DN instead of the local 
> one.
> - reordering the blocks returned by the NN on the client side to put the 
> blocks on the same DN as the dead RS at the end of the priority queue. 
> Requires a DFS code change or a kind of workaround.
> The solution retained is the last one. Compared to what was discussed on the 
> mailing list, the proposed patch does not modify HDFS source code but adds a 
> proxy. This is for two reasons:
> - Some HDFS functions managing block orders are static 
> (MD5MD5CRC32FileChecksum). Implementing the hook in the DFSClient would 
> require partially implementing the fix, changing the DFS interface to make 
> this function non-static, or making the hook static. None of these solutions 
> is very clean. 
> - Adding a proxy allows putting all the code in HBase, simplifying dependency 
> management.
> Nevertheless, it would be better to have this in HDFS. But this solution 
> allows targeting only the latest version, and this could allow minimal 
> interface changes such as non-static methods.
> Moreover, writing the blocks to a non-local DN would be an even better 
> long-term solution.
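The retained approach (a client-side hook that reorders the located blocks so the dead region server's colocated datanode is tried last) can be sketched roughly as follows. This is an illustrative, self-contained sketch, not the actual patch: the class and method names are invented here, and the real change hooks into HBase's filesystem/DFSClient interaction rather than a standalone list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal sketch of the block-reordering idea: replicas hosted on the
// datanode colocated with the dead region server are moved to the end of
// the candidate list, so the client tries healthy replicas first.
public class BlockReorderSketch {

    // A located block replica, reduced to the only field needed here.
    static final class Replica {
        final String host;
        Replica(String host) { this.host = host; }
    }

    // Push replicas on the dead region server's host to the back, keeping
    // the relative order of the others (List.sort is a stable sort).
    static List<Replica> reorder(List<Replica> replicas, String deadHost) {
        List<Replica> out = new ArrayList<>(replicas);
        out.sort(Comparator.comparing(r -> r.host.equals(deadHost) ? 1 : 0));
        return out;
    }

    public static void main(String[] args) {
        List<Replica> replicas = List.of(
            new Replica("dn1"), new Replica("deadbox"), new Replica("dn2"));
        for (Replica r : reorder(replicas, "deadbox")) {
            System.out.println(r.host); // prints dn1, dn2, deadbox
        }
    }
}
```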





[jira] [Updated] (HBASE-6435) Reading WAL files after a recovery leads to time lost in HDFS timeouts when using dead datanodes

2012-08-21 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6435:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)






[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438612#comment-13438612
 ] 

nkeywal commented on HBASE-6364:


Committed revision 1375473.

After another fishy, non-reproduced error on 
org.apache.hadoop.hbase.TestMultiVersions.

> Powering down the server host holding the .META. table causes HBase Client to 
> take excessively long to recover and connect to reassigned .META. table
> -
>
> Key: HBASE-6364
> URL: https://issues.apache.org/jira/browse/HBASE-6364
> Project: HBase
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.90.6, 0.92.1, 0.94.0
>Reporter: Suraj Varma
>Assignee: nkeywal
>  Labels: client
> Fix For: 0.96.0
>
> Attachments: 6364-host-serving-META.v1.patch, 
> 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 
> 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 
> 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 
> 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt
>
>
> When a server host with a Region Server holding the .META. table is powered 
> down on a live cluster, the HBase cluster itself detects and reassigns the 
> .META. table, but connected HBase clients take an excessively long time to 
> detect this and re-discover the reassigned .META. 
> Workaround: Decrease ipc.socket.timeout on the HBase client side to a low 
> value (the default is 20s, leading to a 35-minute recovery time; we were able 
> to get acceptable results with 100ms, giving a 3-minute recovery). 
> This was found during some hardware failure testing scenarios. 
> Test Case:
> 1) Apply load via a client app on the HBase cluster for several minutes
> 2) Power down the region server holding the .META. table (i.e. power off ... 
> and keep it off)
> 3) Measure how long it takes for the cluster to reassign the META table and 
> for client threads to re-lookup and re-orient to the lesser cluster (minus 
> the RS and DN on that host).
> Observation:
> 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
> recover (i.e. for the thread count to go back to normal) - no client calls 
> are serviced - they just back up on a synchronized method (see #2 below)
> 2) All the client app threads queue up behind the 
> oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
> After taking several thread dumps, we found that the thread within this 
> synchronized method was blocked on NetUtils.connect(this.socket, 
> remoteId.getAddress(), getSocketTimeout(conf));
> The client thread that gets the synchronized lock tries to connect to the 
> dead RS (till the socket times out after 20s), retries, and then the next 
> thread gets in, and so forth in a serial manner.
> Workaround:
> ---
> The default ipc.socket.timeout is 20s. We dropped this to a low number 
> (1000 ms, 100 ms, etc.) in the client-side hbase-site.xml. With this setting, 
> the client threads recovered in a couple of minutes by failing fast and 
> re-discovering the .META. table on a reassigned RS.
> Assumption: ipc.socket.timeout is only ever used during the initial 
> "HConnection" setup via NetUtils.connect and should only ever be used when 
> connectivity to a region server is lost and needs to be re-established, 
> i.e. it does not affect normal "RPC" activity, as this is just the connect 
> timeout.
> During RS GC periods, any _new_ clients trying to connect will fail and will 
> require .META. table re-lookups.
> The above timeout workaround is only for the HBase client side.
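The client-side workaround described above amounts to a single hbase-site.xml entry such as the following. The 1000 ms value is the experimental value from this ticket, not a recommended default; tune it to your environment:

```xml
<!-- Client-side hbase-site.xml: fail fast on connections to dead
     region servers. 1000 ms is an experimental value, not a default. -->
<property>
  <name>ipc.socket.timeout</name>
  <value>1000</value>
</property>
```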





[jira] [Commented] (HBASE-6435) Reading WAL files after a recovery leads to time lost in HDFS timeouts when using dead datanodes

2012-08-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438623#comment-13438623
 ] 

Hudson commented on HBASE-6435:
---

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #140 (See 
[https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/140/])
HBASE-6435 Reading WAL files after a recovery leads to time lost in HDFS 
timeouts when using dead datanodes - addendum TestBlockReorder.java (Revision 
1375454)
HBASE-6435 Reading WAL files after a recovery leads to time lost in HDFS 
timeouts when using dead datanodes (Revision 1375451)

 Result = FAILURE
nkeywal : 
Files : 
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/fs
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/fs/TestBlockReorder.java

nkeywal : 
Files : 
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/ServerName.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/fs/HFileSystem.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/MiniHBaseCluster.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLog.java







[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438622#comment-13438622
 ] 

Hudson commented on HBASE-6364:
---

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #140 (See 
[https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/140/])
HBASE-6364 Powering down the server host holding the .META. table causes 
HBase Client to take excessively long to recover and connect to reassigned 
.META. table (Revision 1375473)

 Result = FAILURE
nkeywal : 
Files : 
* 
/hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/ClientCache.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/HBaseClient.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/EnvironmentEdgeManager.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/ManualEnvironmentEdge.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/ipc/TestHBaseClient.java







[jira] [Commented] (HBASE-6603) RegionMetricsStorage.incrNumericMetric is called too often

2012-08-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438624#comment-13438624
 ] 

Hudson commented on HBASE-6603:
---

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #140 (See 
[https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/140/])
HBASE-6603 RegionMetricsStorage.incrNumericMetric is called too often (M. 
Chen) (Revision 1375312)

 Result = FAILURE
larsh : 
Files : 
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java


> RegionMetricsStorage.incrNumericMetric is called too often
> --
>
> Key: HBASE-6603
> URL: https://issues.apache.org/jira/browse/HBASE-6603
> Project: HBase
>  Issue Type: Bug
>  Components: performance
>Reporter: Lars Hofhansl
>Assignee: M. Chen
> Fix For: 0.96.0, 0.94.2
>
> Attachments: 6503-0.96.txt, 6603-0.94.txt
>
>
> Running an HBase scan load through the profiler revealed that 
> RegionMetricsStorage.incrNumericMetric is called way too often.
> It turns out that we make this call for *each* KV in StoreScanner.next(...).
> Incrementing an AtomicLong requires expensive memory barriers.
> The observation here is that StoreScanner.next(...) can maintain a simple 
> long in its internal loop and only update the metric upon exit. Thus the 
> AtomicLong is not updated nearly as often.
> That cuts about 10% of the runtime from a scan-only load (I'll quantify this 
> better soon).
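The optimization can be illustrated with a self-contained sketch (the class and method names below are invented for illustration, not the actual StoreScanner code): accumulate the per-KV count in a plain local long inside the loop and fold it into the shared AtomicLong once on exit.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the HBASE-6603 idea: instead of an AtomicLong increment per
// KeyValue (a memory barrier per cell), accumulate in a local long and
// publish once when the scan batch completes.
public class MetricBatchSketch {
    static final AtomicLong readRequests = new AtomicLong();

    // Simulates StoreScanner.next(...) processing a batch of cells.
    static void scanBatch(int cells) {
        long local = 0;                 // cheap, stack-local accumulation
        for (int i = 0; i < cells; i++) {
            local++;                    // was: readRequests.incrementAndGet()
        }
        readRequests.addAndGet(local);  // single memory barrier per batch
    }

    public static void main(String[] args) {
        scanBatch(1000);
        scanBatch(500);
        System.out.println(readRequests.get()); // prints 1500
    }
}
```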





[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438684#comment-13438684
 ] 

Hudson commented on HBASE-6364:
---

Integrated in HBase-TRUNK #3248 (See 
[https://builds.apache.org/job/HBase-TRUNK/3248/])
HBASE-6364 Powering down the server host holding the .META. table causes 
HBase Client to take excessively long to recover and connect to reassigned 
.META. table (Revision 1375473)

 Result = FAILURE
nkeywal : 
Files : 
* 
/hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/ClientCache.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/HBaseClient.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/EnvironmentEdgeManager.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/ManualEnvironmentEdge.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/ipc/TestHBaseClient.java







[jira] [Updated] (HBASE-6604) Bump log4j to 1.2.17

2012-08-21 Thread Jonathan Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hsieh updated HBASE-6604:
--

Attachment: hbase-6604.patch

> Bump log4j to 1.2.17
> 
>
> Key: HBASE-6604
> URL: https://issues.apache.org/jira/browse/HBASE-6604
> Project: HBase
>  Issue Type: Bug
>Reporter: Jonathan Hsieh
> Attachments: hbase-6604.patch
>
>
> Hadoop bumped log4j to 1.2.17 (HADOOP-8687); we should probably do so as well.





[jira] [Assigned] (HBASE-6604) Bump log4j to 1.2.17

2012-08-21 Thread Jonathan Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hsieh reassigned HBASE-6604:
-

Assignee: Jonathan Hsieh






[jira] [Resolved] (HBASE-6604) Bump log4j to 1.2.17

2012-08-21 Thread Jonathan Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hsieh resolved HBASE-6604.
---

   Resolution: Fixed
Fix Version/s: 0.96.0

Trivial change, tested by compiling.






[jira] [Created] (HBASE-6626) Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.

2012-08-21 Thread nkeywal (JIRA)
nkeywal created HBASE-6626:
--

 Summary: Add a chapter on HDFS in the troubleshooting section of 
the HBase reference guide.
 Key: HBASE-6626
 URL: https://issues.apache.org/jira/browse/HBASE-6626
 Project: HBase
  Issue Type: Improvement
  Components: documentation
Affects Versions: 0.96.0
Reporter: nkeywal
Priority: Minor


I looked mainly at the major failure case, but here is what I have:

New sub-chapter in the existing chapter "Troubleshooting and Debugging HBase": 
"HDFS & HBase"

1) HDFS & HBase
2) Connection related settings
2.1) Number of retries
2.2) Timeouts
3) Log samples


1) HDFS & HBase
HBase uses HDFS to store its HFile, i.e. the core HBase files and the 
Write-Ahead-Logs, i.e. the files that will be used to restore the data after a 
crash.
In both cases, the reliability of HBase comes from the fact that HDFS writes 
the data to multiple locations. To be efficient, HBase needs the data to be 
available locally, hence it's highly recommended to have the HDFS datanode on 
the same machines as the HBase Region Servers.

Detailed information on how HDFS works can be found at [1].

Important features are:
 - HBase is a client application of HDFS, i.e. uses the HDFS DFSClient class. 
This class can appears in HBase logs with other HDFS client related logs.
 - Some HDFS settings are HDFS-server-side, i.e. must be set on the HDFS side, 
while some other are HDFS-client-side, i.e. must be set in HBase, while some 
other must be set in both places.
 - the HDFS writes are pipelined from one datanode to another. When writing, 
there are communications between:
- HBase and HDFS namenode, through the HDFS client classes.
- HBase and HDFS datanodes, through the HDFS client classes.
- HDFS datanode between themselves: issues on these communications are in 
HDFS logs, not HBase. HDFS writes are always local when possible. As a 
consequence, there should not be much write error in HBase Region Servers: they 
write to the local datanode. If this datanode can't replicate the blocks, it 
will appear in its logs, not in the region servers logs.
 - datanodes can be contacted through the ipc.Client interface (once again this 
class can shows up in HBase logs) and the data transfer interface (usually 
shows up as the DataNode class in the HBase logs). There are on different ports 
(defaults being: 50010 and 50020).
 - To understand exactly what's going on, you must look at the HDFS log files 
as well: the HBase logs represent only the client side.
 - With the default settings, HDFS needs 630 seconds to mark a datanode as 
dead. Until HDFS definitively decides the node is dead, it will still be tried 
by HBase or by other datanodes when writing and reading, which adds extra 
lines to the logs. This monitoring is performed by the NameNode.
 - The HDFS clients (i.e. HBase using the HDFS client code) don't fully rely 
on the NameNode: they can temporarily mark a node as dead if they got an error 
when they tried to use it.
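As an illustration, the 630 s figure mentioned above can be derived from the 
NameNode's dead-node formula. This is only a sketch assuming the default HDFS 
settings; the exact property names vary across Hadoop versions:

```python
# Sketch: how the NameNode's ~630s dead-datanode timeout falls out of
# the default HDFS settings (property names vary across Hadoop versions).
heartbeat_interval_s = 3      # dfs.heartbeat.interval: datanode heartbeat period
recheck_interval_s = 5 * 60   # namenode heartbeat recheck interval (5 minutes)

# A datanode is declared dead after 2 recheck intervals plus 10 missed
# heartbeats have elapsed without news from it.
dead_timeout_s = 2 * recheck_interval_s + 10 * heartbeat_interval_s
print(dead_timeout_s)  # 630
```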

2) Connection related settings
2.1) Number of retries
ipc.client.connect.max.retries
Default 10
The number of retries a client will make to establish a server connection. 
Not taken into account if the error is a SocketTimeout; in that case the 
number of retries is 45 (fixed on branch by HADOOP-7932, or in HADOOP-7397). 
For SASL, the number of retries is hard-coded to 15. Can be increased, 
especially if the socket timeouts have been lowered.

ipc.client.connect.max.retries.on.timeouts
Default 45
If you have HADOOP-7932, the maximum number of retries on timeout. This 
counter is separate from ipc.client.connect.max.retries, so if you hit a mix 
of socket errors and timeouts you can get up to 55 retries with the default 
values. Could be lowered, once it is available. With HADOOP-7397, 
ipc.client.connect.max.retries is reused instead, so there would be 10 tries.

dfs.client.block.write.retries
Default 3
The number of tries for the client when writing a block. After a failure, the 
client will connect to the namenode and get a new location, sending the list 
of the datanodes already tried without success. Could be increased, especially 
if the socket timeouts have been lowered. See HBASE-6490.
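As a sketch, the client-side settings above would be set on the HBase side 
(e.g. in hbase-site.xml, since HBase is the HDFS client). The values below 
simply repeat the defaults for illustration; they are not recommendations:

```xml
<!-- Illustrative only: defaults repeated; tune together with your socket timeouts. -->
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>10</value>
</property>
<property>
  <!-- Only effective with HADOOP-7932. -->
  <name>ipc.client.connect.max.retries.on.timeouts</name>
  <value>45</value>
</property>
<property>
  <name>dfs.client.block.write.retries</name>
  <value>3</value>
</property>
```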

dfs.client.block.write.locateFollowingBlock.retries
Default 5
The number of retries to the namenode when the client gets a 
NotReplicatedYetException, i.e. the existing blocks of the file are not yet 
replicated to dfs.replication.min. This should not impact HBase, as 
dfs.replication.min defaults to 1.

dfs.client.max.block.acquire.failures
Default 3
The number of tries to read a block from the datanode list. In other words, if 
5 datanodes are supposed to hold a block (i.e. dfs.replication is 5), the 
client will try all of these datanodes, then check the value of 
dfs.client.max.block.acquire.failures to see if it should retry or not. If so, 
it will get a new list (likely the same), and will try to reconnect again to 
all t

[jira] [Commented] (HBASE-5937) Refactor HLog into an interface.

2012-08-21 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438734#comment-13438734
 ] 

Ivan Kelly commented on HBASE-5937:
---

{quote}
Can you factory it in places like HLogInputFormat?
{quote}
HLogInputFormat is among a few funny ones which I wasn't sure how to deal with 
(one of the others is the PrettyPrinter). The problem with these is that they 
take an individual log file, rather than a whole log instance. This makes it 
more complicated to use them with HLogFactory, as createHLog asks for the 
whole log directory. We could get the log directory by using Path.getParent(), 
but that seemed messy to me at the time. Otherwise, yes, the solution to 
HLogUtil.createReader -> HLog#createReader is to instantiate an HLog where 
needed. We just haven't gotten that far yet.

{quote}
Whats FSLog? An HDFSLog?
{quote}
Exactly. FileSystem doesn't necessarily have to be HDFS so FSLog is a better 
name.

{quote}
Can you not get HLogFactory.createHLog into the places where we have getReader 
now – e.g. in HRegion (Should HRegion even be concerned w/ HLog/WAL? Only 
RegionServer should be?)?
{quote}
As I understand it now, HRegion only creates the HLog for the META and ROOT 
tables, which are managed from the Master and as such, do not have access to a 
RegionServer. Perhaps this could be refactored a little to make HRegion always 
receive a preconstructed HLog.

{quote}
Should HLog Interface be instead named WAL?
Is it right that the HLog Interface takes an fs? That OK for you lads? You'll 
be doing a bookkeeper FS?
HLog Interface seems fat. We need all those methods?
{quote}
The current patch is a first cut to give you guys an idea of where we are, but 
it's quite far from what we imagine the final interface will look like. At the 
moment, what we want to do is refactor all the HLog code so that everything 
that accesses HLog goes through well-defined interfaces. Once that is done, we 
can look at the interfaces to see where the implementation-specific stuff 
(like Path, Filesystem, etc.) is leaking out, and work to resolve it.

I'll try and find time to get back to do some coding on this (i.e. fix tests, 
refactor createReader) this week.

> Refactor HLog into an interface.
> 
>
> Key: HBASE-5937
> URL: https://issues.apache.org/jira/browse/HBASE-5937
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Li Pi
>Assignee: Flavio Junqueira
>Priority: Minor
>
> What the summary says. Create HLog interface. Make current implementation use 
> it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6604) Bump log4j to 1.2.17

2012-08-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438770#comment-13438770
 ] 

Hudson commented on HBASE-6604:
---

Integrated in HBase-TRUNK #3249 (See 
[https://builds.apache.org/job/HBase-TRUNK/3249/])
HBASE-6604 Bump log4j to 1.2.17 (Revision 1375534)

 Result = FAILURE
jmhsieh : 
Files : 
* /hbase/trunk/pom.xml


> Bump log4j to 1.2.17
> 
>
> Key: HBASE-6604
> URL: https://issues.apache.org/jira/browse/HBASE-6604
> Project: HBase
>  Issue Type: Bug
>Reporter: Jonathan Hsieh
>Assignee: Jonathan Hsieh
> Fix For: 0.96.0
>
> Attachments: hbase-6604.patch
>
>
> Hadoop bumped to 1.2.17 log4j (HADOOP-8687), we should probably as well.





[jira] [Commented] (HBASE-6621) Reduce calls to Bytes.toInt

2012-08-21 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438787#comment-13438787
 ] 

Todd Lipcon commented on HBASE-6621:


Oops, yea, I missed the fact that the caching was removed again in the later 
patches. The change makes sense to me.

> Reduce calls to Bytes.toInt
> ---
>
> Key: HBASE-6621
> URL: https://issues.apache.org/jira/browse/HBASE-6621
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Minor
> Fix For: 0.96.0, 0.94.2
>
> Attachments: 6621-0.96.txt, 6621-0.96-v2.txt, 6621-0.96-v3.txt, 
> 6621-0.96-v4.txt
>
>
> Bytes.toInt shows up quite often in a profiler run.
> It turns out that one source is HFileReaderV2$ScannerV2.getKeyValue().
> Notice that we call the KeyValue(byte[], int) constructor, which forces the 
> constructor to determine its size by reading some of the header information 
> and calculate the size. In this case, however, we already know the size (from 
> the call to readKeyValueLen), so we could just use that.
> In the extreme case of 1000's of columns this noticeably reduces CPU. 





[jira] [Commented] (HBASE-6621) Reduce calls to Bytes.toInt

2012-08-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438793#comment-13438793
 ] 

stack commented on HBASE-6621:
--

I +1'd v2 Lars.  v4 looks good to me.

> Reduce calls to Bytes.toInt
> ---
>
> Key: HBASE-6621
> URL: https://issues.apache.org/jira/browse/HBASE-6621
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Minor
> Fix For: 0.96.0, 0.94.2
>
> Attachments: 6621-0.96.txt, 6621-0.96-v2.txt, 6621-0.96-v3.txt, 
> 6621-0.96-v4.txt
>
>
> Bytes.toInt shows up quite often in a profiler run.
> It turns out that one source is HFileReaderV2$ScannerV2.getKeyValue().
> Notice that we call the KeyValue(byte[], int) constructor, which forces the 
> constructor to determine its size by reading some of the header information 
> and calculate the size. In this case, however, we already know the size (from 
> the call to readKeyValueLen), so we could just use that.
> In the extreme case of 1000's of columns this noticeably reduces CPU. 





[jira] [Commented] (HBASE-6581) Build with hadoop.profile=3.0

2012-08-21 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438798#comment-13438798
 ] 

Ivan Kelly commented on HBASE-6581:
---

The hbase-hadoop3-compat module seems to be missing from the patch.

> Build with hadoop.profile=3.0
> -
>
> Key: HBASE-6581
> URL: https://issues.apache.org/jira/browse/HBASE-6581
> Project: HBase
>  Issue Type: Bug
>Reporter: Eric Charles
> Attachments: HBASE-6581-1.patch
>
>
> Building trunk with hadoop.profile=3.0 gives exceptions (see [1]) due to 
> change in the hadoop maven modules naming (and also usage of 3.0-SNAPSHOT 
> instead of 3.0.0-SNAPSHOT in hbase-common).
> I can provide a patch that would move most of hadoop dependencies in their 
> respective profiles and will define the correct hadoop deps in the 3.0 
> profile.
> Please tell me if that's ok to go this way.
> Thx, Eric
> [1]
> $ mvn clean install -Dhadoop.profile=3.0
> [INFO] Scanning for projects...
> [ERROR] The build could not read 3 projects -> [Help 1]
> [ERROR]   
> [ERROR]   The project org.apache.hbase:hbase-server:0.95-SNAPSHOT 
> (/d/hbase.svn/hbase-server/pom.xml) has 3 errors
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-common:jar is missing. @ line 655, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-annotations:jar is missing. @ line 659, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-minicluster:jar is missing. @ line 663, column 21
> [ERROR]   
> [ERROR]   The project org.apache.hbase:hbase-common:0.95-SNAPSHOT 
> (/d/hbase.svn/hbase-common/pom.xml) has 3 errors
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-common:jar is missing. @ line 170, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-annotations:jar is missing. @ line 174, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-minicluster:jar is missing. @ line 178, column 21
> [ERROR]   
> [ERROR]   The project org.apache.hbase:hbase-it:0.95-SNAPSHOT 
> (/d/hbase.svn/hbase-it/pom.xml) has 3 errors
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-common:jar is missing. @ line 220, column 18
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-annotations:jar is missing. @ line 224, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-minicluster:jar is missing. @ line 228, column 21
> [ERROR] 





[jira] [Updated] (HBASE-6581) Build with hadoop.profile=3.0

2012-08-21 Thread Eric Charles (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Charles updated HBASE-6581:


Attachment: HBASE-6581-2.patch

Hi Ivan,
HBASE-6581-2.patch has now the hadoop3-compat module.
Thx for your review,
Eric

> Build with hadoop.profile=3.0
> -
>
> Key: HBASE-6581
> URL: https://issues.apache.org/jira/browse/HBASE-6581
> Project: HBase
>  Issue Type: Bug
>Reporter: Eric Charles
> Attachments: HBASE-6581-1.patch, HBASE-6581-2.patch
>
>
> Building trunk with hadoop.profile=3.0 gives exceptions (see [1]) due to 
> change in the hadoop maven modules naming (and also usage of 3.0-SNAPSHOT 
> instead of 3.0.0-SNAPSHOT in hbase-common).
> I can provide a patch that would move most of hadoop dependencies in their 
> respective profiles and will define the correct hadoop deps in the 3.0 
> profile.
> Please tell me if that's ok to go this way.
> Thx, Eric
> [1]
> $ mvn clean install -Dhadoop.profile=3.0
> [INFO] Scanning for projects...
> [ERROR] The build could not read 3 projects -> [Help 1]
> [ERROR]   
> [ERROR]   The project org.apache.hbase:hbase-server:0.95-SNAPSHOT 
> (/d/hbase.svn/hbase-server/pom.xml) has 3 errors
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-common:jar is missing. @ line 655, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-annotations:jar is missing. @ line 659, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-minicluster:jar is missing. @ line 663, column 21
> [ERROR]   
> [ERROR]   The project org.apache.hbase:hbase-common:0.95-SNAPSHOT 
> (/d/hbase.svn/hbase-common/pom.xml) has 3 errors
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-common:jar is missing. @ line 170, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-annotations:jar is missing. @ line 174, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-minicluster:jar is missing. @ line 178, column 21
> [ERROR]   
> [ERROR]   The project org.apache.hbase:hbase-it:0.95-SNAPSHOT 
> (/d/hbase.svn/hbase-it/pom.xml) has 3 errors
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-common:jar is missing. @ line 220, column 18
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-annotations:jar is missing. @ line 224, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-minicluster:jar is missing. @ line 228, column 21
> [ERROR] 





[jira] [Created] (HBASE-6627) TestMultiVersions.testGetRowVersions is flaky

2012-08-21 Thread nkeywal (JIRA)
nkeywal created HBASE-6627:
--

 Summary: TestMultiVersions.testGetRowVersions is flaky
 Key: HBASE-6627
 URL: https://issues.apache.org/jira/browse/HBASE-6627
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.96.0
 Environment: hadoop-qa mainly; seems to happen when tests run in parallel; 
difficult to reproduce on a single test.
Reporter: nkeywal
Assignee: nkeywal


org.apache.hadoop.hbase.TestMultiVersions.testGetRowVersions


Shutting down

Stacktrace

java.io.IOException: Shutting down
at 
org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:229)
at 
org.apache.hadoop.hbase.MiniHBaseCluster.(MiniHBaseCluster.java:92)
at 
org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:688)
at 
org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:661)
at 
org.apache.hadoop.hbase.TestMultiVersions.testGetRowVersions(TestMultiVersions.java:143)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:47)
at org.junit.rules.RunRules.evaluate(RunRules.java:18)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run





[jira] [Commented] (HBASE-6581) Build with hadoop.profile=3.0

2012-08-21 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438834#comment-13438834
 ] 

Ivan Kelly commented on HBASE-6581:
---

Have you done a full build with 3.0.0-SNAPSHOT?

hbase-server/src/test/java/org/apache/hadoop/hbase/fs/TestBlockReorder.java 
fails to build due to 
https://github.com/apache/hadoop-common/commit/68aab040955acd483794dbf1def0b12ac5ff59e8

> Build with hadoop.profile=3.0
> -
>
> Key: HBASE-6581
> URL: https://issues.apache.org/jira/browse/HBASE-6581
> Project: HBase
>  Issue Type: Bug
>Reporter: Eric Charles
> Attachments: HBASE-6581-1.patch, HBASE-6581-2.patch
>
>
> Building trunk with hadoop.profile=3.0 gives exceptions (see [1]) due to 
> change in the hadoop maven modules naming (and also usage of 3.0-SNAPSHOT 
> instead of 3.0.0-SNAPSHOT in hbase-common).
> I can provide a patch that would move most of hadoop dependencies in their 
> respective profiles and will define the correct hadoop deps in the 3.0 
> profile.
> Please tell me if that's ok to go this way.
> Thx, Eric
> [1]
> $ mvn clean install -Dhadoop.profile=3.0
> [INFO] Scanning for projects...
> [ERROR] The build could not read 3 projects -> [Help 1]
> [ERROR]   
> [ERROR]   The project org.apache.hbase:hbase-server:0.95-SNAPSHOT 
> (/d/hbase.svn/hbase-server/pom.xml) has 3 errors
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-common:jar is missing. @ line 655, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-annotations:jar is missing. @ line 659, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-minicluster:jar is missing. @ line 663, column 21
> [ERROR]   
> [ERROR]   The project org.apache.hbase:hbase-common:0.95-SNAPSHOT 
> (/d/hbase.svn/hbase-common/pom.xml) has 3 errors
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-common:jar is missing. @ line 170, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-annotations:jar is missing. @ line 174, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-minicluster:jar is missing. @ line 178, column 21
> [ERROR]   
> [ERROR]   The project org.apache.hbase:hbase-it:0.95-SNAPSHOT 
> (/d/hbase.svn/hbase-it/pom.xml) has 3 errors
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-common:jar is missing. @ line 220, column 18
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-annotations:jar is missing. @ line 224, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-minicluster:jar is missing. @ line 228, column 21
> [ERROR] 





[jira] [Commented] (HBASE-6581) Build with hadoop.profile=3.0

2012-08-21 Thread Eric Charles (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438839#comment-13438839
 ] 

Eric Charles commented on HBASE-6581:
-

Ivan, from what I have experienced, building hbase against hadoop 
3.0-SNAPSHOT is tricky in the sense that the hadoop source trunk is evolving.

I have come to the conclusion that the maven build pulls the most recently 
updated snapshot jars from the repo, but this can give "inconsistencies" and 
break the hbase build when you enable hadoop.profile=3.0.

I have overcome this by installing hadoop locally with 'mvn install 
-DskipTests', and then packaging hbase with the maven -o option so that no 
dependency is fetched from the network and I am sure the deps are not fetched 
from somewhere else.

At the time I submitted the patch, I was sure everything was fine: 
compilation, and even working with a hadoop3 yarn cluster.

Now I should retest everything, but I have just changed job and country 
yesterday, so I probably won't be able to do it within the coming days.

> Build with hadoop.profile=3.0
> -
>
> Key: HBASE-6581
> URL: https://issues.apache.org/jira/browse/HBASE-6581
> Project: HBase
>  Issue Type: Bug
>Reporter: Eric Charles
> Attachments: HBASE-6581-1.patch, HBASE-6581-2.patch
>
>
> Building trunk with hadoop.profile=3.0 gives exceptions (see [1]) due to 
> change in the hadoop maven modules naming (and also usage of 3.0-SNAPSHOT 
> instead of 3.0.0-SNAPSHOT in hbase-common).
> I can provide a patch that would move most of hadoop dependencies in their 
> respective profiles and will define the correct hadoop deps in the 3.0 
> profile.
> Please tell me if that's ok to go this way.
> Thx, Eric
> [1]
> $ mvn clean install -Dhadoop.profile=3.0
> [INFO] Scanning for projects...
> [ERROR] The build could not read 3 projects -> [Help 1]
> [ERROR]   
> [ERROR]   The project org.apache.hbase:hbase-server:0.95-SNAPSHOT 
> (/d/hbase.svn/hbase-server/pom.xml) has 3 errors
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-common:jar is missing. @ line 655, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-annotations:jar is missing. @ line 659, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-minicluster:jar is missing. @ line 663, column 21
> [ERROR]   
> [ERROR]   The project org.apache.hbase:hbase-common:0.95-SNAPSHOT 
> (/d/hbase.svn/hbase-common/pom.xml) has 3 errors
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-common:jar is missing. @ line 170, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-annotations:jar is missing. @ line 174, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-minicluster:jar is missing. @ line 178, column 21
> [ERROR]   
> [ERROR]   The project org.apache.hbase:hbase-it:0.95-SNAPSHOT 
> (/d/hbase.svn/hbase-it/pom.xml) has 3 errors
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-common:jar is missing. @ line 220, column 18
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-annotations:jar is missing. @ line 224, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-minicluster:jar is missing. @ line 228, column 21
> [ERROR] 





[jira] [Commented] (HBASE-6581) Build with hadoop.profile=3.0

2012-08-21 Thread Ivan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438841#comment-13438841
 ] 

Ivan Kelly commented on HBASE-6581:
---

Ah ok, I'll have a go tomorrow, see what I can do.

> Build with hadoop.profile=3.0
> -
>
> Key: HBASE-6581
> URL: https://issues.apache.org/jira/browse/HBASE-6581
> Project: HBase
>  Issue Type: Bug
>Reporter: Eric Charles
> Attachments: HBASE-6581-1.patch, HBASE-6581-2.patch
>
>
> Building trunk with hadoop.profile=3.0 gives exceptions (see [1]) due to 
> change in the hadoop maven modules naming (and also usage of 3.0-SNAPSHOT 
> instead of 3.0.0-SNAPSHOT in hbase-common).
> I can provide a patch that would move most of hadoop dependencies in their 
> respective profiles and will define the correct hadoop deps in the 3.0 
> profile.
> Please tell me if that's ok to go this way.
> Thx, Eric
> [1]
> $ mvn clean install -Dhadoop.profile=3.0
> [INFO] Scanning for projects...
> [ERROR] The build could not read 3 projects -> [Help 1]
> [ERROR]   
> [ERROR]   The project org.apache.hbase:hbase-server:0.95-SNAPSHOT 
> (/d/hbase.svn/hbase-server/pom.xml) has 3 errors
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-common:jar is missing. @ line 655, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-annotations:jar is missing. @ line 659, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-minicluster:jar is missing. @ line 663, column 21
> [ERROR]   
> [ERROR]   The project org.apache.hbase:hbase-common:0.95-SNAPSHOT 
> (/d/hbase.svn/hbase-common/pom.xml) has 3 errors
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-common:jar is missing. @ line 170, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-annotations:jar is missing. @ line 174, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-minicluster:jar is missing. @ line 178, column 21
> [ERROR]   
> [ERROR]   The project org.apache.hbase:hbase-it:0.95-SNAPSHOT 
> (/d/hbase.svn/hbase-it/pom.xml) has 3 errors
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-common:jar is missing. @ line 220, column 18
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-annotations:jar is missing. @ line 224, column 21
> [ERROR] 'dependencies.dependency.version' for 
> org.apache.hadoop:hadoop-minicluster:jar is missing. @ line 228, column 21
> [ERROR] 





[jira] [Commented] (HBASE-5937) Refactor HLog into an interface.

2012-08-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438842#comment-13438842
 ] 

stack commented on HBASE-5937:
--

bq. The problem with these is that they take an individual log file, rather 
than a whole log instance.

Can we change this boss and pass in an HLog or whatever instance?

bq. We just haven't gotten that far yet.

np


bq. Perhaps this could be refactored a little to make HRegion always receive a 
preconstructed HLog.

Yes.  The fact that HRegion is making HLog instances smells.

And +1 on your approach.



> Refactor HLog into an interface.
> 
>
> Key: HBASE-5937
> URL: https://issues.apache.org/jira/browse/HBASE-5937
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Li Pi
>Assignee: Flavio Junqueira
>Priority: Minor
>
> What the summary says. Create HLog interface. Make current implementation use 
> it.





[jira] [Updated] (HBASE-6627) TestMultiVersions.testGetRowVersions is flaky

2012-08-21 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6627:
---

Attachment: 6627.v1.patch

> TestMultiVersions.testGetRowVersions is flaky
> -
>
> Key: HBASE-6627
> URL: https://issues.apache.org/jira/browse/HBASE-6627
> Project: HBase
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 0.96.0
> Environment: hadoop-qa mainly; seems to happen when tests run in parallel; 
> difficult to reproduce on a single test.
>Reporter: nkeywal
>Assignee: nkeywal
> Attachments: 6627.v1.patch
>
>
> org.apache.hadoop.hbase.TestMultiVersions.testGetRowVersions
> Shutting down
> Stacktrace
> java.io.IOException: Shutting down
>   at 
> org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:229)
>   at 
> org.apache.hadoop.hbase.MiniHBaseCluster.(MiniHBaseCluster.java:92)
>   at 
> org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:688)
>   at 
> org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:661)
>   at 
> org.apache.hadoop.hbase.TestMultiVersions.testGetRowVersions(TestMultiVersions.java:143)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:47)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:18)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
>   at org.junit.runners.ParentRunner$3.run





[jira] [Commented] (HBASE-6627) TestMultiVersions.testGetRowVersions is flaky

2012-08-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438844#comment-13438844
 ] 

nkeywal commented on HBASE-6627:


Don't ask me why, but it seems to do the trick. OK locally after 50 runs. Will 
try it here.

> TestMultiVersions.testGetRowVersions is flaky
> -
>
> Key: HBASE-6627
> URL: https://issues.apache.org/jira/browse/HBASE-6627
> Project: HBase
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 0.96.0
> Environment: hadoop-qa mainly; seems to happen when tests run in parallel; 
> difficult to reproduce with a single test.
>Reporter: nkeywal
>Assignee: nkeywal
> Attachments: 6627.v1.patch
>
>
> org.apache.hadoop.hbase.TestMultiVersions.testGetRowVersions
> Shutting down
> Stacktrace
> java.io.IOException: Shutting down
>   at 
> org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:229)
>   at 
> org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:92)
>   at 
> org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:688)
>   at 
> org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:661)
>   at 
> org.apache.hadoop.hbase.TestMultiVersions.testGetRowVersions(TestMultiVersions.java:143)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:47)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:18)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
>   at org.junit.runners.ParentRunner$3.run

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6627) TestMultiVersions.testGetRowVersions is flaky

2012-08-21 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6627:
---

Status: Patch Available  (was: Open)

> TestMultiVersions.testGetRowVersions is flaky
> -
>
> Key: HBASE-6627
> URL: https://issues.apache.org/jira/browse/HBASE-6627
> Project: HBase
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 0.96.0
> Environment: hadoop-qa mainly; seems to happen when tests run in parallel; 
> difficult to reproduce with a single test.
>Reporter: nkeywal
>Assignee: nkeywal
> Attachments: 6627.v1.patch
>
>
> org.apache.hadoop.hbase.TestMultiVersions.testGetRowVersions
> Shutting down
> Stacktrace
> java.io.IOException: Shutting down
>   at 
> org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:229)
>   at 
> org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:92)
>   at 
> org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:688)
>   at 
> org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:661)
>   at 
> org.apache.hadoop.hbase.TestMultiVersions.testGetRowVersions(TestMultiVersions.java:143)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:47)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:18)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
>   at org.junit.runners.ParentRunner$3.run

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6626) Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.

2012-08-21 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-6626:
-

Attachment: troubleshooting.txt

Started converting to docbook.

Nicolas, you are missing the [1] link below.  What did you intend to point to?

> Add a chapter on HDFS in the troubleshooting section of the HBase reference 
> guide.
> --
>
> Key: HBASE-6626
> URL: https://issues.apache.org/jira/browse/HBASE-6626
> Project: HBase
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 0.96.0
>Reporter: nkeywal
>Priority: Minor
> Attachments: troubleshooting.txt
>
>
> I looked mainly at the major failure case, but here is what I have:
> New sub chapter in the existing chapter "Troubleshooting and Debugging 
> HBase": "HDFS & HBASE"
> 1) HDFS & HBase
> 2) Connection related settings
> 2.1) Number of retries
> 2.2) Timeouts
> 3) Log samples
> 1) HDFS & HBase
> HBase uses HDFS to store its HFiles, i.e. the core HBase files, and its 
> Write-Ahead-Logs, i.e. the files that will be used to restore the data after 
> a crash.
> In both cases, the reliability of HBase comes from the fact that HDFS writes 
> the data to multiple locations. To be efficient, HBase needs the data to be 
> available locally, hence it's highly recommended to run an HDFS datanode on 
> the same machines as the HBase Region Servers.
> Detailed information on how HDFS works can be found at [1].
> Important points are:
>  - HBase is a client application of HDFS, i.e. it uses the HDFS DFSClient 
> class. This class can appear in HBase logs alongside other HDFS-client logs.
>  - Some HDFS settings are HDFS-server-side, i.e. must be set on the HDFS 
> side, while some others are HDFS-client-side, i.e. must be set in HBase, and 
> some others must be set in both places.
>  - HDFS writes are pipelined from one datanode to another. When writing, 
> there are communications between:
> - HBase and the HDFS namenode, through the HDFS client classes.
> - HBase and the HDFS datanodes, through the HDFS client classes.
> - HDFS datanodes between themselves: issues on these communications show up 
> in HDFS logs, not HBase logs. HDFS writes are always local when possible. As 
> a consequence, there should not be many write errors in HBase Region 
> Servers: they write to the local datanode. If this datanode can't replicate 
> the blocks, it will appear in its own logs, not in the region server logs.
>  - Datanodes can be contacted through the ipc.Client interface (once again, 
> this class can show up in HBase logs) and through the data transfer 
> interface (which usually shows up as the DataNode class in HBase logs). 
> These are on different ports (defaults being 50010 and 50020).
>  - To understand exactly what's going on, you must look at the HDFS log 
> files as well: the HBase logs represent only the client side.
>  - With the default settings, HDFS needs 630s to mark a datanode as dead. 
> Until then, this node will still be tried by HBase or by other datanodes 
> when writing and reading, until HDFS definitively decides it's dead. This 
> adds some extra lines in the logs. This monitoring is performed by the 
> NameNode.
>  - The HDFS clients (i.e. HBase using the HDFS client code) don't fully rely 
> on the NameNode, but can temporarily mark a node as dead if they got an 
> error when they tried to use it.
> 2) Settings for retries and timeouts
> 2.1) Retries
> ipc.client.connect.max.retries
> Default 10
> Indicates the number of retries a client will make to establish a server 
> connection. Not taken into account if the error is a SocketTimeout; in that 
> case the number of retries is 45 (fixed on branch by HADOOP-7932 or by 
> HADOOP-7397). For SASL, the number of retries is hard-coded to 15. Can be 
> increased, especially if the socket timeouts have been lowered.
> ipc.client.connect.max.retries.on.timeouts
> Default 45
> If you have HADOOP-7932, the max number of retries on timeout. The counter 
> is different from ipc.client.connect.max.retries, so if you mix socket 
> errors you will get 55 retries with the default values. Could be lowered, 
> once it is available. With HADOOP-7397, ipc.client.connect.max.retries is 
> reused, so there would be 10 tries.
> dfs.client.block.write.retries
> Default 3
> Number of tries for the client when writing a block. After a failure, it 
> reconnects to the namenode and gets a new location, sending the list of the 
> datanodes already tried without success. Could be increased, especially if 
> the socket timeouts have been lowered. See HBASE-6490.
> dfs.client.block.write.locateFollowingBlock.retries
> Default 5
> Number of retries to the namenode when the client got 
> NotReplicatedYetException, i.e. the existing nodes of the files are not y
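The client-side properties quoted in the description would live in HBase's configuration. A minimal hbase-site.xml sketch, using only the default values stated above (illustrative, not a tuning recommendation):

```xml
<!-- Client-side settings: these go in HBase's hbase-site.xml, since HBase
     is the HDFS client here. Values below are the defaults quoted in the
     description above. -->
<configuration>
  <property>
    <name>ipc.client.connect.max.retries</name>
    <value>10</value> <!-- retries to establish a server connection -->
  </property>
  <property>
    <name>dfs.client.block.write.retries</name>
    <value>3</value> <!-- tries when writing a block; see HBASE-6490 -->
  </property>
</configuration>
```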

[jira] [Commented] (HBASE-6626) Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.

2012-08-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438849#comment-13438849
 ] 

stack commented on HBASE-6626:
--

Doug, you want to take on this one?

> Add a chapter on HDFS in the troubleshooting section of the HBase reference 
> guide.
> --
>
> Key: HBASE-6626
> URL: https://issues.apache.org/jira/browse/HBASE-6626
> Project: HBase
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 0.96.0
>Reporter: nkeywal
>Priority: Minor
> Attachments: troubleshooting.txt
>
>
> I looked mainly at the major failure case, but here is what I have:
> New sub chapter in the existing chapter "Troubleshooting and Debugging 
> HBase": "HDFS & HBASE"
> 1) HDFS & HBase
> 2) Connection related settings
> 2.1) Number of retries
> 2.2) Timeouts
> 3) Log samples
> 1) HDFS & HBase
> HBase uses HDFS to store its HFiles, i.e. the core HBase files, and its 
> Write-Ahead-Logs, i.e. the files that will be used to restore the data after 
> a crash.
> In both cases, the reliability of HBase comes from the fact that HDFS writes 
> the data to multiple locations. To be efficient, HBase needs the data to be 
> available locally, hence it's highly recommended to run an HDFS datanode on 
> the same machines as the HBase Region Servers.
> Detailed information on how HDFS works can be found at [1].
> Important points are:
>  - HBase is a client application of HDFS, i.e. it uses the HDFS DFSClient 
> class. This class can appear in HBase logs alongside other HDFS-client logs.
>  - Some HDFS settings are HDFS-server-side, i.e. must be set on the HDFS 
> side, while some others are HDFS-client-side, i.e. must be set in HBase, and 
> some others must be set in both places.
>  - HDFS writes are pipelined from one datanode to another. When writing, 
> there are communications between:
> - HBase and the HDFS namenode, through the HDFS client classes.
> - HBase and the HDFS datanodes, through the HDFS client classes.
> - HDFS datanodes between themselves: issues on these communications show up 
> in HDFS logs, not HBase logs. HDFS writes are always local when possible. As 
> a consequence, there should not be many write errors in HBase Region 
> Servers: they write to the local datanode. If this datanode can't replicate 
> the blocks, it will appear in its own logs, not in the region server logs.
>  - Datanodes can be contacted through the ipc.Client interface (once again, 
> this class can show up in HBase logs) and through the data transfer 
> interface (which usually shows up as the DataNode class in HBase logs). 
> These are on different ports (defaults being 50010 and 50020).
>  - To understand exactly what's going on, you must look at the HDFS log 
> files as well: the HBase logs represent only the client side.
>  - With the default settings, HDFS needs 630s to mark a datanode as dead. 
> Until then, this node will still be tried by HBase or by other datanodes 
> when writing and reading, until HDFS definitively decides it's dead. This 
> adds some extra lines in the logs. This monitoring is performed by the 
> NameNode.
>  - The HDFS clients (i.e. HBase using the HDFS client code) don't fully rely 
> on the NameNode, but can temporarily mark a node as dead if they got an 
> error when they tried to use it.
> 2) Settings for retries and timeouts
> 2.1) Retries
> ipc.client.connect.max.retries
> Default 10
> Indicates the number of retries a client will make to establish a server 
> connection. Not taken into account if the error is a SocketTimeout; in that 
> case the number of retries is 45 (fixed on branch by HADOOP-7932 or by 
> HADOOP-7397). For SASL, the number of retries is hard-coded to 15. Can be 
> increased, especially if the socket timeouts have been lowered.
> ipc.client.connect.max.retries.on.timeouts
> Default 45
> If you have HADOOP-7932, the max number of retries on timeout. The counter 
> is different from ipc.client.connect.max.retries, so if you mix socket 
> errors you will get 55 retries with the default values. Could be lowered, 
> once it is available. With HADOOP-7397, ipc.client.connect.max.retries is 
> reused, so there would be 10 tries.
> dfs.client.block.write.retries
> Default 3
> Number of tries for the client when writing a block. After a failure, it 
> reconnects to the namenode and gets a new location, sending the list of the 
> datanodes already tried without success. Could be increased, especially if 
> the socket timeouts have been lowered. See HBASE-6490.
> dfs.client.block.write.locateFollowingBlock.retries
> Default 5
> Number of retries to the namenode when the client got 
> NotReplicatedYetException, i.e. the existing nodes of the files are not yet 
> replicated to dfs.replication.min. This should no

[jira] [Commented] (HBASE-6627) TestMultiVersions.testGetRowVersions is flaky

2012-08-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438850#comment-13438850
 ] 

stack commented on HBASE-6627:
--

I won't ask you why.  Patch looks fine.

> TestMultiVersions.testGetRowVersions is flaky
> -
>
> Key: HBASE-6627
> URL: https://issues.apache.org/jira/browse/HBASE-6627
> Project: HBase
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 0.96.0
> Environment: hadoop-qa mainly; seems to happen when tests run in parallel; 
> difficult to reproduce with a single test.
>Reporter: nkeywal
>Assignee: nkeywal
> Attachments: 6627.v1.patch
>
>
> org.apache.hadoop.hbase.TestMultiVersions.testGetRowVersions
> Shutting down
> Stacktrace
> java.io.IOException: Shutting down
>   at 
> org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:229)
>   at 
> org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:92)
>   at 
> org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:688)
>   at 
> org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:661)
>   at 
> org.apache.hadoop.hbase.TestMultiVersions.testGetRowVersions(TestMultiVersions.java:143)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:47)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:18)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
>   at org.junit.runners.ParentRunner$3.run

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6490) 'dfs.client.block.write.retries' value could be increased in HBase

2012-08-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438854#comment-13438854
 ] 

stack commented on HBASE-6490:
--

What should we increase it to, N?  We can't increase it just for the WAL... it'd 
be global?

> 'dfs.client.block.write.retries' value could be increased in HBase
> --
>
> Key: HBASE-6490
> URL: https://issues.apache.org/jira/browse/HBASE-6490
> Project: HBase
>  Issue Type: Improvement
>  Components: master, regionserver
>Affects Versions: 0.96.0
> Environment: all
>Reporter: nkeywal
>Priority: Minor
>
> When allocating a new node during writing, HDFS tries 
> 'dfs.client.block.write.retries' times (default 3) to write the block. When 
> it fails, it goes back to the namenode for a new list, and raises an error 
> if the number of retries is reached. In HBase, if the error occurs while 
> we're writing an hlog file, it will trigger a region server abort (as HBase 
> does not trust the log anymore). For the simple case (a new, and as such 
> empty, log file), this seems to be ok, and we don't lose data. There could 
> be some complex cases if the error occurs on an hlog file with multiple 
> blocks already written.
> Logs lines are:
> "Exception in createBlockOutputStream", then "Abandoning block " followed by 
> "Excluding datanode " for a retry.
> IOException: "Unable to create new block.", when the number of retries is 
> reached.
> Probability of occurrence seems quite low, (number of bad nodes / number of 
> nodes)^(number of retries), and it implies that you have a region server 
> without its datanode. But it's per new block.
> Increasing the default value of 'dfs.client.block.write.retries' could make 
> sense to be better covered in chaotic conditions.
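The probability expression quoted in the description can be sanity-checked with a few lines of arithmetic. The cluster numbers below are made up purely for illustration:

```python
def block_write_failure_probability(bad_nodes, total_nodes, retries):
    """(bad nodes / total nodes) ^ retries: the chance that every try
    at writing a block lands on a bad datanode, per the estimate in
    this issue's description."""
    return (bad_nodes / total_nodes) ** retries

# e.g. 1 dead datanode in a 20-node cluster, with the default of 3 tries
p = block_write_failure_probability(1, 20, 3)
print(f"{p:.2e}")  # 1.25e-04
```

Small per block, as the description says, but it is incurred for every new block, so the aggregate risk grows with write volume.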

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438855#comment-13438855
 ] 

Lars Hofhansl commented on HBASE-6364:
--

@Nicolas: What's your feeling here... Should this go into 0.94? The patch 
seems contained to me.

> Powering down the server host holding the .META. table causes HBase Client to 
> take excessively long to recover and connect to reassigned .META. table
> -
>
> Key: HBASE-6364
> URL: https://issues.apache.org/jira/browse/HBASE-6364
> Project: HBase
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.90.6, 0.92.1, 0.94.0
>Reporter: Suraj Varma
>Assignee: nkeywal
>  Labels: client
> Fix For: 0.96.0
>
> Attachments: 6364-host-serving-META.v1.patch, 
> 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 
> 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 
> 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 
> 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt
>
>
> When a server host with a Region Server holding the .META. table is powered 
> down on a live cluster, while the HBase cluster itself detects and reassigns 
> the .META. table, connected HBase Clients take an excessively long time to 
> detect this and re-discover the reassigned .META. 
> Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
> value (default is 20s leading to 35 minute recovery time; we were able to get 
> acceptable results with 100ms getting a 3 minute recovery) 
> This was found during some hardware failure testing scenarios. 
> Test Case:
> 1) Apply load via client app on HBase cluster for several minutes
> 2) Power down the region server holding the .META. server (i.e. power off ... 
> and keep it off)
> 3) Measure how long it takes for cluster to reassign META table and for 
> client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
> and DN on that host).
> Observation:
> 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
> recover (i.e. for the thread count to go back to normal) - no client calls 
> are serviced - they just back up on a synchronized method (see #2 below)
> 2) All the client app threads queue up behind the 
> oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
> After taking several thread dumps we found that the thread within this 
> synchronized method was blocked on  NetUtils.connect(this.socket, 
> remoteId.getAddress(), getSocketTimeout(conf));
> The client thread that gets the synchronized lock would try to connect to the 
> dead RS (till socket times out after 20s), retries, and then the next thread 
> gets in and so forth in a serial manner.
> Workaround:
> ---
> Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
> (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
> the client threads recovered in a couple of minutes by failing fast and 
> re-discovering the .META. table on a reassigned RS.
> Assumption: This ipc.socket.timeout is only ever used during the initial 
> "HConnection" setup via the NetUtils.connect and should only ever be used 
> when connectivity to a region server is lost and needs to be re-established. 
> i.e. it does not affect the normal "RPC" activity, as this is just the 
> connect timeout.
> During RS GC periods, any _new_ clients trying to connect will fail and will 
> require .META. table re-lookups.
> This above timeout workaround is only for the HBase client side.
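The workaround described above amounts to a single client-side property. A sketch of what that could look like in the client's hbase-site.xml (the 100 ms value is the one tried in this report, not a general recommendation):

```xml
<!-- Client-side hbase-site.xml: fail fast when a region server host dies,
     so threads stop serializing behind a 20s connect timeout.
     Value is in milliseconds; the default is 20000 (20s). -->
<property>
  <name>ipc.socket.timeout</name>
  <value>100</value>
</property>
```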

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6626) Add a chapter on HDFS in the troubleshooting section of the HBase reference guide.

2012-08-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438858#comment-13438858
 ] 

nkeywal commented on HBASE-6626:


bq. Nicolas, you are missing the [1] link below. What did you intend to point 
to?

Oops. [1] http://www.aosabook.org/en/hdfs.html

> Add a chapter on HDFS in the troubleshooting section of the HBase reference 
> guide.
> --
>
> Key: HBASE-6626
> URL: https://issues.apache.org/jira/browse/HBASE-6626
> Project: HBase
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 0.96.0
>Reporter: nkeywal
>Priority: Minor
> Attachments: troubleshooting.txt
>
>
> I looked mainly at the major failure case, but here is what I have:
> New sub chapter in the existing chapter "Troubleshooting and Debugging 
> HBase": "HDFS & HBASE"
> 1) HDFS & HBase
> 2) Connection related settings
> 2.1) Number of retries
> 2.2) Timeouts
> 3) Log samples
> 1) HDFS & HBase
> HBase uses HDFS to store its HFiles, i.e. the core HBase files, and its 
> Write-Ahead-Logs, i.e. the files that will be used to restore the data after 
> a crash.
> In both cases, the reliability of HBase comes from the fact that HDFS writes 
> the data to multiple locations. To be efficient, HBase needs the data to be 
> available locally, hence it's highly recommended to run an HDFS datanode on 
> the same machines as the HBase Region Servers.
> Detailed information on how HDFS works can be found at [1].
> Important points are:
>  - HBase is a client application of HDFS, i.e. it uses the HDFS DFSClient 
> class. This class can appear in HBase logs alongside other HDFS-client logs.
>  - Some HDFS settings are HDFS-server-side, i.e. must be set on the HDFS 
> side, while some others are HDFS-client-side, i.e. must be set in HBase, and 
> some others must be set in both places.
>  - HDFS writes are pipelined from one datanode to another. When writing, 
> there are communications between:
> - HBase and the HDFS namenode, through the HDFS client classes.
> - HBase and the HDFS datanodes, through the HDFS client classes.
> - HDFS datanodes between themselves: issues on these communications show up 
> in HDFS logs, not HBase logs. HDFS writes are always local when possible. As 
> a consequence, there should not be many write errors in HBase Region 
> Servers: they write to the local datanode. If this datanode can't replicate 
> the blocks, it will appear in its own logs, not in the region server logs.
>  - Datanodes can be contacted through the ipc.Client interface (once again, 
> this class can show up in HBase logs) and through the data transfer 
> interface (which usually shows up as the DataNode class in HBase logs). 
> These are on different ports (defaults being 50010 and 50020).
>  - To understand exactly what's going on, you must look at the HDFS log 
> files as well: the HBase logs represent only the client side.
>  - With the default settings, HDFS needs 630s to mark a datanode as dead. 
> Until then, this node will still be tried by HBase or by other datanodes 
> when writing and reading, until HDFS definitively decides it's dead. This 
> adds some extra lines in the logs. This monitoring is performed by the 
> NameNode.
>  - The HDFS clients (i.e. HBase using the HDFS client code) don't fully rely 
> on the NameNode, but can temporarily mark a node as dead if they got an 
> error when they tried to use it.
> 2) Settings for retries and timeouts
> 2.1) Retries
> ipc.client.connect.max.retries
> Default 10
> Indicates the number of retries a client will make to establish a server 
> connection. Not taken into account if the error is a SocketTimeout; in that 
> case the number of retries is 45 (fixed on branch by HADOOP-7932 or by 
> HADOOP-7397). For SASL, the number of retries is hard-coded to 15. Can be 
> increased, especially if the socket timeouts have been lowered.
> ipc.client.connect.max.retries.on.timeouts
> Default 45
> If you have HADOOP-7932, the max number of retries on timeout. The counter 
> is different from ipc.client.connect.max.retries, so if you mix socket 
> errors you will get 55 retries with the default values. Could be lowered, 
> once it is available. With HADOOP-7397, ipc.client.connect.max.retries is 
> reused, so there would be 10 tries.
> dfs.client.block.write.retries
> Default 3
> Number of tries for the client when writing a block. After a failure, it 
> reconnects to the namenode and gets a new location, sending the list of the 
> datanodes already tried without success. Could be increased, especially if 
> the socket timeouts have been lowered. See HBASE-6490.
> dfs.client.block.write.locateFollowingBlock.retries
> Default 5
> Number of retries to the namenode when the client got 
> NotReplicatedYetException, i.

[jira] [Commented] (HBASE-6524) Hooks for hbase tracing

2012-08-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438860#comment-13438860
 ] 

stack commented on HBASE-6524:
--

Do you have any more pretty graphs you can post here?

> Hooks for hbase tracing
> ---
>
> Key: HBASE-6524
> URL: https://issues.apache.org/jira/browse/HBASE-6524
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Jonathan Leavitt
> Attachments: hbase-6524.diff
>
>
> Includes the hooks that use [htrace|http://www.github.com/cloudera/htrace] 
> library to add dapper-like tracing to hbase.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6524) Hooks for hbase tracing

2012-08-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438859#comment-13438859
 ] 

stack commented on HBASE-6524:
--

bq... so if you think the extra overhead from creating the TraceInfo is bad, I'm 
happy to change it.

It's fine.  It's a small object instantiation.  Better if we didn't have to do 
it, but if it makes for a cleaner hand-off, it's fine.

bq. ...or describe them more thoroughly in SpanReceiverHost.java? Or somewhere 
else?

It's fine.  The class comment in SpanReceiverHost is enough.

bq. ...maybe start a new span any time we talk to HDFS

Yeah, that sounds good.  Can we do that in a new JIRA?

bq.  I will work on a blog post.

I just think this a topic that will be of general interest.

For refguide, a sentence or two w/ pointers on  how to set it all up would be 
great.

I think we should commit this if it's fine by you.


> Hooks for hbase tracing
> ---
>
> Key: HBASE-6524
> URL: https://issues.apache.org/jira/browse/HBASE-6524
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Jonathan Leavitt
> Attachments: hbase-6524.diff
>
>
> Includes the hooks that use the [htrace|http://www.github.com/cloudera/htrace] 
> library to add Dapper-like tracing to HBase.





[jira] [Updated] (HBASE-6621) Reduce calls to Bytes.toInt

2012-08-21 Thread Lars Hofhansl (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-6621:
-

  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Committed to 0.94 and 0.96.

> Reduce calls to Bytes.toInt
> ---
>
> Key: HBASE-6621
> URL: https://issues.apache.org/jira/browse/HBASE-6621
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Minor
> Fix For: 0.96.0, 0.94.2
>
> Attachments: 6621-0.96.txt, 6621-0.96-v2.txt, 6621-0.96-v3.txt, 
> 6621-0.96-v4.txt
>
>
> Bytes.toInt shows up quite often in a profiler run.
> It turns out that one source is HFileReaderV2$ScannerV2.getKeyValue().
> Notice that we call the KeyValue(byte[], int) constructor, which forces the 
> constructor to determine the size by reading some of the header information. 
> In this case, however, we already know the size (from the call to 
> readKeyValueLen), so we could just use that.
> In the extreme case of 1000's of columns this noticeably reduces CPU. 
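The pattern described above can be sketched with a toy cell format. The class and helper names below are hypothetical, not the actual HBase KeyValue API: the point is that re-deriving a length from a serialized header costs extra Bytes.toInt-style reads that a caller who already knows the length can skip.

```java
import java.nio.ByteBuffer;

// Toy sketch of the pattern described above; KnownLengthSketch and its
// helpers are hypothetical, not the actual HBase KeyValue API.
public class KnownLengthSketch {
    // Mimics Bytes.toInt: read a big-endian int from a byte[] at an offset.
    static int toInt(byte[] b, int off) {
        return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
             | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
    }

    // "KeyValue(byte[], int)" analog: must re-derive the cell length by
    // parsing the key-length and value-length ints out of the header.
    static int lengthFromHeader(byte[] buf, int off) {
        int keyLen = toInt(buf, off);       // two toInt-style reads
        int valLen = toInt(buf, off + 4);
        return 8 + keyLen + valLen;         // header + key + value
    }

    public static void main(String[] args) {
        byte[] buf = ByteBuffer.allocate(8 + 3 + 5)
            .putInt(3).putInt(5)            // keyLen = 3, valLen = 5
            .put(new byte[] {1, 2, 3})
            .put(new byte[] {4, 5, 6, 7, 8})
            .array();
        // A scanner that already knows the length (readKeyValueLen analog)
        // can skip the header parse entirely; here the parse is redundant.
        System.out.println(lengthFromHeader(buf, 0) == buf.length); // prints "true"
    }
}
```

Passing the already-known length through to the constructor is what removes the redundant toInt calls from the hot path.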





[jira] [Commented] (HBASE-6490) 'dfs.client.block.write.retries' value could be increased in HBase

2012-08-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438863#comment-13438863
 ] 

nkeywal commented on HBASE-6490:


I don't think it's an issue to increase it globally. I haven't yet looked at the 
memstore flush, but I think it's gonna be the same or worse: we don't really 
expect a write to fail.
I need to check whether it can be a fixed value or whether we need to take into 
account the replication factor or the number of machines...

> 'dfs.client.block.write.retries' value could be increased in HBase
> --
>
> Key: HBASE-6490
> URL: https://issues.apache.org/jira/browse/HBASE-6490
> Project: HBase
>  Issue Type: Improvement
>  Components: master, regionserver
>Affects Versions: 0.96.0
> Environment: all
>Reporter: nkeywal
>Priority: Minor
>
> When allocating a new node during writing, hdfs tries 
> 'dfs.client.block.write.retries' times (default 3) to write the block. When 
> it fails, it goes back to the namenode for a new list, and raises an error if 
> the number of retries is reached. In HBase, if the error occurs while we're 
> writing a hlog file, it will trigger a region server abort (as hbase does not 
> trust the log anymore). For the simple case (a new, and as such empty, log 
> file), this seems to be ok, and we don't lose data. There could be some complex 
> cases if the error occurs on a hlog file with multiple blocks already written.
> Logs lines are:
> "Exception in createBlockOutputStream", then "Abandoning block " followed by 
> "Excluding datanode " for a retry.
> IOException: "Unable to create new block.", when the number of retries is 
> reached.
> Probability of occurrence seems quite low: (number of bad nodes / number of 
> nodes)^(number of retries), and it implies that you have a region server 
> without its datanode. But it's per new block.
> Increasing the default value of 'dfs.client.block.write.retries' could make 
> sense, so that we are better covered in chaotic conditions.
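The suggested tuning, as a client-side hdfs-site.xml fragment. The property name comes from the report; the value 6 is just an illustrative choice, not a recommendation from this thread:

```xml
<!-- Client-side hdfs-site.xml; 6 is an illustrative value only -->
<property>
  <name>dfs.client.block.write.retries</name>
  <value>6</value>
</property>
```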





[jira] [Commented] (HBASE-6621) Reduce calls to Bytes.toInt

2012-08-21 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438864#comment-13438864
 ] 

Lars Hofhansl commented on HBASE-6621:
--

Thanks for the reviews.

> Reduce calls to Bytes.toInt
> ---
>
> Key: HBASE-6621
> URL: https://issues.apache.org/jira/browse/HBASE-6621
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Minor
> Fix For: 0.96.0, 0.94.2
>
> Attachments: 6621-0.96.txt, 6621-0.96-v2.txt, 6621-0.96-v3.txt, 
> 6621-0.96-v4.txt
>
>
> Bytes.toInt shows up quite often in a profiler run.
> It turns out that one source is HFileReaderV2$ScannerV2.getKeyValue().
> Notice that we call the KeyValue(byte[], int) constructor, which forces the 
> constructor to determine the size by reading some of the header information. 
> In this case, however, we already know the size (from the call to 
> readKeyValueLen), so we could just use that.
> In the extreme case of 1000's of columns this noticeably reduces CPU. 





[jira] [Commented] (HBASE-6619) Do no unregister and re-register interest ops in RPC

2012-08-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438869#comment-13438869
 ] 

stack commented on HBASE-6619:
--

Thanks Michal for the explanation.  Would be a sweet improvement.

> Do no unregister and re-register interest ops in RPC
> 
>
> Key: HBASE-6619
> URL: https://issues.apache.org/jira/browse/HBASE-6619
> Project: HBase
>  Issue Type: Bug
>  Components: ipc
>Reporter: Karthik Ranganathan
>Assignee: Michal Gregorczyk
>
> While investigating perf of HBase, Michal noticed that we could cut about 
> 5-40% (depending on number of threads) from the total get time in the RPC on 
> the server side if we eliminated re-registering for interest ops.
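A minimal sketch of the idea, using plain java.nio rather than the actual HBase RPC code: update a SelectionKey's interest set only when it actually changes, instead of unconditionally unregistering and re-registering it.

```java
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Sketch only (plain java.nio, not the actual HBase server code): update a
// key's interest set conditionally instead of unregistering/re-registering.
public class InterestOpsSketch {
    static boolean demo() throws Exception {
        Selector selector = Selector.open();
        Pipe pipe = Pipe.open();
        pipe.source().configureBlocking(false);
        // Register with an empty interest set to start from a known state.
        SelectionKey key = pipe.source().register(selector, 0);

        // Only touch interestOps when the bit is actually missing; redundant
        // interestOps(int) calls are the churn the comment refers to.
        int ops = key.interestOps();
        if ((ops & SelectionKey.OP_READ) == 0) {
            key.interestOps(ops | SelectionKey.OP_READ);
        }
        boolean readRegistered = key.interestOps() == SelectionKey.OP_READ;

        pipe.source().close();
        pipe.sink().close();
        selector.close();
        return readRegistered;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // prints "true"
    }
}
```

Skipping the no-op updates matters because interestOps changes interact with the selector thread, which is exactly where the reported 5-40% of server-side get time was going.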





[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438874#comment-13438874
 ] 

nkeywal commented on HBASE-6364:


@lars
I think it's reasonably safe (well, I wrote it :-) ). On the other hand, to hit 
the issue it solves you need to be a little bit unlucky as well. As the 
consequences are quite bad when you're unlucky, I think it's better to do it. 
And if there is an issue you can deactivate it (by setting 
hbase.ipc.client.failed.servers.expiry to zero), or lower the expiry time to 
something like 100ms. If it breaks something I will fix it for both versions.
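The deactivation knob mentioned above, as a client-side hbase-site.xml fragment. The property name comes straight from the comment; per the comment, 0 deactivates the behavior:

```xml
<!-- Client-side hbase-site.xml: 0 deactivates the failed-server expiry, per the comment -->
<property>
  <name>hbase.ipc.client.failed.servers.expiry</name>
  <value>0</value>
</property>
```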

> Powering down the server host holding the .META. table causes HBase Client to 
> take excessively long to recover and connect to reassigned .META. table
> -
>
> Key: HBASE-6364
> URL: https://issues.apache.org/jira/browse/HBASE-6364
> Project: HBase
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.90.6, 0.92.1, 0.94.0
>Reporter: Suraj Varma
>Assignee: nkeywal
>  Labels: client
> Fix For: 0.96.0
>
> Attachments: 6364-host-serving-META.v1.patch, 
> 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 
> 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 
> 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 
> 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt
>
>
> When a server host with a Region Server holding the .META. table is powered 
> down on a live cluster, while the HBase cluster itself detects and reassigns 
> the .META. table, connected HBase Clients take an excessively long time to 
> detect this and re-discover the reassigned .META. 
> Workaround: Decrease the ipc.socket.timeout on the HBase Client side to a low 
> value (the default is 20s, leading to a 35-minute recovery time; we were able 
> to get acceptable results with 100ms, getting a 3-minute recovery). 
> This was found during some hardware failure testing scenarios. 
> Test Case:
> 1) Apply load via client app on HBase cluster for several minutes
> 2) Power down the region server holding the .META. server (i.e. power off ... 
> and keep it off)
> 3) Measure how long it takes for cluster to reassign META table and for 
> client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
> and DN on that host).
> Observation:
> 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
> recover (i.e. for the thread count to go back to normal) - no client calls 
> are serviced - they just back up on a synchronized method (see #2 below)
> 2) All the client app threads queue up behind the 
> oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
> After taking several thread dumps we found that the thread within this 
> synchronized method was blocked on  NetUtils.connect(this.socket, 
> remoteId.getAddress(), getSocketTimeout(conf));
> The client thread that gets the synchronized lock would try to connect to the 
> dead RS (till socket times out after 20s), retries, and then the next thread 
> gets in and so forth in a serial manner.
> Workaround:
> ---
> Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
> (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
> the client threads recovered in a couple of minutes by failing fast and 
> re-discovering the .META. table on a reassigned RS.
> Assumption: This ipc.socket.timeout is only ever used during the initial 
> "HConnection" setup via NetUtils.connect and should only ever be used 
> when connectivity to a region server is lost and needs to be re-established, 
> i.e. it does not affect the normal "RPC" activity, as this is just the connect 
> timeout.
> During RS GC periods, any _new_ clients trying to connect will fail and will 
> require .META. table re-lookups.
> This above timeout workaround is only for the HBase client side.
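The workaround described above, as a client-side hbase-site.xml fragment (1000 ms is one of the values the report says was tried):

```xml
<!-- Client-side hbase-site.xml: low connect timeout, per the workaround above -->
<property>
  <name>ipc.socket.timeout</name>
  <value>1000</value>
</property>
```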





[jira] [Commented] (HBASE-6622) TestUpgradeFromHFileV1ToEncoding#testUpgrade fails in trunk

2012-08-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438880#comment-13438880
 ] 

stack commented on HBASE-6622:
--

This failure starts with 
https://builds.apache.org/view/G-L/view/HBase/job/HBase-TRUNK/3242/ (I think)

HBASE-6414 Remove the WritableRpcEngine & associated Invocation classes; REMOVE 
EMPTY FILES (detail)

(Which seems unrelated)

It doesn't always fail, it seems.  Build #3244 passes.

I tried it locally and it passes.

It fails a bunch up on Jenkins, though.



> TestUpgradeFromHFileV1ToEncoding#testUpgrade fails in trunk
> ---
>
> Key: HBASE-6622
> URL: https://issues.apache.org/jira/browse/HBASE-6622
> Project: HBase
>  Issue Type: Bug
>Reporter: Zhihong Ted Yu
>
> TestUpgradeFromHFileV1ToEncoding has been failing since build #3242.
> Build #3246 is a more recent one where it failed.
> {code}
> 2012-08-21 00:49:06,536 INFO  
> [SplitLogWorker-vesta.apache.org,40294,1345510146310] 
> regionserver.SplitLogWorker(135): SplitLogWorker 
> vesta.apache.org,40294,1345510146310 starting
> 2012-08-21 00:49:06,537 INFO  
> [RegionServer:0;vesta.apache.org,40294,1345510146310] 
> regionserver.HRegionServer(2431): Registered RegionServer MXBean
> 2012-08-21 00:49:06,620 WARN  [Master:0;vesta.apache.org,60969,1345510146282] 
> master.AssignmentManager(1606): Failed assignment of -ROOT-,,0.70236052 to 
> vesta.apache.org,40294,1345510146310, trying to assign elsewhere instead; 
> retry=0
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: 
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not 
> running yet
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
>   at 
> org.apache.hadoop.hbase.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:187)
>   at $Proxy15.openRegion(Unknown Source)
>   at 
> org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:500)
>   at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1587)
>   at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1256)
>   at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1226)
>   at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1221)
>   at 
> org.apache.hadoop.hbase.master.AssignmentManager.assignRoot(AssignmentManager.java:2103)
>   at 
> org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:785)
>   at 
> org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:665)
>   at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:439)
>   at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.hadoop.ipc.RemoteException: 
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not 
> running yet
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1766)
>   at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1187)
>   at 
> org.apache.hadoop.hbase.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:178)
>   ... 11 more
> 2012-08-21 00:49:06,621 INFO  [Master:0;vesta.apache.org,60969,1345510146282] 
> master.RegionStates(250): Region {NAME => '-ROOT-,,0', STARTKEY => '', ENDKEY 
> => '', ENCODED => 70236052,} transitioned from {-ROOT-,,0.70236052 
> state=PENDING_OPEN, ts=1345510146520, 
> server=vesta.apache.org,40294,1345510146310} to {-ROOT-,,0.70236052 
> state=OFFLINE, ts=1345510146621, server=null}
> 2012-08-21 00:49:06,621 WARN  [Master:0;vesta.apache.org,60969,1345510146282] 
> master.AssignmentManager(1772): Can't move the region 70236052, there is no 
> destination server available.
> 2012-08-21 00:49:06,621 WARN  [Master:0;vesta.apache.org,60969,1345510146282] 
> master.AssignmentManager(1618): Unable to find a viable location to assign 
> region -ROOT-,,0.70236052
> 2012-08-21 00:50:06,406 DEBUG 
> [Master:0;vesta.apache.org,60969,1345510146282.archivedHFileCleaner] 
> cleaner.CleanerChore(145): Checking directory: 
> hdfs://localhost:56237/user/hudson/hbase/.archive/UpgradeTable
> {code}
> Looks like ROOT region couldn't be assigned.


[jira] [Updated] (HBASE-6502) Typo in loaded coprocessors on master status page

2012-08-21 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-6502:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Resolving:

{code}

r1368697 | tedyu | 2012-08-02 13:47:31 -0700 (Thu, 02 Aug 2012) | 3 lines

HBASE-6502 Typo in loaded coprocessors on master status page (Konstantin)
{code}

> Typo in loaded coprocessors on master status page
> -
>
> Key: HBASE-6502
> URL: https://issues.apache.org/jira/browse/HBASE-6502
> Project: HBase
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 0.92.1
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>Priority: Minor
>  Labels: newbie
> Fix For: 0.96.0
>
> Attachments: DoubleLoadedTypo.patch, DoubleLoadedTypo.patch
>
>
> The master status Web UI page says:
> "Coprocessors currently loaded loaded by the master"
> Should be one "loaded" there.





[jira] [Updated] (HBASE-6607) NullPointerException when accessing master web ui while master is initializing

2012-08-21 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HBASE-6607:
---

   Resolution: Fixed
Fix Version/s: 0.96.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

Integrated into trunk. Thanks Elliott for reviewing it.

> NullPointerException when accessing master web ui while master is initializing
> --
>
> Key: HBASE-6607
> URL: https://issues.apache.org/jira/browse/HBASE-6607
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.96.0
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
>Priority: Trivial
>  Labels: noob
> Fix For: 0.96.0
>
> Attachments: trunk-6607.patch
>
>
> Probably I tried to check the master web UI too soon.  I got an internal 
> error page.  In the master log, there is this exception:
> {noformat}
> 2012-08-17 16:06:25,146 ERROR org.mortbay.log: /master-status
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hbase.master.HMaster.isCatalogJanitorEnabled(HMaster.java:1213)
> at 
> org.apache.hadoop.hbase.master.MasterStatusServlet.doGet(MasterStatusServlet.java:72)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
> at 
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
> at 
> org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:835)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> at 
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> at org.mortbay.jetty.Server.handle(Server.java:326)
> at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> at 
> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
> at 
> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> {noformat}
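A minimal sketch of the kind of fix such an NPE usually calls for. The class and field names below are hypothetical, not the actual HMaster patch: guard the field that is only set once initialization completes, rather than dereferencing it.

```java
// Hypothetical sketch, not the actual HMaster fix: the servlet-visible
// accessor tolerates the window before initialization finishes.
public class InitGuardSketch {
    private Object catalogJanitor;  // null until initialization completes

    boolean isCatalogJanitorEnabled() {
        // Returning a default instead of dereferencing null avoids the
        // NPE seen when the web UI is hit during master startup.
        return catalogJanitor != null;
    }

    void finishInitialization() { catalogJanitor = new Object(); }

    public static void main(String[] args) {
        InitGuardSketch master = new InitGuardSketch();
        System.out.println(master.isCatalogJanitorEnabled()); // prints "false"
        master.finishInitialization();
        System.out.println(master.isCatalogJanitorEnabled()); // prints "true"
    }
}
```

The same shape applies to any field a status page can observe before the owning component finishes starting up: check for null (or an "initialized" flag) at the read site.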





[jira] [Commented] (HBASE-6627) TestMultiVersions.testGetRowVersions is flaky

2012-08-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438892#comment-13438892
 ] 

nkeywal commented on HBASE-6627:


Worked 100 times locally. I will commit it if there is no failure on 
TestMultiVersions in the prebuild env...
Thanks for the review, Stack.

> TestMultiVersions.testGetRowVersions is flaky
> -
>
> Key: HBASE-6627
> URL: https://issues.apache.org/jira/browse/HBASE-6627
> Project: HBase
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 0.96.0
> Environment: hadoop-qa mainly, seems to happen tests in parallel; 
> difficult to reproduce on a single test.
>Reporter: nkeywal
>Assignee: nkeywal
> Attachments: 6627.v1.patch
>
>
> org.apache.hadoop.hbase.TestMultiVersions.testGetRowVersions
> Shutting down
> Stacktrace
> java.io.IOException: Shutting down
>   at 
> org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:229)
>   at 
> org.apache.hadoop.hbase.MiniHBaseCluster.(MiniHBaseCluster.java:92)
>   at 
> org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:688)
>   at 
> org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:661)
>   at 
> org.apache.hadoop.hbase.TestMultiVersions.testGetRowVersions(TestMultiVersions.java:143)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:47)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:18)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
>   at org.junit.runners.ParentRunner$3.run





[jira] [Commented] (HBASE-6381) AssignmentManager should use the same logic for clean startup and failover

2012-08-21 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438897#comment-13438897
 ] 

Jimmy Xiang commented on HBASE-6381:


@Ram, please review it too. It's better to catch issues before committing.

> AssignmentManager should use the same logic for clean startup and failover
> --
>
> Key: HBASE-6381
> URL: https://issues.apache.org/jira/browse/HBASE-6381
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
> Attachments: hbase-6381-notes.pdf
>
>
> Currently AssignmentManager handles clean startup and failover very 
> differently.
> Different logic is mingled together so it is hard to find out which is for 
> which.
> We should clean it up and share the same logic so that AssignmentManager 
> handles both cases the same way.  This way, the code will be much easier to 
> understand and maintain.





[jira] [Updated] (HBASE-5449) Support for wire-compatible security functionality

2012-08-21 Thread Matteo Bertozzi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matteo Bertozzi updated HBASE-5449:
---

Attachment: HBASE-5449-v2.patch

Attached a new patch.

The throw when the Action doesn't match the enum during the conversion is 
still there; I've verified that, when calling getActionList(), protobuf 
already does the work of filtering out the "unknown" enum values.

> Support for wire-compatible security functionality
> --
>
> Key: HBASE-5449
> URL: https://issues.apache.org/jira/browse/HBASE-5449
> Project: HBase
>  Issue Type: Sub-task
>  Components: ipc, master, migration, regionserver
>Reporter: Todd Lipcon
>Assignee: Matteo Bertozzi
> Attachments: AccessControl_protos.patch, HBASE-5449-v0.patch, 
> HBASE-5449-v1.patch, HBASE-5449-v2.patch
>
>






[jira] [Updated] (HBASE-5449) Support for wire-compatible security functionality

2012-08-21 Thread Matteo Bertozzi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matteo Bertozzi updated HBASE-5449:
---

Status: Patch Available  (was: Open)

> Support for wire-compatible security functionality
> --
>
> Key: HBASE-5449
> URL: https://issues.apache.org/jira/browse/HBASE-5449
> Project: HBase
>  Issue Type: Sub-task
>  Components: ipc, master, migration, regionserver
>Reporter: Todd Lipcon
>Assignee: Matteo Bertozzi
> Attachments: AccessControl_protos.patch, HBASE-5449-v0.patch, 
> HBASE-5449-v1.patch, HBASE-5449-v2.patch
>
>






[jira] [Updated] (HBASE-6610) HFileLink: Hardlink alternative for snapshot restore

2012-08-21 Thread Matteo Bertozzi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matteo Bertozzi updated HBASE-6610:
---

Attachment: HBASE-6610-v1.patch

> HFileLink: Hardlink alternative for snapshot restore
> 
>
> Key: HBASE-6610
> URL: https://issues.apache.org/jira/browse/HBASE-6610
> Project: HBase
>  Issue Type: Sub-task
>  Components: io
>Affects Versions: 0.96.0
>Reporter: Matteo Bertozzi
>Assignee: Matteo Bertozzi
>  Labels: snapshot
> Fix For: 0.96.0
>
> Attachments: HBASE-6610-v1.patch
>
>
> To avoid copying data during snapshot restore, we need to introduce an HFile 
> Link that allows referencing a file that can be in the original path 
> (/hbase/table/region/cf/hfile) or, if the file is archived, in the archive 
> directory (/hbase/.archive/table/region/cf/hfile).





[jira] [Updated] (HBASE-6610) HFileLink: Hardlink alternative for snapshot restore

2012-08-21 Thread Matteo Bertozzi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matteo Bertozzi updated HBASE-6610:
---

Attachment: (was: HBASE-6610-v1.patch)

> HFileLink: Hardlink alternative for snapshot restore
> 
>
> Key: HBASE-6610
> URL: https://issues.apache.org/jira/browse/HBASE-6610
> Project: HBase
>  Issue Type: Sub-task
>  Components: io
>Affects Versions: 0.96.0
>Reporter: Matteo Bertozzi
>Assignee: Matteo Bertozzi
>  Labels: snapshot
> Fix For: 0.96.0
>
> Attachments: HBASE-6610-v1.patch
>
>
> To avoid copying data during snapshot restore, we need to introduce an HFile 
> Link that allows referencing a file that can be in the original path 
> (/hbase/table/region/cf/hfile) or, if the file is archived, in the archive 
> directory (/hbase/.archive/table/region/cf/hfile).





[jira] [Updated] (HBASE-6610) HFileLink: Hardlink alternative for snapshot restore

2012-08-21 Thread Matteo Bertozzi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matteo Bertozzi updated HBASE-6610:
---

Status: Patch Available  (was: Open)

> HFileLink: Hardlink alternative for snapshot restore
> 
>
> Key: HBASE-6610
> URL: https://issues.apache.org/jira/browse/HBASE-6610
> Project: HBase
>  Issue Type: Sub-task
>  Components: io
>Affects Versions: 0.96.0
>Reporter: Matteo Bertozzi
>Assignee: Matteo Bertozzi
>  Labels: snapshot
> Fix For: 0.96.0
>
> Attachments: HBASE-6610-v1.patch
>
>
> To avoid copying data during snapshot restore, we need to introduce an HFile 
> Link that allows referencing a file that can be in the original path 
> (/hbase/table/region/cf/hfile) or, if the file is archived, in the archive 
> directory (/hbase/.archive/table/region/cf/hfile).





[jira] [Updated] (HBASE-6610) HFileLink: Hardlink alternative for snapshot restore

2012-08-21 Thread Matteo Bertozzi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matteo Bertozzi updated HBASE-6610:
---

Attachment: HBASE-6610-v1.patch

> HFileLink: Hardlink alternative for snapshot restore
> 
>
> Key: HBASE-6610
> URL: https://issues.apache.org/jira/browse/HBASE-6610
> Project: HBase
>  Issue Type: Sub-task
>  Components: io
>Affects Versions: 0.96.0
>Reporter: Matteo Bertozzi
>Assignee: Matteo Bertozzi
>  Labels: snapshot
> Fix For: 0.96.0
>
> Attachments: HBASE-6610-v1.patch
>
>
> To avoid copying data during snapshot restore, we need to introduce an HFile 
> Link that allows referencing a file that can be in the original path 
> (/hbase/table/region/cf/hfile) or, if the file is archived, in the archive 
> directory (/hbase/.archive/table/region/cf/hfile).





[jira] [Commented] (HBASE-6414) Remove the WritableRpcEngine & associated Invocation classes

2012-08-21 Thread Devaraj Das (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438904#comment-13438904
 ] 

Devaraj Das commented on HBASE-6414:


bq. I'm fixing it in HBASE-6364. I didn't check if there were other stuff like 
this.

Thanks, Nicolas. Will fix any remaining ones in HBASE-6614.

> Remove the WritableRpcEngine & associated Invocation classes
> 
>
> Key: HBASE-6414
> URL: https://issues.apache.org/jira/browse/HBASE-6414
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 0.96.0
>Reporter: Devaraj Das
>Assignee: Devaraj Das
> Fix For: 0.96.0
>
> Attachments: 6414-1.patch.txt, 6414-3.patch.txt, 6414-4.patch.txt, 
> 6414-4.patch.txt, 6414-5.patch.txt, 6414-5.patch.txt, 6414-5.patch.txt, 
> 6414-6.patch.txt, 6414-6.patch.txt, 6414-6.txt, 6414-initial.patch.txt, 
> 6414-initial.patch.txt, 6414-v7.txt
>
>
> Remove the WritableRpcEngine & Invocation classes once HBASE-5705 gets 
> committed and all the protocols are rebased to use PB.
> Raising this jira in advance..





[jira] [Commented] (HBASE-6621) Reduce calls to Bytes.toInt

2012-08-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438921#comment-13438921
 ] 

Hudson commented on HBASE-6621:
---

Integrated in HBase-0.94 #411 (See 
[https://builds.apache.org/job/HBase-0.94/411/])
HBASE-6621 Reduce calls to Bytes.toInt (Revision 1375665)

 Result = FAILURE
larsh : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/KeyValue.java
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java


> Reduce calls to Bytes.toInt
> ---
>
> Key: HBASE-6621
> URL: https://issues.apache.org/jira/browse/HBASE-6621
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Minor
> Fix For: 0.96.0, 0.94.2
>
> Attachments: 6621-0.96.txt, 6621-0.96-v2.txt, 6621-0.96-v3.txt, 
> 6621-0.96-v4.txt
>
>
> Bytes.toInt shows up quite often in a profiler run.
> It turns out that one source is HFileReaderV2$ScannerV2.getKeyValue().
> Notice that we call the KeyValue(byte[], int) constructor, which forces the 
> constructor to determine its size by reading some of the header information 
> and calculate the size. In this case, however, we already know the size (from 
> the call to readKeyValueLen), so we could just use that.
> In the extreme case of 1's of columns this noticeably reduces CPU. 
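The optimization above can be reduced to a small sketch. This is illustrative only, not the real KeyValue class: `CellSketch` models a cell whose serialized form starts with a 4-byte key length and a 4-byte value length, so the two-argument constructor must decode both ints just to learn the total size, while the three-argument constructor lets a caller that already knows the size (as the scanner does after `readKeyValueLen`) skip that work.

```java
import java.nio.ByteBuffer;

/** Hypothetical reduction of the KeyValue size issue described above. */
public class CellSketch {
    final byte[] buf;
    final int offset;
    final int length;

    // Slow path: derive the total length by decoding the header ints.
    public CellSketch(byte[] buf, int offset) {
        this(buf, offset,
             8 + ByteBuffer.wrap(buf, offset, 4).getInt()
               + ByteBuffer.wrap(buf, offset + 4, 4).getInt());
    }

    // Fast path: the caller supplies the already-known total length.
    public CellSketch(byte[] buf, int offset, int length) {
        this.buf = buf;
        this.offset = offset;
        this.length = length;
    }

    public static void main(String[] args) {
        byte[] b = ByteBuffer.allocate(13)
                .putInt(3)                   // key length
                .putInt(2)                   // value length
                .put("keyva".getBytes())     // 3-byte key + 2-byte value
                .array();
        // Both paths agree on the size; the second avoids re-parsing.
        System.out.println(new CellSketch(b, 0).length);
        System.out.println(new CellSketch(b, 0, 13).length);
    }
}
```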





[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438931#comment-13438931
 ] 

stack commented on HBASE-6364:
--

+1 on backport so we get benefit of the MTTR work sooner.

@Nkeywal Should we lower the default ipc.socket.timeout too? I don't see that 
in your patch?

> Powering down the server host holding the .META. table causes HBase Client to 
> take excessively long to recover and connect to reassigned .META. table
> -
>
> Key: HBASE-6364
> URL: https://issues.apache.org/jira/browse/HBASE-6364
> Project: HBase
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.90.6, 0.92.1, 0.94.0
>Reporter: Suraj Varma
>Assignee: nkeywal
>  Labels: client
> Fix For: 0.96.0
>
> Attachments: 6364-host-serving-META.v1.patch, 
> 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 
> 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 
> 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 
> 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt
>
>
> When a server host with a Region Server holding the .META. table is powered 
> down on a live cluster, while the HBase cluster itself detects and reassigns 
> the .META. table, connected HBase Clients take an excessively long time to 
> detect this and re-discover the reassigned .META. 
> Workaround: Decrease the ipc.socket.timeout on the HBase Client side to a low 
> value (the default of 20s led to a 35 minute recovery time; we were able to get 
> acceptable results with 100ms, giving a 3 minute recovery). 
> This was found during some hardware failure testing scenarios. 
> Test Case:
> 1) Apply load via client app on HBase cluster for several minutes
> 2) Power down the region server holding the .META. server (i.e. power off ... 
> and keep it off)
> 3) Measure how long it takes for cluster to reassign META table and for 
> client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
> and DN on that host).
> Observation:
> 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
> recover (i.e. for the thread count to go back to normal) - no client calls 
> are serviced - they just back up on a synchronized method (see #2 below)
> 2) All the client app threads queue up behind the 
> oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
> After taking several thread dumps we found that the thread within this 
> synchronized method was blocked on  NetUtils.connect(this.socket, 
> remoteId.getAddress(), getSocketTimeout(conf));
> The client thread that gets the synchronized lock would try to connect to the 
> dead RS (till socket times out after 20s), retries, and then the next thread 
> gets in and so forth in a serial manner.
> Workaround:
> ---
> Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
> (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
> the client threads recovered in a couple of minutes by failing fast and 
> re-discovering the .META. table on a reassigned RS.
> Assumption: This ipc.socket.timeout is only ever used during the initial 
> "HConnection" setup via the NetUtils.connect and should only ever be used 
> when connectivity to a region server is lost and needs to be re-established. 
> i.e. it does not affect the normal "RPC" activity as this is just the connect 
> timeout.
> During RS GC periods, any _new_ clients trying to connect will fail and will 
> require .META. table re-lookups.
> This above timeout workaround is only for the HBase client side.
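The client-side workaround described above comes down to a single configuration property. A hypothetical hbase-site.xml fragment might look like the following; the value of 1000 ms is taken from the range quoted above and should be tuned for your environment:

```xml
<!-- Client-side hbase-site.xml: lower the connect timeout so threads
     blocked on a dead region server fail fast instead of waiting the
     default 20 seconds per attempt. -->
<property>
  <name>ipc.socket.timeout</name>
  <value>1000</value>
</property>
```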





[jira] [Commented] (HBASE-6443) HLogSplitter should ignore 0 length files

2012-08-21 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438937#comment-13438937
 ] 

ramkrishna.s.vasudevan commented on HBASE-6443:
---

I can share the patch with you tomorrow.  

> HLogSplitter should ignore 0 length files
> -
>
> Key: HBASE-6443
> URL: https://issues.apache.org/jira/browse/HBASE-6443
> Project: HBase
>  Issue Type: Bug
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
> Fix For: 0.96.0, 0.94.1
>
>
> Somehow, some WAL files have size 0. Distributed log splitting can't handle 
> it.
> HLogSplitter should ignore them.
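The proposed behavior above can be sketched as a simple filter over the WAL file listing. This is a hedged illustration, not the actual HLogSplitter code: `LogSplitFilterSketch` and `filesToSplit` are hypothetical names, and `FileInfo` is a stand-in for Hadoop's `FileStatus`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch: skip zero-length WAL files when collecting
 *  the files a log splitter should process. */
public class LogSplitFilterSketch {
    public static class FileInfo {
        final String path;
        final long length;
        public FileInfo(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    public static List<FileInfo> filesToSplit(List<FileInfo> logs) {
        List<FileInfo> out = new ArrayList<>();
        for (FileInfo f : logs) {
            if (f.length == 0) {
                continue;              // empty WAL file: nothing to replay
            }
            out.add(f);
        }
        return out;
    }

    public static void main(String[] args) {
        List<FileInfo> logs = Arrays.asList(
                new FileInfo("wal.1", 1024),
                new FileInfo("wal.2", 0));
        System.out.println(filesToSplit(logs).size());   // only wal.1 remains
    }
}
```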





[jira] [Commented] (HBASE-643) Rename tables

2012-08-21 Thread Sameer Vaishampayan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438938#comment-13438938
 ] 

Sameer Vaishampayan commented on HBASE-643:
---

@stack It seems this bug can be closed?

> Rename tables
> -
>
> Key: HBASE-643
> URL: https://issues.apache.org/jira/browse/HBASE-643
> Project: HBase
>  Issue Type: New Feature
>Reporter: Michael Bieniosek
> Attachments: copy_table.rb, rename_table.rb
>
>
> It would be nice to be able to rename tables, if this is possible.  Some of 
> our internal users are doing things like: upload table mytable -> realize 
> they screwed up -> upload table mytable_2 -> decide mytable_2 looks better -> 
> have to go on using mytable_2 instead of originally desired table name.





[jira] [Commented] (HBASE-6607) NullPointerException when accessing master web ui while master is initializing

2012-08-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438943#comment-13438943
 ] 

Hudson commented on HBASE-6607:
---

Integrated in HBase-TRUNK #3250 (See 
[https://builds.apache.org/job/HBase-TRUNK/3250/])
HBASE-6607 NullPointerException when accessing master web ui while master 
is initializing (Revision 1375673)

 Result = FAILURE
jxiang : 
Files : 
* 
/hbase/trunk/hbase-server/src/main/jamon/org/apache/hadoop/hbase/tmpl/master/MasterStatusTmpl.jamon
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java


> NullPointerException when accessing master web ui while master is initializing
> --
>
> Key: HBASE-6607
> URL: https://issues.apache.org/jira/browse/HBASE-6607
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.96.0
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
>Priority: Trivial
>  Labels: noob
> Fix For: 0.96.0
>
> Attachments: trunk-6607.patch
>
>
> I probably tried to check the master web UI too soon and got an internal 
> error page.  In the master log, there is this exception:
> {noformat}
> 2012-08-17 16:06:25,146 ERROR org.mortbay.log: /master-status
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hbase.master.HMaster.isCatalogJanitorEnabled(HMaster.java:1213)
> at 
> org.apache.hadoop.hbase.master.MasterStatusServlet.doGet(MasterStatusServlet.java:72)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
> at 
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
> at 
> org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:835)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> at 
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> at org.mortbay.jetty.Server.handle(Server.java:326)
> at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> at 
> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
> at 
> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> {noformat}
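The failure mode in the stack trace above (a servlet thread dereferencing master state before initialization has set it) can be sketched with a null-tolerant accessor. The names here are illustrative, not the real HMaster fields; the point is that the accessor reports "not ready" instead of throwing.

```java
/** Minimal sketch of the NPE described above and its guard. */
public class MasterInitSketch {
    private volatile Boolean catalogJanitorEnabled; // null until startup completes

    public boolean isCatalogJanitorEnabled() {
        Boolean enabled = catalogJanitorEnabled;    // read the field once
        return enabled != null && enabled;          // "not ready" reads as disabled
    }

    public void finishInitialization() {
        catalogJanitorEnabled = Boolean.TRUE;
    }

    public static void main(String[] args) {
        MasterInitSketch m = new MasterInitSketch();
        // A web-UI request arriving before initialization no longer NPEs.
        System.out.println(m.isCatalogJanitorEnabled());
        m.finishInitialization();
        System.out.println(m.isCatalogJanitorEnabled());
    }
}
```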





[jira] [Commented] (HBASE-6621) Reduce calls to Bytes.toInt

2012-08-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438944#comment-13438944
 ] 

Hudson commented on HBASE-6621:
---

Integrated in HBase-TRUNK #3250 (See 
[https://builds.apache.org/job/HBase-TRUNK/3250/])
HBASE-6621 Reduce calls to Bytes.toInt (Revision 1375663)

 Result = FAILURE
larsh : 
Files : 
* /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/KeyValue.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java


> Reduce calls to Bytes.toInt
> ---
>
> Key: HBASE-6621
> URL: https://issues.apache.org/jira/browse/HBASE-6621
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Minor
> Fix For: 0.96.0, 0.94.2
>
> Attachments: 6621-0.96.txt, 6621-0.96-v2.txt, 6621-0.96-v3.txt, 
> 6621-0.96-v4.txt
>
>
> Bytes.toInt shows up quite often in a profiler run.
> It turns out that one source is HFileReaderV2$ScannerV2.getKeyValue().
> Notice that we call the KeyValue(byte[], int) constructor, which forces the 
> constructor to determine its size by reading some of the header information 
> and calculate the size. In this case, however, we already know the size (from 
> the call to readKeyValueLen), so we could just use that.
> In the extreme case of 1's of columns this noticeably reduces CPU. 





[jira] [Commented] (HBASE-6381) AssignmentManager should use the same logic for clean startup and failover

2012-08-21 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438945#comment-13438945
 ] 

ramkrishna.s.vasudevan commented on HBASE-6381:
---

I have reviewed most of it; a little remains, which I will check tomorrow.

> AssignmentManager should use the same logic for clean startup and failover
> --
>
> Key: HBASE-6381
> URL: https://issues.apache.org/jira/browse/HBASE-6381
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
> Attachments: hbase-6381-notes.pdf
>
>
> Currently AssignmentManager handles clean startup and failover very 
> differently.
> Different logic is mingled together so it is hard to find out which is for 
> which.
> We should clean it up and share the same logic so that AssignmentManager 
> handles
> both cases the same way.  This way, the code will be much easier to understand 
> and
> maintain.





[jira] [Commented] (HBASE-643) Rename tables

2012-08-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438946#comment-13438946
 ] 

stack commented on HBASE-643:
-

@Sameer I don't think so.  The rename script was recently removed because it 
had rotted; it had not been updated to match the changed API.  We really need 
something like ids for tables, something like what Keith talks of in the above, 
so a rename is a near-costless operation (as opposed to the rewrite of .META. 
and the HDFS dir name that the rename script used to do).

> Rename tables
> -
>
> Key: HBASE-643
> URL: https://issues.apache.org/jira/browse/HBASE-643
> Project: HBase
>  Issue Type: New Feature
>Reporter: Michael Bieniosek
> Attachments: copy_table.rb, rename_table.rb
>
>
> It would be nice to be able to rename tables, if this is possible.  Some of 
> our internal users are doing things like: upload table mytable -> realize 
> they screwed up -> upload table mytable_2 -> decide mytable_2 looks better -> 
> have to go on using mytable_2 instead of originally desired table name.





[jira] [Commented] (HBASE-6610) HFileLink: Hardlink alternative for snapshot restore

2012-08-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438948#comment-13438948
 ] 

Hadoop QA commented on HBASE-6610:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12541787/HBASE-6610-v1.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 5 javac compiler warnings (more than 
the trunk's current 4 warnings).

-1 findbugs.  The patch appears to introduce 7 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.regionserver.TestStore
  org.apache.hadoop.hbase.TestMultiVersions

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2641//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2641//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2641//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2641//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2641//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2641//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2641//console

This message is automatically generated.

> HFileLink: Hardlink alternative for snapshot restore
> 
>
> Key: HBASE-6610
> URL: https://issues.apache.org/jira/browse/HBASE-6610
> Project: HBase
>  Issue Type: Sub-task
>  Components: io
>Affects Versions: 0.96.0
>Reporter: Matteo Bertozzi
>Assignee: Matteo Bertozzi
>  Labels: snapshot
> Fix For: 0.96.0
>
> Attachments: HBASE-6610-v1.patch
>
>
> To avoid copying data during snapshot restore we need to introduce an HFile 
> Link that allows referencing a file either in its original path 
> (/hbase/table/region/cf/hfile) or, if the file has been archived, in the archive 
> directory (/hbase/.archive/table/region/cf/hfile).





[jira] [Commented] (HBASE-6603) RegionMetricsStorage.incrNumericMetric is called too often

2012-08-21 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438954#comment-13438954
 ] 

Lars Hofhansl commented on HBASE-6603:
--

Also note that for scans with few columns the collection of this metric still 
shows up and consumes a significant portion of the time.
Are these metrics that useful (total get size, total next size)? Can we remove 
them (or at least optionally not collect them)?

> RegionMetricsStorage.incrNumericMetric is called too often
> --
>
> Key: HBASE-6603
> URL: https://issues.apache.org/jira/browse/HBASE-6603
> Project: HBase
>  Issue Type: Bug
>  Components: performance
>Reporter: Lars Hofhansl
>Assignee: M. Chen
> Fix For: 0.96.0, 0.94.2
>
> Attachments: 6503-0.96.txt, 6603-0.94.txt
>
>
> Running an HBase scan load through the profiler revealed that 
> RegionMetricsStorage.incrNumericMetric is called way too often.
> It turns out that we make this call for *each* KV in StoreScanner.next(...).
> Incrementing AtomicLong requires expensive memory barriers.
> The observation here is that StoreScanner.next(...) can maintain a simple 
> long in its internal loop and only update the metric upon exit. Thus the 
> AtomicLong is not updated nearly as often.
> That cuts about 10% runtime from scan only load (I'll quantify this better 
> soon).
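The change described above can be sketched as follows. This is an illustrative reduction, not the actual StoreScanner code: `nextSizeMetric` and `scanBatch` are hypothetical names. The scan loop accumulates into a plain local `long` (no memory barrier per KeyValue) and publishes to the shared AtomicLong once on exit.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of batching per-KV metric updates into one atomic write. */
public class MetricBatchSketch {
    public static final AtomicLong nextSizeMetric = new AtomicLong();

    public static void scanBatch(int[] kvSizes) {
        long local = 0;                      // plain local: no barrier per KV
        for (int size : kvSizes) {
            local += size;                   // per-KV work stays cheap
        }
        nextSizeMetric.addAndGet(local);     // single atomic update on exit
    }

    public static void main(String[] args) {
        scanBatch(new int[] {10, 20, 30});
        System.out.println(nextSizeMetric.get());
    }
}
```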





[jira] [Commented] (HBASE-6603) RegionMetricsStorage.incrNumericMetric is called too often

2012-08-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438958#comment-13438958
 ] 

stack commented on HBASE-6603:
--

@Lars That's another issue?

> RegionMetricsStorage.incrNumericMetric is called too often
> --
>
> Key: HBASE-6603
> URL: https://issues.apache.org/jira/browse/HBASE-6603
> Project: HBase
>  Issue Type: Bug
>  Components: performance
>Reporter: Lars Hofhansl
>Assignee: M. Chen
> Fix For: 0.96.0, 0.94.2
>
> Attachments: 6503-0.96.txt, 6603-0.94.txt
>
>
> Running an HBase scan load through the profiler revealed that 
> RegionMetricsStorage.incrNumericMetric is called way too often.
> It turns out that we make this call for *each* KV in StoreScanner.next(...).
> Incrementing AtomicLong requires expensive memory barriers.
> The observation here is that StoreScanner.next(...) can maintain a simple 
> long in its internal loop and only update the metric upon exit. Thus the 
> AtomicLong is not updated nearly as often.
> That cuts about 10% runtime from scan only load (I'll quantify this better 
> soon).





[jira] [Commented] (HBASE-5449) Support for wire-compatible security functionality

2012-08-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438965#comment-13438965
 ] 

Hadoop QA commented on HBASE-5449:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12541785/HBASE-5449-v2.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 5 javac compiler warnings (more than 
the trunk's current 4 warnings).

-1 findbugs.  The patch appears to introduce 7 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.backup.example.TestZooKeeperTableArchiveClient

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2642//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2642//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2642//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2642//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2642//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2642//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2642//console

This message is automatically generated.

> Support for wire-compatible security functionality
> --
>
> Key: HBASE-5449
> URL: https://issues.apache.org/jira/browse/HBASE-5449
> Project: HBase
>  Issue Type: Sub-task
>  Components: ipc, master, migration, regionserver
>Reporter: Todd Lipcon
>Assignee: Matteo Bertozzi
> Attachments: AccessControl_protos.patch, HBASE-5449-v0.patch, 
> HBASE-5449-v1.patch, HBASE-5449-v2.patch
>
>






[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438968#comment-13438968
 ] 

nkeywal commented on HBASE-6364:


bq. +1 on backport so we get benefit of the MTTR work sooner.
I can do it if you like.

bq. @Nkeywal Should we lower the default ipc.socket.timeout too? I don't see 
that in your patch?
Doable, there will be no side effect, it's used only in HBaseClient. How do you 
want it:
- safe: 10 seconds
- reasonably aggressive: 5 seconds
- instructive: 1 second

20 seconds is a common timeout value (used in the OS, if I remember correctly). Not 
subject to GC effects. I don't know about large clusters under large failure 
conditions (with switches). Especially, in this case, it will be all the 
clients connecting to a single server (the one with meta), so likely going 
through a single network link at the end. I would vote for 10s, but 5s is doable. 
1s could work, who knows?


> Powering down the server host holding the .META. table causes HBase Client to 
> take excessively long to recover and connect to reassigned .META. table
> -
>
> Key: HBASE-6364
> URL: https://issues.apache.org/jira/browse/HBASE-6364
> Project: HBase
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.90.6, 0.92.1, 0.94.0
>Reporter: Suraj Varma
>Assignee: nkeywal
>  Labels: client
> Fix For: 0.96.0
>
> Attachments: 6364-host-serving-META.v1.patch, 
> 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 
> 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 
> 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 
> 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt
>
>
> When a server host with a Region Server holding the .META. table is powered 
> down on a live cluster, while the HBase cluster itself detects and reassigns 
> the .META. table, connected HBase Clients take an excessively long time to 
> detect this and re-discover the reassigned .META. 
> Workaround: Decrease the ipc.socket.timeout on the HBase Client side to a low 
> value (the default of 20s led to a 35 minute recovery time; we were able to get 
> acceptable results with 100ms, giving a 3 minute recovery). 
> This was found during some hardware failure testing scenarios. 
> Test Case:
> 1) Apply load via client app on HBase cluster for several minutes
> 2) Power down the region server holding the .META. server (i.e. power off ... 
> and keep it off)
> 3) Measure how long it takes for cluster to reassign META table and for 
> client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
> and DN on that host).
> Observation:
> 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
> recover (i.e. for the thread count to go back to normal) - no client calls 
> are serviced - they just back up on a synchronized method (see #2 below)
> 2) All the client app threads queue up behind the 
> oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
> After taking several thread dumps we found that the thread within this 
> synchronized method was blocked on  NetUtils.connect(this.socket, 
> remoteId.getAddress(), getSocketTimeout(conf));
> The client thread that gets the synchronized lock would try to connect to the 
> dead RS (till socket times out after 20s), retries, and then the next thread 
> gets in and so forth in a serial manner.
> Workaround:
> ---
> Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
> (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
> the client threads recovered in a couple of minutes by failing fast and 
> re-discovering the .META. table on a reassigned RS.
> Assumption: This ipc.socket.timeout is only ever used during the initial 
> "HConnection" setup via the NetUtils.connect and should only ever be used 
> when connectivity to a region server is lost and needs to be re-established. 
> i.e. it does not affect the normal "RPC" activity as this is just the connect 
> timeout.
> During RS GC periods, any _new_ clients trying to connect will fail and will 
> require .META. table re-lookups.
> This above timeout workaround is only for the HBase client side.





[jira] [Updated] (HBASE-6524) Hooks for hbase tracing

2012-08-21 Thread Jonathan Leavitt (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Leavitt updated HBASE-6524:


Attachment: createTableTrace.png

A trace of a createTable. 

> Hooks for hbase tracing
> ---
>
> Key: HBASE-6524
> URL: https://issues.apache.org/jira/browse/HBASE-6524
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Jonathan Leavitt
> Attachments: createTableTrace.png, hbase-6524.diff
>
>
> Includes the hooks that use [htrace|http://www.github.com/cloudera/htrace] 
> library to add dapper-like tracing to hbase.





[jira] [Commented] (HBASE-6524) Hooks for hbase tracing

2012-08-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438973#comment-13438973
 ] 

Hadoop QA commented on HBASE-6524:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12541798/createTableTrace.png
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

-1 patch.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2643//console

This message is automatically generated.

> Hooks for hbase tracing
> ---
>
> Key: HBASE-6524
> URL: https://issues.apache.org/jira/browse/HBASE-6524
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Jonathan Leavitt
> Attachments: createTableTrace.png, hbase-6524.diff
>
>
> Includes the hooks that use [htrace|http://www.github.com/cloudera/htrace] 
> library to add dapper-like tracing to hbase.





[jira] [Commented] (HBASE-6524) Hooks for hbase tracing

2012-08-21 Thread Jonathan Leavitt (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438980#comment-13438980
 ] 

Jonathan Leavitt commented on HBASE-6524:
-

@Stack. I uploaded another image.  As you may notice, the annotations on this 
image are less descriptive than those in the first image I uploaded.  This 
is because the toString() on HBaseServer.Call is different with the 
ProtobufRpcEngine: it now just concatenates the toString of the rpcRequestBody 
with the toString of the connection.  Unfortunately the default protobuf 
toString does not give the most useful information, but you can still make out 
the basics of the requests.

bq. Yeah, that sounds good. Can do that in new JIRA?
Yes. I think that this commit is good for including htrace in maven 
dependencies and the basic instrumentation, and additional instrumentation 
should go in other JIRAs. 

{quote}I just think this is a topic that will be of general interest.
For refguide, a sentence or two w/ pointers on how to set it all up would be 
great. {quote}

I'll try to have some sort of blog post and ref guide stuff done today or 
tomorrow. 

bq. I think we should commit this if its fine by you.
Sounds good. 



> Hooks for hbase tracing
> ---
>
> Key: HBASE-6524
> URL: https://issues.apache.org/jira/browse/HBASE-6524
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Jonathan Leavitt
> Attachments: createTableTrace.png, hbase-6524.diff
>
>
> Includes the hooks that use [htrace|http://www.github.com/cloudera/htrace] 
> library to add dapper-like tracing to hbase.





[jira] [Commented] (HBASE-899) Support for specifying a timestamp and numVersions on a per-column basis

2012-08-21 Thread Sameer Vaishampayan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439001#comment-13439001
 ] 

Sameer Vaishampayan commented on HBASE-899:
---

@Jonathan, any update on this bug?  Given that it was supposed to be solved as 
part of 1249, is it now "closeable"?

> Support for specifying a timestamp and numVersions on a per-column basis
> 
>
> Key: HBASE-899
> URL: https://issues.apache.org/jira/browse/HBASE-899
> Project: HBase
>  Issue Type: New Feature
>Reporter: Doğacan Güney
>
> This is just an idea and it may be better to wait after the planned API 
> changes. But I think it would be useful to support fetching different 
> timestamps and versions for different columns.
> Example:
> If a row has 2 columns, "col1:" and "col2:" I want to be able to ask for 
> (during scan or read time, doesn't matter) 2 versions of "col1:" (maybe even 
> between timestamps t1 and t2) but only 1 version of "col2:". This would be 
> especially handy if during an MR job you have to read 2 versions of a small 
> column, but do not want the overhead of reading 2 versions of every other 
> column too.
> (Also, the mechanism is already there. I mean, making the changes to support 
> a per-column timestamp/numVersions is ridiculously easy :)
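The requested behavior is indeed easy to sketch outside HBase. The following is not an HBase API — just a hypothetical illustration (in Python, with made-up names) of trimming a newest-first cell list against a per-column version limit, which is the core of what the description asks for:

```python
def select_versions(cells, spec, default_versions=1):
    """Trim a newest-first list of (column, timestamp, value) cells to a
    per-column version limit; columns missing from spec fall back to
    default_versions."""
    seen, out = {}, []
    for col, ts, val in cells:
        if seen.get(col, 0) < spec.get(col, default_versions):
            out.append((col, ts, val))
            seen[col] = seen.get(col, 0) + 1
    return out

# Two versions of "col1:" but only the newest "col2:", as in the example above.
cells = [("col1:", 30, "c"), ("col1:", 20, "b"), ("col1:", 10, "a"),
         ("col2:", 30, "z"), ("col2:", 20, "y")]
print(select_versions(cells, {"col1:": 2}))
# [('col1:', 30, 'c'), ('col1:', 20, 'b'), ('col2:', 30, 'z')]
```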





[jira] [Updated] (HBASE-6045) copytable: change --peer.adr to --peer.addr

2012-08-21 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-6045:
-

Attachment: hbase-6045.patch

Retry

> copytable: change --peer.adr to --peer.addr
> ---
>
> Key: HBASE-6045
> URL: https://issues.apache.org/jira/browse/HBASE-6045
> Project: HBase
>  Issue Type: Improvement
>Reporter: Jonathan Hsieh
>Assignee: Jonathan Hsieh
>Priority: Trivial
> Attachments: hbase-6045-94.patch, hbase-6045.patch, hbase-6045.patch, 
> hbase-6045.patch, hbase-6045.patch
>
>
> In discussion of HBASE-6013 it was suggested that we change --peer.adr to the 
> more English-centric --peer.addr.  We would keep the old value present in 
> 0.90/0.92/0.94 and remove it from 0.96/trunk.





[jira] [Updated] (HBASE-6045) copytable: change --peer.adr to --peer.addr

2012-08-21 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-6045:
-

Status: Open  (was: Patch Available)

> copytable: change --peer.adr to --peer.addr
> ---
>
> Key: HBASE-6045
> URL: https://issues.apache.org/jira/browse/HBASE-6045
> Project: HBase
>  Issue Type: Improvement
>Reporter: Jonathan Hsieh
>Assignee: Jonathan Hsieh
>Priority: Trivial
> Attachments: hbase-6045-94.patch, hbase-6045.patch, hbase-6045.patch, 
> hbase-6045.patch, hbase-6045.patch
>
>
> In discussion of HBASE-6013 it was suggested that we change --peer.adr to the 
> more English-centric --peer.addr.  We would keep the old value present in 
> 0.90/0.92/0.94 and remove it from 0.96/trunk.





[jira] [Updated] (HBASE-6103) Optimize the HBaseServer to deserialize the data for each ipc connection in parallel

2012-08-21 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-6103:
-

  Component/s: performance
 Priority: Critical  (was: Major)
Fix Version/s: 0.96.0

Bringing into 0.96.  See if we can forward-port this fix of Liyin's.

> Optimize the HBaseServer to deserialize the data for each ipc connection in 
> parallel
> 
>
> Key: HBASE-6103
> URL: https://issues.apache.org/jira/browse/HBASE-6103
> Project: HBase
>  Issue Type: Improvement
>  Components: performance
>Reporter: Liyin Tang
>Assignee: Liyin Tang
>Priority: Critical
> Fix For: 0.96.0
>
> Attachments: HBASE-6103-fb-89.patch
>
>
> Currently HBaseServer is running with a single listener thread, which is 
> responsible for accepting the connection, reading the data from network 
> channel, deserializing the data into writable objects and handover to the IPC 
> handler threads. 
> When there are multiple hbase clients connecting to the region server 
> (HBaseServer) and reading/writing a large set of data, the listener and the 
> responder thread will be a performance bottleneck. 
> So the solution is to deserialize the data for each ipc connection in 
> parallel in HBaseServer.
> BTW, this is also one of the reasons that parallel scanning from multiple 
> clients is far slower than the single-client case.
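The proposed change can be sketched as follows (hypothetical names, and a Python stand-in for brevity — the real HBaseServer is Java and far more involved): the single listener only pulls raw frames off the wire and hands the CPU-heavy deserialization to a worker pool, one task per connection, instead of decoding inline:

```python
import json
from concurrent.futures import ThreadPoolExecutor

def deserialize(frame):
    # Stand-in for turning raw wire bytes into a call object; in the real
    # server this is the expensive writable/protobuf decoding step.
    return json.loads(frame.decode("utf-8"))

def listener_loop(raw_frames, pool):
    """The single listener only reads frames and submits the
    deserialization work to the pool, so it never becomes the
    per-connection decoding bottleneck."""
    futures = [pool.submit(deserialize, frame) for frame in raw_frames]
    return [f.result() for f in futures]

# Usage: three frames from three connections, deserialized in parallel.
frames = [b'{"call": 0}', b'{"call": 1}', b'{"call": 2}']
with ThreadPoolExecutor(max_workers=2) as pool:
    calls = listener_loop(frames, pool)
print(calls)  # [{'call': 0}, {'call': 1}, {'call': 2}]
```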





[jira] [Commented] (HBASE-6371) Level based compaction

2012-08-21 Thread Nicolas Spiegelberg (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439025#comment-13439025
 ] 

Nicolas Spiegelberg commented on HBASE-6371:


@Lars: I think we want to put level-based & tiered compactions in the core 
instead of as coprocessors because these are generic strategies versus 
app-specific logic.

@Akashnil: the algorithm you describe is technically referred to as a "tiered 
compaction".  DataStax has a nice writeup on tiered compactions versus 
level-based: 
http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra

> Level based compaction
> --
>
> Key: HBASE-6371
> URL: https://issues.apache.org/jira/browse/HBASE-6371
> Project: HBase
>  Issue Type: Improvement
>Reporter: Akashnil
>Assignee: Akashnil
>
> Currently, the compaction selection is not very flexible and is not sensitive 
> to the hotness of the data. Very old data is likely to be accessed less, and 
> very recent data is likely to be in the block cache. Both of these 
> considerations make it inefficient to compact these files as aggressively as 
> other files. In some use-cases, the access-pattern is particularly obvious 
> even though there is no way to control the compaction algorithm in those 
> cases.
> In the new compaction selection algorithm, we plan to divide the candidate 
> files into different levels according to oldness of the data that is present 
> in those files. For each level, parameters like compaction ratio, minimum 
> number of store-files in each compaction may be different. Number of levels, 
> time-ranges, and parameters for each level will be configurable online on a 
> per-column family basis.
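The selection scheme the description proposes can be sketched like this (hypothetical boundaries and names; the actual change would live in HBase's Java compaction code): files are bucketed into levels by the age of their data, and each level would then carry its own compaction ratio and min-files parameters:

```python
def assign_level(file_age_ms, level_boundaries_ms):
    """Bucket a store file into the first level whose age boundary it
    falls under; files older than every boundary land in the last level."""
    for level, boundary in enumerate(level_boundaries_ms):
        if file_age_ms < boundary:
            return level
    return len(level_boundaries_ms)

# Hypothetical per-column-family config: hot (< 1 hour), warm (< 1 day),
# cold (everything older).
boundaries = [3_600_000, 86_400_000]
ages = [60_000, 7_200_000, 900_000_000]
print([assign_level(a, boundaries) for a in ages])  # [0, 1, 2]
```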





[jira] [Created] (HBASE-6628) Add HBASE-6059 to 0.94 branch

2012-08-21 Thread stack (JIRA)
stack created HBASE-6628:


 Summary: Add HBASE-6059 to 0.94 branch
 Key: HBASE-6628
 URL: https://issues.apache.org/jira/browse/HBASE-6628
 Project: HBase
  Issue Type: Task
Reporter: stack
 Fix For: 0.94.2


Look at adding HBASE-6059 to 0.94.  It's in trunk.





[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-08-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439037#comment-13439037
 ] 

stack commented on HBASE-4676:
--

You need review on this Matt?

> Prefix Compression - Trie data block encoding
> -
>
> Key: HBASE-4676
> URL: https://issues.apache.org/jira/browse/HBASE-4676
> Project: HBase
>  Issue Type: New Feature
>  Components: io, performance, regionserver
>Affects Versions: 0.90.6
>Reporter: Matt Corgan
>Assignee: Matt Corgan
> Attachments: HBASE-4676-0.94-v1.patch, hbase-prefix-trie-0.1.jar, 
> PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by 
> blockSize.png
>
>
> The HBase data block format has room for 2 significant improvements for 
> applications that have high block cache hit ratios.  
> First, there is no prefix compression, and the current KeyValue format is 
> somewhat metadata heavy, so there can be tremendous memory bloat for many 
> common data layouts, specifically those with long keys and short values.
> Second, there is no random access to KeyValues inside data blocks.  This 
> means that every time you double the datablock size, average seek time (or 
> average cpu consumption) goes up by a factor of 2.  The standard 64KB block 
> size is ~10x slower for random seeks than a 4KB block size, but block sizes 
> as small as 4KB cause problems elsewhere.  Using block sizes of 256KB or 1MB 
> or more may be more efficient from a disk access and block-cache perspective 
> in many big-data applications, but doing so is infeasible from a random seek 
> perspective.
> The PrefixTrie block encoding format attempts to solve both of these 
> problems.  Some features:
> * trie format for row key encoding completely eliminates duplicate row keys 
> and encodes similar row keys into a standard trie structure which also saves 
> a lot of space
> * the column family is currently stored once at the beginning of each block.  
> this could easily be modified to allow multiple family names per block
> * all qualifiers in the block are stored in their own trie format which 
> caters nicely to wide rows.  duplicate qualifers between rows are eliminated. 
>  the size of this trie determines the width of the block's qualifier 
> fixed-width-int
> * the minimum timestamp is stored at the beginning of the block, and deltas 
> are calculated from that.  the maximum delta determines the width of the 
> block's timestamp fixed-width-int
> The block is structured with metadata at the beginning, then a section for 
> the row trie, then the column trie, then the timestamp deltas, and then 
> all the values.  Most work is done in the row trie, where every leaf node 
> (corresponding to a row) contains a list of offsets/references corresponding 
> to the cells in that row.  Each cell is fixed-width to enable binary 
> searching and is represented by [1 byte operationType, X bytes qualifier 
> offset, X bytes timestamp delta offset].
> If all operation types are the same for a block, there will be zero per-cell 
> overhead.  Same for timestamps.  Same for qualifiers when I get a chance.  
> So, the compression aspect is very strong, but makes a few small sacrifices 
> on VarInt size to enable faster binary searches in trie fan-out nodes.
> A more compressed but slower version might build on this by also applying 
> further (suffix, etc) compression on the trie nodes at the cost of slower 
> write speed.  Even further compression could be obtained by using all VInts 
> instead of FInts with a sacrifice on random seek speed (though not huge).
> One current drawback is the current write speed.  While programmed with good 
> constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not 
> programmed with the same level of optimization as the read path.  Work will 
> need to be done to optimize the data structures used for encoding and could 
> probably show a 10x increase.  It will still be slower than delta encoding, 
> but with a much higher decode speed.  I have not yet created a thorough 
> benchmark for write speed nor sequential read speed.
> Though the trie is reaching a point where it is internally very efficient 
> (probably within half or a quarter of its max read speed) the way that hbase 
> currently uses it is far from optimal.  The KeyValueScanner and related 
> classes that iterate through the trie will eventually need to be smarter and 
> have methods to do things like skipping to the next row of results without 
> scanning every cell in between.  When that is accomplished it will also allow 
> much faster compactions because the full row key will not have to be compared 
> as often as it is now.
> Current code is on github.  The trie code is in a separate project than the 
> slightly modified hbase.
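The row-key savings the trie format targets come from shared prefixes among sorted keys. A toy front-coding sketch (not the PrefixTrie format itself, which is a full trie with fixed-width offsets, and with hypothetical key names) shows the underlying idea:

```python
def front_code(sorted_keys):
    """Encode each key as (shared-prefix length, remaining suffix) relative
    to the previous key; sorted row keys share long prefixes, so the
    stored suffixes stay short."""
    out, prev = [], b""
    for key in sorted_keys:
        shared = 0
        limit = min(len(prev), len(key))
        while shared < limit and prev[shared] == key[shared]:
            shared += 1
        out.append((shared, key[shared:]))
        prev = key
    return out

keys = [b"user123/profile", b"user123/settings", b"user124/profile"]
print(front_code(keys))
# [(0, b'user123/profile'), (8, b'settings'), (6, b'4/profile')]
```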

[jira] [Updated] (HBASE-6000) Cleanup where we keep .proto files

2012-08-21 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-6000:
-

Status: Open  (was: Patch Available)

Patch no longer applicable.  Need a new patch to move the stuff at 
src/main/resources to src/main/protobuf:

{code}
./hbase-server/src/main/protobuf/Admin.proto
./hbase-server/src/main/protobuf/Client.proto
./hbase-server/src/main/protobuf/ClusterId.proto
./hbase-server/src/main/protobuf/ClusterStatus.proto
./hbase-server/src/main/protobuf/Filter.proto
./hbase-server/src/main/protobuf/FS.proto
./hbase-server/src/main/protobuf/hbase.proto
./hbase-server/src/main/protobuf/Master.proto
./hbase-server/src/main/protobuf/MasterAdmin.proto
./hbase-server/src/main/protobuf/MasterMonitor.proto
./hbase-server/src/main/protobuf/RegionServerStatus.proto
./hbase-server/src/main/protobuf/RPC.proto
./hbase-server/src/main/protobuf/ZooKeeper.proto
./hbase-server/src/main/resources/org/apache/hadoop/hbase/rest/protobuf/CellMessage.proto
./hbase-server/src/main/resources/org/apache/hadoop/hbase/rest/protobuf/CellSetMessage.proto
./hbase-server/src/main/resources/org/apache/hadoop/hbase/rest/protobuf/ColumnSchemaMessage.proto
./hbase-server/src/main/resources/org/apache/hadoop/hbase/rest/protobuf/ScannerMessage.proto
./hbase-server/src/main/resources/org/apache/hadoop/hbase/rest/protobuf/StorageClusterStatusMessage.proto
./hbase-server/src/main/resources/org/apache/hadoop/hbase/rest/protobuf/TableInfoMessage.proto
./hbase-server/src/main/resources/org/apache/hadoop/hbase/rest/protobuf/TableListMessage.proto
./hbase-server/src/main/resources/org/apache/hadoop/hbase/rest/protobuf/TableSchemaMessage.proto
./hbase-server/src/main/resources/org/apache/hadoop/hbase/rest/protobuf/VersionMessage.proto
./hbase-server/src/test/protobuf/test.proto
./hbase-server/src/test/protobuf/test_delayed_rpc.proto
./hbase-server/src/test/protobuf/test_rpc_service.proto
{code}

> Cleanup where we keep .proto files
> --
>
> Key: HBASE-6000
> URL: https://issues.apache.org/jira/browse/HBASE-6000
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.96.0
>Reporter: stack
>Assignee: stack
>  Labels: noob
> Attachments: 6000.txt, 6000.txt
>
>
> I see Andrew for his pb work over in rest has .protos files under 
> src/main/resources.  We should unify where these files live.  The recently 
> added .protos place them under src/main/protobuf.  It's confusing.
> The Thrift IDL files are here under resources too.
> Seems like we should move src/main/protobuf under src/resources to be 
> consistent.





[jira] [Updated] (HBASE-6000) Cleanup where we keep .proto files

2012-08-21 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-6000:
-

  Tags: noob
Labels: noob  (was: )

> Cleanup where we keep .proto files
> --
>
> Key: HBASE-6000
> URL: https://issues.apache.org/jira/browse/HBASE-6000
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.96.0
>Reporter: stack
>Assignee: stack
>  Labels: noob
> Attachments: 6000.txt, 6000.txt
>
>
> I see Andrew for his pb work over in rest has .protos files under 
> src/main/resources.  We should unify where these files live.  The recently 
> added .protos place them under src/main/protobuf.  It's confusing.
> The Thrift IDL files are here under resources too.
> Seems like we should move src/main/protobuf under src/resources to be 
> consistent.





[jira] [Updated] (HBASE-6189) Snappy build instructions are out of date

2012-08-21 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-6189:
-

 Priority: Critical  (was: Major)
Affects Version/s: 0.96.0

> Snappy build instructions are out of date
> -
>
> Key: HBASE-6189
> URL: https://issues.apache.org/jira/browse/HBASE-6189
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.1, 0.94.0, 0.96.0
>Reporter: Dave Revell
>Priority: Critical
>
> In the ref guide (http://hbase.apache.org/book.html#build.snappy), it says to 
> build snappy by passing -Dsnappy. Something's wrong here, because:
> 1. this causes the build to fail because the hadoop-snappy tar artifact can't 
> be found by maven
> 2. the snappy classes are already included in hadoop 1.0, so using the snappy 
> profile is unnecessary
> It would be great if someone who knows when/why to use the snappy profile 
> could fix the instructions (and fix the POM if necessary).





[jira] [Commented] (HBASE-6603) RegionMetricsStorage.incrNumericMetric is called too often

2012-08-21 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439054#comment-13439054
 ] 

Lars Hofhansl commented on HBASE-6603:
--

@Stack: For sure, just wanted to gauge what the general opinion is.


> RegionMetricsStorage.incrNumericMetric is called too often
> --
>
> Key: HBASE-6603
> URL: https://issues.apache.org/jira/browse/HBASE-6603
> Project: HBase
>  Issue Type: Bug
>  Components: performance
>Reporter: Lars Hofhansl
>Assignee: M. Chen
> Fix For: 0.96.0, 0.94.2
>
> Attachments: 6503-0.96.txt, 6603-0.94.txt
>
>
> Running an HBase scan load through the profiler revealed that 
> RegionMetricsStorage.incrNumericMetric is called way too often.
> It turns out that we make this call for *each* KV in StoreScanner.next(...).
> Incrementing AtomicLong requires expensive memory barriers.
> The observation here is that StoreScanner.next(...) can maintain a simple 
> long in its internal loop and only update the metric upon exit. Thus the 
> AtomicLong is not updated nearly as often.
> That cuts about 10% runtime from scan only load (I'll quantify this better 
> soon).
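The fix described — accumulate in a cheap local inside the hot loop and publish to the shared counter once on exit — can be sketched like this (a Python stand-in with made-up names; the real code uses Java's AtomicLong inside StoreScanner.next):

```python
from threading import Lock

class Counter:
    """Stand-in for an AtomicLong-backed metric: every add() pays for
    synchronization, which is what makes per-KeyValue increments costly."""
    def __init__(self):
        self._lock = Lock()
        self.value = 0

    def add(self, n):
        with self._lock:
            self.value += n

metric = Counter()

def scan_next(kvs):
    local = 0                 # cheap plain counter inside the hot loop
    for _kv in kvs:
        local += 1            # one bump per KV, no memory barrier
    metric.add(local)         # touch the shared counter exactly once on exit

scan_next(range(1000))
print(metric.value)  # 1000
```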





[jira] [Updated] (HBASE-6524) Hooks for hbase tracing

2012-08-21 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-6524:
-

   Resolution: Fixed
Fix Version/s: 0.96.0
 Release Note: Adds hooks so can track rpcs using htrace
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

Committed to trunk.  Thanks for the patch Jonathan.

Would suggest you file an issue on the protobuf toString issue.  If you add a 
line or two for the refguide here, I'll commit that too.  Thanks boss.

> Hooks for hbase tracing
> ---
>
> Key: HBASE-6524
> URL: https://issues.apache.org/jira/browse/HBASE-6524
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Jonathan Leavitt
> Fix For: 0.96.0
>
> Attachments: createTableTrace.png, hbase-6524.diff
>
>
> Includes the hooks that use [htrace|http://www.github.com/cloudera/htrace] 
> library to add dapper-like tracing to hbase.





[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439080#comment-13439080
 ] 

stack commented on HBASE-6364:
--

20 is the current default?  If so, we can leave it.  Should we add a note to 
perf section of refguide on lowering connection timeout?

That'd be grand if you'd commit to 0.94.

> Powering down the server host holding the .META. table causes HBase Client to 
> take excessively long to recover and connect to reassigned .META. table
> -
>
> Key: HBASE-6364
> URL: https://issues.apache.org/jira/browse/HBASE-6364
> Project: HBase
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.90.6, 0.92.1, 0.94.0
>Reporter: Suraj Varma
>Assignee: nkeywal
>  Labels: client
> Fix For: 0.96.0
>
> Attachments: 6364-host-serving-META.v1.patch, 
> 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 
> 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 
> 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 
> 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt
>
>
> When a server host with a Region Server holding the .META. table is powered 
> down on a live cluster, while the HBase cluster itself detects and reassigns 
> the .META. table, connected HBase Clients take an excessively long time to 
> detect this and re-discover the reassigned .META. 
> Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
> value (default is 20s leading to 35 minute recovery time; we were able to get 
> acceptable results with 100ms getting a 3 minute recovery) 
> This was found during some hardware failure testing scenarios. 
> Test Case:
> 1) Apply load via client app on HBase cluster for several minutes
> 2) Power down the region server holding the .META. server (i.e. power off ... 
> and keep it off)
> 3) Measure how long it takes for cluster to reassign META table and for 
> client threads to re-lookup and re-orient to the lesser cluster (minus the RS 
> and DN on that host).
> Observation:
> 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
> recover (i.e. for the thread count to go back to normal) - no client calls 
> are serviced - they just back up on a synchronized method (see #2 below)
> 2) All the client app threads queue up behind the 
> oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
> After taking several thread dumps we found that the thread within this 
> synchronized method was blocked on  NetUtils.connect(this.socket, 
> remoteId.getAddress(), getSocketTimeout(conf));
> The client thread that gets the synchronized lock would try to connect to the 
> dead RS (till socket times out after 20s), retries, and then the next thread 
> gets in and so forth in a serial manner.
> Workaround:
> ---
> Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
> (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
> the client threads recovered in a couple of minutes by failing fast and 
> re-discovering the .META. table on a reassigned RS.
> Assumption: This ipc.socket.timeout is only ever used during the initial 
> "HConnection" setup via the NetUtils.connect and should only ever be used 
> when connectivity to a region server is lost and needs to be re-established. 
> i.e. it does not affect the normal "RPC" activity as this is just the connect 
> timeout.
> During RS GC periods, any _new_ clients trying to connect will fail and will 
> require .META. table re-lookups.
> This above timeout workaround is only for the HBase client side.





[jira] [Commented] (HBASE-6364) Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table

2012-08-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439082#comment-13439082
 ] 

nkeywal commented on HBASE-6364:


For connect timeout, yes, it's 20s.  We could add a note as well; I will create 
a jira for this, and commit in 0.94.

> Powering down the server host holding the .META. table causes HBase Client to 
> take excessively long to recover and connect to reassigned .META. table
> -
>
> Key: HBASE-6364
> URL: https://issues.apache.org/jira/browse/HBASE-6364
> Project: HBase
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.90.6, 0.92.1, 0.94.0
>Reporter: Suraj Varma
>Assignee: nkeywal
>  Labels: client
> Fix For: 0.96.0
>
> Attachments: 6364-host-serving-META.v1.patch, 
> 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 
> 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 
> 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 
> 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt
>
>
> When a server host with a Region Server holding the .META. table is powered 
> down on a live cluster, while the HBase cluster itself detects and reassigns 
> the .META. table, connected HBase Clients take an excessively long time to 
> detect this and re-discover the reassigned .META. 
> Workaround: Decrease the ipc.socket.timeout on HBase Client side to a low  
> value (default is 20s leading to 35 minute recovery time; we were able to get 
> acceptable results with 100ms getting a 3 minute recovery) 
> This was found during some hardware failure testing scenarios. 
> Test Case:
> 1) Apply load via client app on HBase cluster for several minutes
> 2) Power down the region server holding the .META. server (i.e. power off ... 
> and keep it off)
> 3) Measure how long it takes for the cluster to reassign the META table and for 
> client threads to re-look it up and re-orient to the reduced cluster (minus the RS 
> and DN on that host).
> Observation:
> 1) Client threads spike up to maxThreads size ... and take over 35 mins to 
> recover (i.e. for the thread count to go back to normal) - no client calls 
> are serviced - they just back up on a synchronized method (see #2 below)
> 2) All the client app threads queue up behind the 
> oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
> After taking several thread dumps we found that the thread within this 
> synchronized method was blocked on  NetUtils.connect(this.socket, 
> remoteId.getAddress(), getSocketTimeout(conf));
> The client thread that gets the synchronized lock would try to connect to the 
> dead RS (till socket times out after 20s), retries, and then the next thread 
> gets in and so forth in a serial manner.
> Workaround:
> ---
> Default ipc.socket.timeout is set to 20s. We dropped this to a low number 
> (1000 ms,  100 ms, etc) on the client side hbase-site.xml. With this setting, 
> the client threads recovered in a couple of minutes by failing fast and 
> re-discovering the .META. table on a reassigned RS.
> Assumption: This ipc.socket.timeout is only ever used during the initial 
> "HConnection" setup via NetUtils.connect and should only ever be used 
> when connectivity to a region server is lost and needs to be re-established. 
> i.e. it does not affect normal "RPC" activity, as this is just the connect 
> timeout.
> During RS GC periods, any _new_ clients trying to connect will fail and will 
> require .META. table re-lookups.
> This above timeout workaround is only for the HBase client side.
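The workaround described above amounts to a client-side override in hbase-site.xml. A minimal sketch (the 100 ms value is illustrative; the issue reports acceptable results in that range):

```xml
<!-- Client-side hbase-site.xml: lower the connect timeout so threads
     fail fast on a dead host instead of blocking for the default 20s.
     The value below is illustrative, per the workaround in this issue. -->
<property>
  <name>ipc.socket.timeout</name>
  <value>100</value>
</property>
```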





[jira] [Created] (HBASE-6629) [0.89-fb] Fast fail client operations if the regionserver is repeatedly unreachable.

2012-08-21 Thread Amitanand Aiyer (JIRA)
Amitanand Aiyer created HBASE-6629:
--

 Summary: [0.89-fb] Fast fail client operations if the regionserver 
is repeatedly unreachable.
 Key: HBASE-6629
 URL: https://issues.apache.org/jira/browse/HBASE-6629
 Project: HBase
  Issue Type: Bug
Reporter: Amitanand Aiyer
Priority: Minor


We have seen occasional RSW reboots in the production cluster. 

On the client end, the reduction in operation throughput is much larger than the 
percentage of nodes disconnected. This is because operations to the disconnected 
machines hold up resources that could otherwise be used for successful 
operations.

This change enables the client to detect when there are repeated failures to a 
regionserver, and fast fail operations so we do not hold up resources.
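A minimal sketch of the fast-fail idea, assuming a simple per-server consecutive-failure threshold (the class name and threshold are illustrative, not the 0.89-fb implementation):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: track consecutive connect failures per region server
// and short-circuit new operations once a threshold is crossed, so client
// threads are not tied up retrying an unreachable host.
public class FastFailTracker {
    private static final int FAILURE_THRESHOLD = 3; // illustrative value
    private final Map<String, AtomicInteger> failures = new ConcurrentHashMap<>();

    /** Record a failed connect attempt against a server. */
    public void recordFailure(String server) {
        failures.computeIfAbsent(server, s -> new AtomicInteger()).incrementAndGet();
    }

    /** Record a success, clearing the server's failure history. */
    public void recordSuccess(String server) {
        failures.remove(server);
    }

    /** True if operations to this server should fail fast instead of blocking. */
    public boolean shouldFastFail(String server) {
        AtomicInteger count = failures.get(server);
        return count != null && count.get() >= FAILURE_THRESHOLD;
    }
}
```

A real implementation would also need a time-based reset (so a recovered server is retried eventually); this sketch only shows the counting side.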





[jira] [Commented] (HBASE-6621) Reduce calls to Bytes.toInt

2012-08-21 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439085#comment-13439085
 ] 

Lars Hofhansl commented on HBASE-6621:
--

Here's another observation. In ScanQueryMatcher.match we have this:
{code}
byte [] bytes = kv.getBuffer();
int offset = kv.getOffset();
int initialOffset = offset;

int keyLength = Bytes.toInt(bytes, offset, Bytes.SIZEOF_INT);
{code}

At this point the passed kv already has its keyLength cached, so we can use 
{code}int keyLength = kv.getKeyLength();{code} instead and save a few more cycles.
This has a measurable effect with many columns (~3%).

A simple one-line change. Any opposition to doing this here as well, or should I 
open a new issue?
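To make the caching point concrete, here is a generic stand-in (not the actual HBase KeyValue class; the names are illustrative): re-decoding the length from the backing buffer on every call versus returning a field cached at construction.

```java
import java.nio.ByteBuffer;

// Simplified stand-in for the pattern above: a cell whose key length is
// decoded once at construction. Returning the cached field is the cheap
// path (the kv.getKeyLength() analogue); re-decoding from the buffer is
// the Bytes.toInt analogue this change avoids.
public class CachedLengthCell {
    private final byte[] buffer;
    private final int offset;
    private final int keyLength; // cached at construction time

    public CachedLengthCell(byte[] buffer, int offset) {
        this.buffer = buffer;
        this.offset = offset;
        // Decode once, as the read path would already have done.
        this.keyLength = ByteBuffer.wrap(buffer, offset, 4).getInt();
    }

    /** Cheap: returns the cached value. */
    public int getKeyLength() {
        return keyLength;
    }

    /** Expensive path: re-decodes the length from the buffer each call. */
    public int decodeKeyLength() {
        return ByteBuffer.wrap(buffer, offset, 4).getInt();
    }
}
```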

> Reduce calls to Bytes.toInt
> ---
>
> Key: HBASE-6621
> URL: https://issues.apache.org/jira/browse/HBASE-6621
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Minor
> Fix For: 0.96.0, 0.94.2
>
> Attachments: 6621-0.96.txt, 6621-0.96-v2.txt, 6621-0.96-v3.txt, 
> 6621-0.96-v4.txt
>
>
> Bytes.toInt shows up quite often in a profiler run.
> It turns out that one source is HFileReaderV2$ScannerV2.getKeyValue().
> Notice that we call the KeyValue(byte[], int) constructor, which forces the 
> constructor to determine its size by reading some of the header information 
> and calculate the size. In this case, however, we already know the size (from 
> the call to readKeyValueLen), so we could just use that.
> In the extreme case of 1000's of columns this noticeably reduces CPU. 





[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-08-21 Thread Matt Corgan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439090#comment-13439090
 ] 

Matt Corgan commented on HBASE-4676:


I've barely been able to touch it this whole summer, but I spent a little time in 
July rebasing to trunk and reorganizing for readability.  It's on github on a 
branch called "prefix-tree" (I renamed it from "trie" since people wince when 
pronouncing it).  https://github.com/hotpads/hbase/tree/prefix-tree

What's up there is pretty good and workable, but I still have some things to do 
and some questions to ask about things like comparators, sequenceIds, and ROOT and 
META handling.  It also needs more documentation.  I could use some high-level 
feedback if you can get anything out of it without a lot of docs/comments.  
Maybe look at the things in the hbase-common module to start, since those will 
affect overall hbase?  The prefix-trie module is more implementation detail 
that only matters when you enable this specific encoding.

btw - the current version has an "HCell" interface which I will rename to Cell(?) 
after seeing the recent discussion about HStore naming.

Happy to post on reviewboard as well, but it might have to wait till the weekend.

> Prefix Compression - Trie data block encoding
> -
>
> Key: HBASE-4676
> URL: https://issues.apache.org/jira/browse/HBASE-4676
> Project: HBase
>  Issue Type: New Feature
>  Components: io, performance, regionserver
>Affects Versions: 0.90.6
>Reporter: Matt Corgan
>Assignee: Matt Corgan
> Attachments: HBASE-4676-0.94-v1.patch, hbase-prefix-trie-0.1.jar, 
> PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by 
> blockSize.png
>
>
> The HBase data block format has room for 2 significant improvements for 
> applications that have high block cache hit ratios.  
> First, there is no prefix compression, and the current KeyValue format is 
> somewhat metadata heavy, so there can be tremendous memory bloat for many 
> common data layouts, specifically those with long keys and short values.
> Second, there is no random access to KeyValues inside data blocks.  This 
> means that every time you double the datablock size, average seek time (or 
> average cpu consumption) goes up by a factor of 2.  The standard 64KB block 
> size is ~10x slower for random seeks than a 4KB block size, but block sizes 
> as small as 4KB cause problems elsewhere.  Using block sizes of 256KB or 1MB 
> or more may be more efficient from a disk access and block-cache perspective 
> in many big-data applications, but doing so is infeasible from a random seek 
> perspective.
> The PrefixTrie block encoding format attempts to solve both of these 
> problems.  Some features:
> * trie format for row key encoding completely eliminates duplicate row keys 
> and encodes similar row keys into a standard trie structure which also saves 
> a lot of space
> * the column family is currently stored once at the beginning of each block.  
> this could easily be modified to allow multiple family names per block
> * all qualifiers in the block are stored in their own trie format which 
> caters nicely to wide rows.  duplicate qualifiers between rows are eliminated. 
>  the size of this trie determines the width of the block's qualifier 
> fixed-width-int
> * the minimum timestamp is stored at the beginning of the block, and deltas 
> are calculated from that.  the maximum delta determines the width of the 
> block's timestamp fixed-width-int
> The block is structured with metadata at the beginning, then a section for 
> the row trie, then the column trie, then the timestamp deltas, and then 
> all the values.  Most work is done in the row trie, where every leaf node 
> (corresponding to a row) contains a list of offsets/references corresponding 
> to the cells in that row.  Each cell is fixed-width to enable binary 
> searching and is represented by [1 byte operationType, X bytes qualifier 
> offset, X bytes timestamp delta offset].
> If all operation types are the same for a block, there will be zero per-cell 
> overhead.  Same for timestamps.  Same for qualifiers when i get a chance.  
> So, the compression aspect is very strong, but makes a few small sacrifices 
> on VarInt size to enable faster binary searches in trie fan-out nodes.
> A more compressed but slower version might build on this by also applying 
> further (suffix, etc) compression on the trie nodes at the cost of slower 
> write speed.  Even further compression could be obtained by using all VInts 
> instead of FInts with a sacrifice on random seek speed (though not huge).
> One current drawback is the current write speed.  While programmed with good 
> constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not 
> programmed with the sam
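The row-key prefix sharing described above can be sketched with a minimal trie (this is a toy illustration of why shared prefixes are stored once, not the actual prefix-tree encoder):

```java
import java.util.TreeMap;

// Toy trie: each distinct character of each row key becomes at most one
// node, so a prefix shared by many row keys is stored exactly once.
// Class and method names are illustrative, not HBase API.
public class PrefixTrie {
    private final TreeMap<Character, PrefixTrie> children = new TreeMap<>();
    private boolean isRow; // marks the end of a complete row key

    public void add(String rowKey) {
        PrefixTrie node = this;
        for (char c : rowKey.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new PrefixTrie());
        }
        node.isRow = true;
    }

    /** Total non-root nodes; shared prefixes are counted once. */
    public int nodeCount() {
        int n = children.size();
        for (PrefixTrie child : children.values()) {
            n += child.nodeCount();
        }
        return n;
    }
}
```

For example, "row-aaa" and "row-aab" total 14 characters flat, but only 8 trie nodes, since the 6-character prefix "row-aa" is stored once; the real encoding layers fixed-width offsets and value references on top of this idea.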

[jira] [Created] (HBASE-6630) Port HBASE-6590 to trunk 0.94 : Assign sequence number to bulk loaded files

2012-08-21 Thread Amitanand Aiyer (JIRA)
Amitanand Aiyer created HBASE-6630:
--

 Summary: Port HBASE-6590 to trunk 0.94 : Assign sequence number to 
bulk loaded files
 Key: HBASE-6630
 URL: https://issues.apache.org/jira/browse/HBASE-6630
 Project: HBase
  Issue Type: Sub-task
Reporter: Amitanand Aiyer
Assignee: Amitanand Aiyer
Priority: Minor







