[jira] [Commented] (HBASE-3562) ValueFilter is being evaluated before performing the column match

2011-03-25 Thread Evert Arckens (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011175#comment-13011175
 ] 

Evert Arckens commented on HBASE-3562:
--

In ScanQueryMatcher.match I would do the columns.checkColumn call first and 
only execute the filters if that returns MatchCode.INCLUDE. I think this would 
be more efficient as well, since deciding whether to skip a column will 
usually be faster than evaluating one or more filters.

However, the code explicitly states:
/**
 * Filters should be checked before checking column trackers. If we do
 * otherwise, as was previously being done, ColumnTracker may increment its
 * counter for even that KV which may be discarded later on by Filter. This
 * would lead to incorrect results in certain cases.
 */

It is not completely clear to me what the exact purpose of the counter on the 
ColumnTracker is, or what the problem would be if it were incremented.
Maybe explicitly calling ((ExplicitColumnTracker)columns).doneWithColumn (as is 
done in getNextRowOrNextColumn) when a filter skips a column could help here?
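Roughly, the reordering I have in mind would look something like this (just a 
sketch with a simplified signature, not the actual ScanQueryMatcher code):

{code}
// Sketch only: check the column tracker first, run filters only on INCLUDE.
MatchCode matchSketch(KeyValue kv) {
  MatchCode colCheck = this.columns.checkColumn(kv.getBuffer(),
      kv.getQualifierOffset(), kv.getQualifierLength());
  if (colCheck != MatchCode.INCLUDE) {
    // SKIP, SEEK_NEXT_COL, SEEK_NEXT_ROW, ... -- no filter work needed.
    return colCheck;
  }
  if (filter != null) {
    Filter.ReturnCode filterResponse = filter.filterKeyValue(kv);
    if (filterResponse != Filter.ReturnCode.INCLUDE) {
      // The filter rejects a KV the tracker already counted; this is where
      // doneWithColumn (or some "undo") would have to come in.
      return MatchCode.SKIP;
    }
  }
  return MatchCode.INCLUDE;
}
{code}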

> ValueFilter is being evaluated before performing the column match
> -
>
> Key: HBASE-3562
> URL: https://issues.apache.org/jira/browse/HBASE-3562
> Project: HBase
>  Issue Type: Bug
>  Components: filters
>Affects Versions: 0.90.0
>Reporter: Evert Arckens
>
> When performing a Get operation where both a column and a ValueFilter are 
> specified, the ValueFilter is evaluated before the column match is made, 
> contrary to what is indicated in the javadoc of Get.setFilter(): "{@link 
> Filter#filterKeyValue(KeyValue)} is called AFTER all tests for ttl, column 
> match, deletes and max versions have been run."
> This is shown in the little test below, which uses a TestComparator extending 
> WritableByteArrayComparable.
> public void testFilter() throws Exception {
>   byte[] cf = Bytes.toBytes("cf");
>   byte[] row = Bytes.toBytes("row");
>   byte[] col1 = Bytes.toBytes("col1");
>   byte[] col2 = Bytes.toBytes("col2");
>   Put put = new Put(row);
>   put.add(cf, col1, new byte[]{(byte)1});
>   put.add(cf, col2, new byte[]{(byte)2});
>   table.put(put);
>   Get get = new Get(row);
>   get.addColumn(cf, col2); // We only want to retrieve col2
>   TestComparator testComparator = new TestComparator();
>   Filter filter = new ValueFilter(CompareOp.EQUAL, testComparator);
>   get.setFilter(filter);
>   Result result = table.get(get);
> }
> public class TestComparator extends WritableByteArrayComparable {
> /**
>  * Nullary constructor, for Writable
>  */
> public TestComparator() {
> super();
> }
> 
> @Override
> public int compareTo(byte[] theirValue) {
>     if (theirValue[0] == (byte)1) {
>         // If the column match was done before evaluating the filter,
>         // we should never get here.
>         throw new RuntimeException(
>             "I only expect (byte)2 in col2, not (byte)1 from col1");
>     }
>     if (theirValue[0] == (byte)2) {
>         return 0;
>     }
>     return 1;
> }
> }
> When only one column should be retrieved, this can be worked around by using 
> a SingleColumnValueFilter instead of the ValueFilter.



[jira] [Commented] (HBASE-3686) Scanner timeout on RegionServer but Client won't know what happened

2011-03-25 Thread Sean Sechrist (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011203#comment-13011203
 ] 

Sean Sechrist commented on HBASE-3686:
--

I did a little more testing and it turns out this problem isn't limited to the 
misconfiguration.

You'll also lose rows if you kill -9 a region server in the middle of a scan. In 
HTable.ClientScanner.next(), there's a skipFirst boolean that is supposed to 
skip the first row that was "already let out on a previous invocation". But 
instead of just skipping the first row, 
getConnection().getRegionServerWithRetries(callable) is called an extra time, 
which will skip [caching] rows.

So I think fixing it to skip only one row will also fix the problem when there 
is a misconfiguration, so sending the timeout to the server won't be needed.
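Conceptually, the recovery path should do something like the following (just a 
sketch of the idea, not the actual ClientScanner code; the saved-caching field 
is assumed):

{code}
// Sketch: when skipFirst is set, re-fetch only the single row that was
// already returned, instead of letting the extra call pull [caching] rows.
if (skipFirst) {
  int savedCaching = this.caching;           // assumed field
  callable.setCaching(1);                    // fetch exactly one row...
  getConnection().getRegionServerWithRetries(callable);  // ...and drop it
  callable.setCaching(savedCaching);         // restore the configured caching
  skipFirst = false;
}
{code}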

> Scanner timeout on RegionServer but Client won't know what happened
> ---
>
> Key: HBASE-3686
> URL: https://issues.apache.org/jira/browse/HBASE-3686
> Project: HBase
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.89.20100924
>Reporter: Sean Sechrist
>Priority: Minor
>
> This can cause rows to be lost from a scan.
> See this thread where the issue was brought up: 
> http://search-hadoop.com/m/xITBQ136xGJ1
> If hbase.regionserver.lease.period is higher on the client than the server we 
> can get this series of events: 
> 1. Client is scanning along happily, and does something slow.
> 2. Scanner times out on region server
> 3. Client calls HTable.ClientScanner.next()
> 4. The region server throws an UnknownScannerException
> 5. Client catches the exception and sees that it has not been longer than its 
> hbase.regionserver.lease.period config, so it doesn't throw a 
> ScannerTimeoutException. Instead, it treats it like an NSRE.
> Right now the workaround is to make sure the configs are consistent. 
> A possible fix would be to use whatever the region server's scanner timeout 
> is, rather than the local one.



[jira] [Commented] (HBASE-3562) ValueFilter is being evaluated before performing the column match

2011-03-25 Thread Jonathan Gray (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011248#comment-13011248
 ] 

Jonathan Gray commented on HBASE-3562:
--

The counter in ColumnTracker is responsible for tracking setMaxVersions.  You 
may have queried for only the latest version, so once the ColumnTracker sees a 
given column, it will reject subsequent versions of that column.  Currently 
there's no way for the CT to know that a subsequent filter actually prevented 
the KeyValue from being returned, so it should not have been included in the 
count of returned versions.

We would need to introduce something like {{skippedPreviousKeyValue}} that 
could be sent back to the CT so it could undo the previous count.
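Something along these lines, purely as a sketch of the idea (the method and 
field names here are made up):

{code}
// Hypothetical hook on ColumnTracker: the matcher would call this when a
// filter rejects the KeyValue that the tracker just counted, so the version
// count for the current column gets rolled back.
public void skippedPreviousKeyValue() {
  if (this.currentCount > 0) {
    this.currentCount--;
  }
}
{code}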

> ValueFilter is being evaluated before performing the column match
> -
>
> Key: HBASE-3562
> URL: https://issues.apache.org/jira/browse/HBASE-3562
> Project: HBase
>  Issue Type: Bug
>  Components: filters
>Affects Versions: 0.90.0
>Reporter: Evert Arckens
>
> When performing a Get operation where both a column and a ValueFilter are 
> specified, the ValueFilter is evaluated before the column match is made, 
> contrary to what is indicated in the javadoc of Get.setFilter(): "{@link 
> Filter#filterKeyValue(KeyValue)} is called AFTER all tests for ttl, column 
> match, deletes and max versions have been run."
> This is shown in the little test below, which uses a TestComparator extending 
> WritableByteArrayComparable.
> public void testFilter() throws Exception {
>   byte[] cf = Bytes.toBytes("cf");
>   byte[] row = Bytes.toBytes("row");
>   byte[] col1 = Bytes.toBytes("col1");
>   byte[] col2 = Bytes.toBytes("col2");
>   Put put = new Put(row);
>   put.add(cf, col1, new byte[]{(byte)1});
>   put.add(cf, col2, new byte[]{(byte)2});
>   table.put(put);
>   Get get = new Get(row);
>   get.addColumn(cf, col2); // We only want to retrieve col2
>   TestComparator testComparator = new TestComparator();
>   Filter filter = new ValueFilter(CompareOp.EQUAL, testComparator);
>   get.setFilter(filter);
>   Result result = table.get(get);
> }
> public class TestComparator extends WritableByteArrayComparable {
> /**
>  * Nullary constructor, for Writable
>  */
> public TestComparator() {
> super();
> }
> 
> @Override
> public int compareTo(byte[] theirValue) {
>     if (theirValue[0] == (byte)1) {
>         // If the column match was done before evaluating the filter,
>         // we should never get here.
>         throw new RuntimeException(
>             "I only expect (byte)2 in col2, not (byte)1 from col1");
>     }
>     if (theirValue[0] == (byte)2) {
>         return 0;
>     }
>     return 1;
> }
> }
> When only one column should be retrieved, this can be worked around by using 
> a SingleColumnValueFilter instead of the ValueFilter.



[jira] [Commented] (HBASE-3654) Weird blocking between getOnlineRegion and createRegionLoad

2011-03-25 Thread Subbu M Iyer (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011261#comment-13011261
 ] 

Subbu M Iyer commented on HBASE-3654:
-

1. Added the null check for createRegionLoad.
2. Pre-sized the ArrayList to onlineRegions.size() during the getOnlineRegions 
call to minimize the locking at java.util.Arrays.copyOf(Arrays.java:2734).
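For reference, the pre-sizing amounts to something like this (sketch only; the 
exact code in the patch may differ):

{code}
// Size the list up front so it is not repeatedly regrown (each regrow goes
// through Arrays.copyOf) while we hold the onlineRegions lock.
List<HRegionInfo> list = new ArrayList<HRegionInfo>(this.onlineRegions.size());
synchronized (this.onlineRegions) {
  for (HRegion region : this.onlineRegions.values()) {
    list.add(region.getRegionInfo());
  }
}
{code}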


> Weird blocking between getOnlineRegion and createRegionLoad
> ---
>
> Key: HBASE-3654
> URL: https://issues.apache.org/jira/browse/HBASE-3654
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.1
>Reporter: Jean-Daniel Cryans
>Assignee: Subbu M Iyer
>Priority: Blocker
> Fix For: 0.90.2
>
> Attachments: ConcurrentHM, ConcurrentSKLM, CopyOnWrite, 
> HBASE-3654-ConcurrentHashMap-RemoveGetSync.patch, 
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_COWAL.patch,
>  
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_COWAL1.patch,
>  
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_ConcurrentHM.patch,
>  
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_ConcurrentHM1.patch,
>  TestOnlineRegions.java, hashmap
>
>
> Saw this when debugging something else:
> {code}
> "regionserver60020" prio=10 tid=0x7f538c1c nid=0x4c7 runnable 
> [0x7f53931da000]
>java.lang.Thread.State: RUNNABLE
>   at 
> org.apache.hadoop.hbase.regionserver.Store.getStorefilesIndexSize(Store.java:1380)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.createRegionLoad(HRegionServer.java:916)
>   - locked <0x000672aa0a00> (a 
> java.util.concurrent.ConcurrentSkipListMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.buildServerLoad(HRegionServer.java:767)
>   - locked <0x000656f62710> (a java.util.HashMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:722)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:591)
>   at java.lang.Thread.run(Thread.java:662)
> "IPC Reader 9 on port 60020" prio=10 tid=0x7f538c1be000 nid=0x4c6 waiting 
> for monitor entry [0x7f53932db000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getFromOnlineRegions(HRegionServer.java:2295)
>   - waiting to lock <0x000656f62710> (a java.util.HashMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getOnlineRegion(HRegionServer.java:2307)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2333)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.isMetaRegion(HRegionServer.java:379)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.apply(HRegionServer.java:422)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.apply(HRegionServer.java:361)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer.getQosLevel(HBaseServer.java:1126)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:982)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:946)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
>   - locked <0x000656e60068> (a 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>   at java.lang.Thread.run(Thread.java:662)
> ...
> "IPC Reader 0 on port 60020" prio=10 tid=0x7f538c08b000 nid=0x4bd waiting 
> for monitor entry [0x7f5393be4000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getFromOnlineRegions(HRegionServer.java:2295)
>   - waiting to lock <0x000656f62710> (a java.util.HashMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getOnlineRegion(HRegionServer.java:2307)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2333)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.isMetaRegion(HRegionServer.java:379)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.apply(HRegionServer.java:422)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.apply(HRegionServer.java:361)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer.getQosLe

[jira] [Updated] (HBASE-3654) Weird blocking between getOnlineRegion and createRegionLoad

2011-03-25 Thread Subbu M Iyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subbu M Iyer updated HBASE-3654:


Attachment: 
HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_ConcurrentHM1.patch

> Weird blocking between getOnlineRegion and createRegionLoad
> ---
>
> Key: HBASE-3654
> URL: https://issues.apache.org/jira/browse/HBASE-3654
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.1
>Reporter: Jean-Daniel Cryans
>Assignee: Subbu M Iyer
>Priority: Blocker
> Fix For: 0.90.2
>
> Attachments: ConcurrentHM, ConcurrentSKLM, CopyOnWrite, 
> HBASE-3654-ConcurrentHashMap-RemoveGetSync.patch, 
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_COWAL.patch,
>  
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_COWAL1.patch,
>  
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_ConcurrentHM.patch,
>  
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_ConcurrentHM1.patch,
>  TestOnlineRegions.java, hashmap
>
>
> Saw this when debugging something else:
> {code}
> "regionserver60020" prio=10 tid=0x7f538c1c nid=0x4c7 runnable 
> [0x7f53931da000]
>java.lang.Thread.State: RUNNABLE
>   at 
> org.apache.hadoop.hbase.regionserver.Store.getStorefilesIndexSize(Store.java:1380)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.createRegionLoad(HRegionServer.java:916)
>   - locked <0x000672aa0a00> (a 
> java.util.concurrent.ConcurrentSkipListMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.buildServerLoad(HRegionServer.java:767)
>   - locked <0x000656f62710> (a java.util.HashMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:722)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:591)
>   at java.lang.Thread.run(Thread.java:662)
> "IPC Reader 9 on port 60020" prio=10 tid=0x7f538c1be000 nid=0x4c6 waiting 
> for monitor entry [0x7f53932db000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getFromOnlineRegions(HRegionServer.java:2295)
>   - waiting to lock <0x000656f62710> (a java.util.HashMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getOnlineRegion(HRegionServer.java:2307)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2333)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.isMetaRegion(HRegionServer.java:379)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.apply(HRegionServer.java:422)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.apply(HRegionServer.java:361)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer.getQosLevel(HBaseServer.java:1126)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:982)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:946)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
>   - locked <0x000656e60068> (a 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>   at java.lang.Thread.run(Thread.java:662)
> ...
> "IPC Reader 0 on port 60020" prio=10 tid=0x7f538c08b000 nid=0x4bd waiting 
> for monitor entry [0x7f5393be4000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getFromOnlineRegions(HRegionServer.java:2295)
>   - waiting to lock <0x000656f62710> (a java.util.HashMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getOnlineRegion(HRegionServer.java:2307)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2333)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.isMetaRegion(HRegionServer.java:379)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.apply(HRegionServer.java:422)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.apply(HRegionServer.java:361)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer.getQosLevel(HBaseServer.java:1126)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:982)
>   at 
> org.apache.hadoop.hbas

[jira] [Commented] (HBASE-3695) Some improvements to Hbck to test the entire region chain in Meta and provide better error reporting

2011-03-25 Thread Marc Limotte (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011262#comment-13011262
 ] 

Marc Limotte commented on HBASE-3695:
-

@Stack Yeah, that one is a dupe.  It was filed earlier, but it covers only a 
subset of what's here.

Separately, I think there should be a distinct JIRA to cover the merge tool 
issue (it doesn't work with the low, overlapping sequence numbers generated by 
the bulk load tool). 

Also (possibly another JIRA), I wonder if the ability to manually create a 
region should be exposed (through the shell or other tool).  We had to do this 
a couple of times to resolve the issues that had come up; but, on the other 
hand, if the merge tool worked in those scenarios, maybe this wouldn't have 
been necessary.


> Some improvements to Hbck to test the entire region chain in Meta and provide 
> better error reporting
> 
>
> Key: HBASE-3695
> URL: https://issues.apache.org/jira/browse/HBASE-3695
> Project: HBase
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 0.90.1
>Reporter: Marc Limotte
>Assignee: Marc Limotte
> Fix For: 0.90.3
>
>
> The current Hbck tool will miss some inconsistencies in Meta, and in other 
> cases will detect an issue, but does not provide much in the way of useful 
> feedback.  
> * Incorporate the full region chain tests (similar to check_meta.rb). I.e. 
> look for overlaps, holes and cycles. I believe check_meta.rb will be 
> redundant after this change.
> * More unit tests, and better tests that will test the actual error 
> discovered, instead of just errors true/false.
> * In the case of overlaps and holes, output both ends of the broken chain.
> * Previous implementation runs check() twice.  This is inefficient and, more 
> importantly, reports redundant errors which could be confusing to the user.



[jira] [Commented] (HBASE-3654) Weird blocking between getOnlineRegion and createRegionLoad

2011-03-25 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011275#comment-13011275
 ] 

Ted Yu commented on HBASE-3654:
---

Thanks for the continuing effort.
For item 2, we should also add a null check in regionserver.jsp:
{code}
HServerLoad.RegionLoad load = 
regionServer.createRegionLoad(r.getEncodedName());
...
<%= load.toString() %>
{code}


> Weird blocking between getOnlineRegion and createRegionLoad
> ---
>
> Key: HBASE-3654
> URL: https://issues.apache.org/jira/browse/HBASE-3654
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.1
>Reporter: Jean-Daniel Cryans
>Assignee: Subbu M Iyer
>Priority: Blocker
> Fix For: 0.90.2
>
> Attachments: ConcurrentHM, ConcurrentSKLM, CopyOnWrite, 
> HBASE-3654-ConcurrentHashMap-RemoveGetSync.patch, 
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_COWAL.patch,
>  
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_COWAL1.patch,
>  
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_ConcurrentHM.patch,
>  
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_ConcurrentHM1.patch,
>  TestOnlineRegions.java, hashmap
>
>
> Saw this when debugging something else:
> {code}
> "regionserver60020" prio=10 tid=0x7f538c1c nid=0x4c7 runnable 
> [0x7f53931da000]
>java.lang.Thread.State: RUNNABLE
>   at 
> org.apache.hadoop.hbase.regionserver.Store.getStorefilesIndexSize(Store.java:1380)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.createRegionLoad(HRegionServer.java:916)
>   - locked <0x000672aa0a00> (a 
> java.util.concurrent.ConcurrentSkipListMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.buildServerLoad(HRegionServer.java:767)
>   - locked <0x000656f62710> (a java.util.HashMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:722)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:591)
>   at java.lang.Thread.run(Thread.java:662)
> "IPC Reader 9 on port 60020" prio=10 tid=0x7f538c1be000 nid=0x4c6 waiting 
> for monitor entry [0x7f53932db000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getFromOnlineRegions(HRegionServer.java:2295)
>   - waiting to lock <0x000656f62710> (a java.util.HashMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getOnlineRegion(HRegionServer.java:2307)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2333)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.isMetaRegion(HRegionServer.java:379)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.apply(HRegionServer.java:422)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.apply(HRegionServer.java:361)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer.getQosLevel(HBaseServer.java:1126)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:982)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:946)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
>   - locked <0x000656e60068> (a 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>   at java.lang.Thread.run(Thread.java:662)
> ...
> "IPC Reader 0 on port 60020" prio=10 tid=0x7f538c08b000 nid=0x4bd waiting 
> for monitor entry [0x7f5393be4000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getFromOnlineRegions(HRegionServer.java:2295)
>   - waiting to lock <0x000656f62710> (a java.util.HashMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getOnlineRegion(HRegionServer.java:2307)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2333)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.isMetaRegion(HRegionServer.java:379)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.apply(HRegionServer.java:422)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.apply(HRegionServer.java:361)
>   at 
> org.apache.hadoop.h

[jira] [Updated] (HBASE-3238) HBase needs to have the CREATE permission on the parent of its ZooKeeper parent znode

2011-03-25 Thread Alex Newman (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Newman updated HBASE-3238:
---

Attachment: HBASE-3238-v2.patch

> HBase needs to have the CREATE permission on the parent of its ZooKeeper 
> parent znode
> -
>
> Key: HBASE-3238
> URL: https://issues.apache.org/jira/browse/HBASE-3238
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.0
>Reporter: Mathias Herberts
>Assignee: Alex Newman
>Priority: Blocker
> Attachments: 1, HBASE-3238-v2.patch, HBASE-3238.patch
>
>
> Upon startup, HBase attempts to create its zookeeper.parent.znode in 
> ZooKeeper; it does so using ZKUtil.createAndFailSilent, which, as its name 
> implies, will fail silently if the znode exists. But if HBase does not 
> have the CREATE permission on the parent of its zookeeper.parent.znode, then 
> the create attempt will fail with an 
> org.apache.zookeeper.KeeperException$NoAuthException and will terminate the 
> process.
> In a production environment where ZooKeeper has a managed namespace it is not 
> possible to give HBase CREATE permission on the parent of its parent znode.
> ZKUtil.createAndFailSilent should therefore be modified to check that the 
> znode exists using ZooKeeper.exists prior to attempting to create it.



[jira] [Commented] (HBASE-3238) HBase needs to have the CREATE permission on the parent of its ZooKeeper parent znode

2011-03-25 Thread Alex Newman (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011297#comment-13011297
 ] 

Alex Newman commented on HBASE-3238:


> Is there a create that works even if node already exists?
I talked to Henry today; we either need to catch it (which will catch a larger 
set of ACL exceptions) or check exists first (which may have performance 
implications). Alternatively, we can make our API more fine-grained and replace 
all the create /home.dir calls with this new fine-grained call. He suggested we 
catch this exception. Also, do we think it should be createOrFailSilent rather 
than createAndFailSilent?

> What is this '+ * @param str String to amend. -1'
My mistake; it's now removed.

> When you say above that it limits to write under /hbase, you don't mean 
> exactly /hbase, you mean whatever is configured as cluster home in zk?
Correct. I clarified the comment.
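To make the catch-it option concrete, it would be something like this (sketch 
only, not the final patch; the ACL constant is just for illustration):

{code}
// Sketch of createAndFailSilent tolerating NoAuth on an existing znode.
try {
  zk.create(znode, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
} catch (KeeperException.NodeExistsException nee) {
  // Fine -- the znode is already there, which is all we need.
} catch (KeeperException.NoAuthException nae) {
  // We may simply lack CREATE on the parent; only rethrow if the znode
  // really is missing.
  if (zk.exists(znode, false) == null) {
    throw nae;
  }
}
{code}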



> HBase needs to have the CREATE permission on the parent of its ZooKeeper 
> parent znode
> -
>
> Key: HBASE-3238
> URL: https://issues.apache.org/jira/browse/HBASE-3238
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.0
>Reporter: Mathias Herberts
>Assignee: Alex Newman
>Priority: Blocker
> Attachments: 1, HBASE-3238-v2.patch, HBASE-3238.patch
>
>
> Upon startup, HBase attempts to create its zookeeper.parent.znode in 
> ZooKeeper; it does so using ZKUtil.createAndFailSilent, which, as its name 
> implies, will fail silently if the znode exists. But if HBase does not 
> have the CREATE permission on the parent of its zookeeper.parent.znode, then 
> the create attempt will fail with an 
> org.apache.zookeeper.KeeperException$NoAuthException and will terminate the 
> process.
> In a production environment where ZooKeeper has a managed namespace it is not 
> possible to give HBase CREATE permission on the parent of its parent znode.
> ZKUtil.createAndFailSilent should therefore be modified to check that the 
> znode exists using ZooKeeper.exists prior to attempting to create it.



[jira] [Created] (HBASE-3701) revisit ArrayList creation

2011-03-25 Thread Ted Yu (JIRA)
revisit ArrayList creation
--

 Key: HBASE-3701
 URL: https://issues.apache.org/jira/browse/HBASE-3701
 Project: HBase
  Issue Type: Improvement
Reporter: Ted Yu


I am attaching the file which lists the files where ArrayList() is called 
without specifying an initial size.
We should identify which calls should use pre-sizing to boost performance.
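For example, where the expected size is known at the call site, the change is 
simply this (expectedCount being whatever bound is available there):

{code}
// Unsized: starts at the default capacity and regrows (Arrays.copyOf) as it fills.
List<KeyValue> unsized = new ArrayList<KeyValue>();

// Pre-sized: a single allocation of the right size up front.
List<KeyValue> presized = new ArrayList<KeyValue>(expectedCount);
{code}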



[jira] [Assigned] (HBASE-2418) add support for ZooKeeper authentication

2011-03-25 Thread Alex Newman (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Newman reassigned HBASE-2418:
--

Assignee: (was: Alex Newman)

Didn't mean to click that button

> add support for ZooKeeper authentication
> 
>
> Key: HBASE-2418
> URL: https://issues.apache.org/jira/browse/HBASE-2418
> Project: HBase
>  Issue Type: Improvement
>  Components: master, regionserver
>Reporter: Patrick Hunt
>Priority: Critical
>
> Some users may run a ZooKeeper cluster in "multi tenant mode" meaning that 
> more than one client service would
> like to share a single ZooKeeper service instance (cluster). In this case the 
> client services typically want to protect
> their data (ZK znodes) from access by other services (tenants) on the 
> cluster. Say you are running HBase and Solr 
> and Neo4j, or multiple HBase instances, etc... having 
> authentication/authorization on the znodes is important for both 
> security and helping to ensure that services don't interact negatively (touch 
> each other's data).
> Today HBase does not have support for authentication or authorization. This 
> should be added to the HBase clients
> that are accessing the ZK cluster. In general it means calling addAuthInfo 
> once after a session is established:
> http://hadoop.apache.org/zookeeper/docs/current/api/org/apache/zookeeper/ZooKeeper.html#addAuthInfo(java.lang.String,
>  byte[])
> with a user specific credential, often times this is a shared secret or 
> certificate. You may be able to statically configure this
> in some cases (config string or file to read from), however in my case in 
> particular you may need to access it programmatically,
> which adds complexity as the end user may need to load code into HBase for 
> accessing the credential.
> Secondly you need to specify a non "world" ACL when interacting with znodes 
> (create primarily):
> http://hadoop.apache.org/zookeeper/docs/current/api/org/apache/zookeeper/data/ACL.html
> http://hadoop.apache.org/zookeeper/docs/current/api/org/apache/zookeeper/ZooDefs.html
> Feel free to ping the ZooKeeper team if you have questions. It might also be 
> good to discuss with some 
> potential end users - in particular regarding how the end user can specify 
> the credential.
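For reference, the client-side calls referred to above look roughly like this 
(sketch only; the scheme, credential and ACL are placeholders):

{code}
// Authenticate the session once after it is established.
ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, watcher);
zk.addAuthInfo("digest", "hbase:secret".getBytes());

// Create znodes with a non-"world" ACL so other tenants cannot touch them.
zk.create("/hbase", new byte[0], ZooDefs.Ids.CREATOR_ALL_ACL,
    CreateMode.PERSISTENT);
{code}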



[jira] [Assigned] (HBASE-2418) add support for ZooKeeper authentication

2011-03-25 Thread Alex Newman (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Newman reassigned HBASE-2418:
--

Assignee: Alex Newman

> add support for ZooKeeper authentication
> 
>
> Key: HBASE-2418
> URL: https://issues.apache.org/jira/browse/HBASE-2418
> Project: HBase
>  Issue Type: Improvement
>  Components: master, regionserver
>Reporter: Patrick Hunt
>Assignee: Alex Newman
>Priority: Critical
>
> Some users may run a ZooKeeper cluster in "multi tenant mode" meaning that 
> more than one client service would
> like to share a single ZooKeeper service instance (cluster). In this case the 
> client services typically want to protect
> their data (ZK znodes) from access by other services (tenants) on the 
> cluster. Say you are running HBase and Solr 
> and Neo4j, or multiple HBase instances, etc... having 
> authentication/authorization on the znodes is important for both 
> security and helping to ensure that services don't interact negatively (touch 
> each other's data).
> Today HBase does not have support for authentication or authorization. This 
> should be added to the HBase clients
> that are accessing the ZK cluster. In general it means calling addAuthInfo 
> once after a session is established:
> http://hadoop.apache.org/zookeeper/docs/current/api/org/apache/zookeeper/ZooKeeper.html#addAuthInfo(java.lang.String,
>  byte[])
> with a user specific credential, often times this is a shared secret or 
> certificate. You may be able to statically configure this
> in some cases (config string or file to read from), however in my case in 
> particular you may need to access it programmatically,
> which adds complexity as the end user may need to load code into HBase for 
> accessing the credential.
> Secondly you need to specify a non "world" ACL when interacting with znodes 
> (create primarily):
> http://hadoop.apache.org/zookeeper/docs/current/api/org/apache/zookeeper/data/ACL.html
> http://hadoop.apache.org/zookeeper/docs/current/api/org/apache/zookeeper/ZooDefs.html
> Feel free to ping the ZooKeeper team if you have questions. It might also be 
> good to discuss with some 
> potential end users - in particular regarding how the end user can specify 
> the credential.



[jira] [Updated] (HBASE-3701) revisit ArrayList creation

2011-03-25 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-3701:
--

Attachment: ArrayList.txt

> revisit ArrayList creation
> --
>
> Key: HBASE-3701
> URL: https://issues.apache.org/jira/browse/HBASE-3701
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Yu
> Attachments: ArrayList.txt
>
>
> I am attaching the file which lists the files where ArrayList() is called 
> without specifying an initial size.
> We should identify which calls should use pre-sizing to boost performance.



[jira] [Commented] (HBASE-3607) Cursor functionality for results generated by Coprocessors

2011-03-25 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011346#comment-13011346
 ] 

Gary Helmling commented on HBASE-3607:
--

I posted a review of this on review board:
https://review.cloudera.org/r/1624/

(We really need to get the damn email spam scoring fixed).

> Cursor functionality for results generated by Coprocessors
> --
>
> Key: HBASE-3607
> URL: https://issues.apache.org/jira/browse/HBASE-3607
> Project: HBase
>  Issue Type: New Feature
>  Components: coprocessors
>Reporter: Himanshu Vashishtha
> Attachments: patch-2.txt
>
>
> I tried to come up with scanner-like functionality for results generated by 
> coprocessors at the region level. 
> This is just a PoC, and it would be good to have your comments on it.
> It has support for both incremental and in-memory result sets. Attached is a 
> patch that has a test case for an incremental result (i.e., the client receives a 
> cursorId from the CP core method, instantiates a cursor object, and 
> iterates over the result set; it can set a cache limit on the CursorCallable 
> object to reduce the number of RPCs, just like scanners).
> In its current state it has some limitations too, e.g. it is region-specific 
> only, i.e., one can instantiate and use a cursor at one region only 
> (and that region is determined by the input row while instantiating the 
> cursor). I will try to expand it so that it can have at least sequential 
> access to other regions, but as I said, I want the opinion of experts to know 
> whether this approach really makes sense or not.
> I have tested it only with the built-in testing framework on my laptop.
> It will be good to copy the use case into the description here too:
> Test table has rows like:
>  /**
>* The scenario is that I have these rows keys in the test table:
>   'aaa-123'
>   'aaa-456'
>   'abc-111'
>   'abd-111'
>   'abd-222'
>   & I want to return:
>   ('aaa', 2)
>   ('abc', 1)
>   ('abd', 2)



[jira] [Updated] (HBASE-3694) high multiput latency due to checking global mem store size in a synchronized function

2011-03-25 Thread Liyin Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liyin Tang updated HBASE-3694:
--

Attachment: Hbase-3694[r1085508]_4.patch

Added one more reference overhead in HRegion.
Updated the patch.

> high multiput latency due to checking global mem store size in a synchronized 
> function
> --
>
> Key: HBASE-3694
> URL: https://issues.apache.org/jira/browse/HBASE-3694
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: Hbase-3694[r1085306], Hbase-3694[r1085306]_2.patch, 
> Hbase-3694[r1085306]_3.patch, Hbase-3694[r1085508]_4.patch
>
>
> The problem is that we found the multiput latency is very high.
> In our case, we have almost 22 regions in each RS, and no flushes 
> happened during these puts.
> After investigation, we believe that the root cause is the function 
> getGlobalMemStoreSize, which checks the high water mark of the memstore. 
> This function takes almost 40% of the total execution time of multiput when 
> instrumenting some metrics in the code.  
> The actual percentage may be even higher. The execution time is spent on 
> synchronization contention.
> One solution is to keep a static var in HRegion holding the global memstore 
> size instead of calculating it every time.
> Why use a static variable?
> Since all the HRegion objects in the same JVM share the same memory heap, 
> they need to share fate as well.
> The static variable, globalMemStoreSize, naturally shows the total mem usage 
> in this shared memory heap for this JVM.
> If multiple RS need to run in the same JVM, they still need only one 
> globalMemStoreSize.
> If multiple RS run on different JVMs, everything is fine.
> After the change, in our case, the avg multiput latency decreased from 60ms to 
> 10ms.
> I will submit a patch based on the current trunk.
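A minimal sketch of the accounting idea (field and method names here are 
illustrative, not the actual patch):

{code}
import java.util.concurrent.atomic.AtomicLong;

public class HRegion {
  // JVM-wide running total of memstore usage, shared by all HRegion instances.
  private static final AtomicLong globalMemStoreSize = new AtomicLong(0);

  // Cheap lock-free read, instead of summing every region's memstore size
  // under a synchronized method on each multiput.
  static long getGlobalMemStoreSize() {
    return globalMemStoreSize.get();
  }

  // Adjusted whenever a region's memstore grows or is flushed.
  static void incGlobalMemStoreSize(long delta) {
    globalMemStoreSize.addAndGet(delta);
  }
}
{code}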



[jira] [Updated] (HBASE-3686) ClientScanner skips too many rows on recovery if using scanner caching

2011-03-25 Thread Sean Sechrist (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Sechrist updated HBASE-3686:
-

Affects Version/s: 0.90.1
  Summary: ClientScanner skips too many rows on recovery if using 
scanner caching  (was: Scanner timeout on RegionServer but Client won't know 
what happened)

Updated title to be more accurate.

> ClientScanner skips too many rows on recovery if using scanner caching
> --
>
> Key: HBASE-3686
> URL: https://issues.apache.org/jira/browse/HBASE-3686
> Project: HBase
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.89.20100924, 0.90.1
>Reporter: Sean Sechrist
>Priority: Minor
>
> This can cause rows to be lost from a scan.
> See this thread where the issue was brought up: 
> http://search-hadoop.com/m/xITBQ136xGJ1
> If hbase.regionserver.lease.period is higher on the client than the server we 
> can get this series of events: 
> 1. Client is scanning along happily, and does something slow.
> 2. Scanner times out on region server
> 3. Client calls HTable.ClientScanner.next()
> 4. The region server throws an UnknownScannerException
> 5. Client catches the exception and sees that it has not been longer than its 
> hbase.regionserver.lease.period config, so it doesn't throw a 
> ScannerTimeoutException. Instead, it treats it like an NSRE.
> Right now the workaround is to make sure the configs are consistent. 
> A possible fix would be to use whatever the region server's scanner timeout 
> is, rather than the local one.



[jira] [Updated] (HBASE-3686) ClientScanner skips too many rows on recovery if using scanner caching

2011-03-25 Thread Sean Sechrist (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Sechrist updated HBASE-3686:
-

Attachment: 3686.patch

Added patch that will set caching to 1 before getting the last row that should 
be skipped during recovery. Also added 2 unit tests to reproduce both 
situations (RS death and mismatched scanner timeouts).

> ClientScanner skips too many rows on recovery if using scanner caching
> --
>
> Key: HBASE-3686
> URL: https://issues.apache.org/jira/browse/HBASE-3686
> Project: HBase
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.89.20100924, 0.90.1
>Reporter: Sean Sechrist
>Priority: Minor
> Attachments: 3686.patch
>
>
> This can cause rows to be lost from a scan.
> See this thread where the issue was brought up: 
> http://search-hadoop.com/m/xITBQ136xGJ1
> If hbase.regionserver.lease.period is higher on the client than the server we 
> can get this series of events: 
> 1. Client is scanning along happily, and does something slow.
> 2. Scanner times out on region server
> 3. Client calls HTable.ClientScanner.next()
> 4. The region server throws an UnknownScannerException
> 5. Client catches the exception and sees that it has not been longer than its 
> hbase.regionserver.lease.period config, so it doesn't throw a 
> ScannerTimeoutException. Instead, it treats it like an NSRE.
> Right now the workaround is to make sure the configs are consistent. 
> A possible fix would be to use whatever the region server's scanner timeout 
> is, rather than the local one.



[jira] [Updated] (HBASE-3686) ClientScanner skips too many rows on recovery if using scanner caching

2011-03-25 Thread Sean Sechrist (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Sechrist updated HBASE-3686:
-

Release Note: Fixed bug where rows would be skipped if region server dies 
during scan and scanner caching > 1
  Status: Patch Available  (was: Open)

> ClientScanner skips too many rows on recovery if using scanner caching
> --
>
> Key: HBASE-3686
> URL: https://issues.apache.org/jira/browse/HBASE-3686
> Project: HBase
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.90.1, 0.89.20100924
>Reporter: Sean Sechrist
>Priority: Minor
> Attachments: 3686.patch
>
>
> This can cause rows to be lost from a scan.
> See this thread where the issue was brought up: 
> http://search-hadoop.com/m/xITBQ136xGJ1
> If hbase.regionserver.lease.period is higher on the client than the server we 
> can get this series of events: 
> 1. Client is scanning along happily, and does something slow.
> 2. Scanner times out on region server
> 3. Client calls HTable.ClientScanner.next()
> 4. The region server throws an UnknownScannerException
> 5. Client catches the exception and sees that it has not been longer than its 
> hbase.regionserver.lease.period config, so it doesn't throw a 
> ScannerTimeoutException. Instead, it treats it like an NSRE.
> Right now the workaround is to make sure the configs are consistent. 
> A possible fix would be to use whatever the region server's scanner timeout 
> is, rather than the local one.



[jira] [Resolved] (HBASE-3602) [hbck] Print out better info if detects overlapping regions in .META.

2011-03-25 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack resolved HBASE-3602.
--

Resolution: Duplicate

Duplicate of HBASE-3695

> [hbck] Print out better info if detects overlapping regions in .META.
> -
>
> Key: HBASE-3602
> URL: https://issues.apache.org/jira/browse/HBASE-3602
> Project: HBase
>  Issue Type: Bug
>Reporter: stack
>
> Just had a case of what looked like lost edits from .META. resulting in 
> overlapping regions up in .META.  Needed online merge tool to fix (didn't 
> have it).  Also, finding the bad regions wasn't as easy as it could have 
> been.  hbck printed out cryptic message about count of regions not being 
> right, not enough 'edges'.  ./bin/check_meta.rb actually wrote out the 
> problem issues saying a hole in table.  Move this into hbck and kill 
> bin/check_meta.rb.



[jira] [Commented] (HBASE-3695) Some improvements to Hbck to test the entire region chain in Meta and provide better error reporting

2011-03-25 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011385#comment-13011385
 ] 

stack commented on HBASE-3695:
--

Resolved the dupe.  There is already an online merge too.  And yes, need to 
make it so region hacking in .META. is a bit easier, perhaps some facility in 
shell (Saw someone trying to do "put '.META.', 'STRING_VERSION_OF_HREGIONINFO', 
etc" which plain won't work, but, if you think a second, should work).

> Some improvements to Hbck to test the entire region chain in Meta and provide 
> better error reporting
> 
>
> Key: HBASE-3695
> URL: https://issues.apache.org/jira/browse/HBASE-3695
> Project: HBase
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 0.90.1
>Reporter: Marc Limotte
>Assignee: Marc Limotte
> Fix For: 0.90.3
>
>
> The current Hbck tool will miss some inconsistencies in Meta, and in other 
> cases will detect an issue, but does not provide much in the way of useful 
> feedback.  
> * Incorporate the full region chain tests (similar to check_meta.rb). I.e. 
> look for overlaps, holes and cycles. I believe check_meta.rb will be 
> redundant after this change.
> * More unit tests, and better tests that will test the actual error 
> discovered, instead of just errors true/false.
> * In the case of overlaps and holes, output both ends of the broken chain.
> * Previous implementation runs check() twice.  This is inefficient and, more 
> importantly, reports redundant errors which could be confusing to the user.



[jira] [Updated] (HBASE-3686) ClientScanner skips too many rows on recovery if using scanner caching

2011-03-25 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-3686:
-

  Resolution: Fixed
Assignee: Sean Sechrist
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Committed branch and trunk.  Nice one Sean (Made you a contributor and assigned 
you this issue)

> ClientScanner skips too many rows on recovery if using scanner caching
> --
>
> Key: HBASE-3686
> URL: https://issues.apache.org/jira/browse/HBASE-3686
> Project: HBase
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.89.20100924, 0.90.1
>Reporter: Sean Sechrist
>Assignee: Sean Sechrist
>Priority: Minor
> Attachments: 3686.patch
>
>
> This can cause rows to be lost from a scan.
> See this thread where the issue was brought up: 
> http://search-hadoop.com/m/xITBQ136xGJ1
> If hbase.regionserver.lease.period is higher on the client than the server we 
> can get this series of events: 
> 1. Client is scanning along happily, and does something slow.
> 2. Scanner times out on region server
> 3. Client calls HTable.ClientScanner.next()
> 4. The region server throws an UnknownScannerException
> 5. Client catches the exception and sees that it has not been longer than its 
> hbase.regionserver.lease.period config, so it doesn't throw a 
> ScannerTimeoutException. Instead, it treats it like an NSRE.
> Right now the workaround is to make sure the configs are consistent. 
> A possible fix would be to use whatever the region server's scanner timeout 
> is, rather than the local one.



[jira] [Commented] (HBASE-3238) HBase needs to have the CREATE permission on the parent of its ZooKeeper parent znode

2011-03-25 Thread Alex Newman (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011398#comment-13011398
 ] 

Alex Newman commented on HBASE-3238:


I just realized: what about we just catch the NoAuth exception and throw 
another error if the node doesn't exist? This would save us an exists() call in 
the normal case.

> HBase needs to have the CREATE permission on the parent of its ZooKeeper 
> parent znode
> -
>
> Key: HBASE-3238
> URL: https://issues.apache.org/jira/browse/HBASE-3238
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.0
>Reporter: Mathias Herberts
>Assignee: Alex Newman
>Priority: Blocker
> Attachments: 1, HBASE-3238-v2.patch, HBASE-3238.patch
>
>
> Upon startup, HBase attempts to create its zookeeper.parent.znode in 
> ZooKeeper; it does so using ZKUtil.createAndFailSilent, which, as its name 
> implies, will fail silently if the znode exists. But if HBase does not 
> have the CREATE permission on the parent of its zookeeper.parent.znode, then 
> the create attempt will fail with an 
> org.apache.zookeeper.KeeperException$NoAuthException and will terminate the 
> process.
> In a production environment where ZooKeeper has a managed namespace it is not 
> possible to give HBase CREATE permission on the parent of its parent znode.
> ZKUtil.createAndFailSilent should therefore be modified to check that the 
> znode exists using ZooKeeper.exists prior to attempting to create it.



[jira] [Commented] (HBASE-3701) revisit ArrayList creation

2011-03-25 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011409#comment-13011409
 ] 

stack commented on HBASE-3701:
--

Can we presize any of these Ted?

> revisit ArrayList creation
> --
>
> Key: HBASE-3701
> URL: https://issues.apache.org/jira/browse/HBASE-3701
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Yu
> Attachments: ArrayList.txt
>
>
> I am attaching the file which lists the files where ArrayList() is called 
> without specifying an initial size.
> We should identify which calls should use pre-sizing to boost performance.



[jira] [Commented] (HBASE-3654) Weird blocking between getOnlineRegion and createRegionLoad

2011-03-25 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011408#comment-13011408
 ] 

stack commented on HBASE-3654:
--

I committed your addition, Subbu.  And Ted, I did as you suggested in the same commit:

{code}
Index: src/main/resources/hbase-webapps/regionserver/regionserver.jsp
===
--- src/main/resources/hbase-webapps/regionserver/regionserver.jsp  
(revision 1085553)
+++ src/main/resources/hbase-webapps/regionserver/regionserver.jsp  
(working copy)
@@ -56,7 +56,7 @@
  %>
 <%= r.getRegionNameAsString() %>
 <%= Bytes.toStringBinary(r.getStartKey()) %><%= 
Bytes.toStringBinary(r.getEndKey()) %>
-<%= load.toString() %>
+<%= load == null? "null": load.toString() %>
 
 <%   } %>
{code}

Thanks lads.  Good stuff.

> Weird blocking between getOnlineRegion and createRegionLoad
> ---
>
> Key: HBASE-3654
> URL: https://issues.apache.org/jira/browse/HBASE-3654
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.1
>Reporter: Jean-Daniel Cryans
>Assignee: Subbu M Iyer
>Priority: Blocker
> Fix For: 0.90.2
>
> Attachments: ConcurrentHM, ConcurrentSKLM, CopyOnWrite, 
> HBASE-3654-ConcurrentHashMap-RemoveGetSync.patch, 
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_COWAL.patch,
>  
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_COWAL1.patch,
>  
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_ConcurrentHM.patch,
>  
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_ConcurrentHM1.patch,
>  TestOnlineRegions.java, hashmap
>
>
> Saw this when debugging something else:
> {code}
> "regionserver60020" prio=10 tid=0x7f538c1c nid=0x4c7 runnable 
> [0x7f53931da000]
>java.lang.Thread.State: RUNNABLE
>   at 
> org.apache.hadoop.hbase.regionserver.Store.getStorefilesIndexSize(Store.java:1380)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.createRegionLoad(HRegionServer.java:916)
>   - locked <0x000672aa0a00> (a 
> java.util.concurrent.ConcurrentSkipListMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.buildServerLoad(HRegionServer.java:767)
>   - locked <0x000656f62710> (a java.util.HashMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:722)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:591)
>   at java.lang.Thread.run(Thread.java:662)
> "IPC Reader 9 on port 60020" prio=10 tid=0x7f538c1be000 nid=0x4c6 waiting 
> for monitor entry [0x7f53932db000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getFromOnlineRegions(HRegionServer.java:2295)
>   - waiting to lock <0x000656f62710> (a java.util.HashMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getOnlineRegion(HRegionServer.java:2307)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2333)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.isMetaRegion(HRegionServer.java:379)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.apply(HRegionServer.java:422)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.apply(HRegionServer.java:361)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer.getQosLevel(HBaseServer.java:1126)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:982)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:946)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
>   - locked <0x000656e60068> (a 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>   at java.lang.Thread.run(Thread.java:662)
> ...
> "IPC Reader 0 on port 60020" prio=10 tid=0x7f538c08b000 nid=0x4bd waiting 
> for monitor entry [0x7f5393be4000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getFromOnlineRegions(HRegionServer.java:2295)
>   - waiting to lock <0x000656f62710> (a java.util.HashMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getOnlineRegion(HRegionServer.java:2307)
>   at 
> org.apache.ha

[jira] [Commented] (HBASE-3694) high multiput latency due to checking global mem store size in a synchronized function

2011-03-25 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011414#comment-13011414
 ] 

stack commented on HBASE-3694:
--

Patch looks good but I stumble when I come to this:

{code}
+  /**
+   * @return the global mem store size in the region server
+   */
+  public AtomicLong getGlobalMemstoreSize();
{code}

Here we are adding a getter for a single value to the RSS Interface.  RSS is 
usually about more macro-type services than a single data member value.  Rarely 
would a user of RSS be interested in this single value.  More useful, I'd 
think, would be if the RSS returned a class that allowed the client a 
(read-only) view on multiple RS values; e.g. above there is talk of a 
MemoryAccountingManager which I imagine would hold this memstore size among 
other values.
We could change getRpcMetrics to be a generic getMetrics and it would return a 
RegionServerMetrics instance that would include an instance of HBaseRpcMetrics 
and the current state of the above counter?
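For illustration only, a minimal sketch of that shape (all names below are stand-ins; the existing RegionServerMetrics class is not structured like this, and nothing here is the actual RSS interface):

{code}
// Illustrative sketch only -- not the actual RSS interface or the existing
// RegionServerMetrics class; all names here are stand-ins.
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.hbase.ipc.HBaseRpcMetrics;

// (sketch of the RegionServerServices addition, in its own file)
interface RegionServerServicesSketch {
  /** @return a read-only view over several RS values, not one getter per value. */
  RegionServerView getMetrics();
}

// (sketch of the returned read-only view, in its own file)
class RegionServerView {
  private final HBaseRpcMetrics rpcMetrics;    // existing RPC metrics holder
  private final AtomicLong globalMemstoreSize; // the counter discussed above

  RegionServerView(HBaseRpcMetrics rpcMetrics, AtomicLong globalMemstoreSize) {
    this.rpcMetrics = rpcMetrics;
    this.globalMemstoreSize = globalMemstoreSize;
  }

  HBaseRpcMetrics getRpcMetrics() { return rpcMetrics; }

  /** Callers can read the current value but cannot mutate the counter. */
  long getGlobalMemstoreSize() { return globalMemstoreSize.get(); }
}
{code}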





> high multiput latency due to checking global mem store size in a synchronized 
> function
> --
>
> Key: HBASE-3694
> URL: https://issues.apache.org/jira/browse/HBASE-3694
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: Hbase-3694[r1085306], Hbase-3694[r1085306]_2.patch, 
> Hbase-3694[r1085306]_3.patch, Hbase-3694[r1085508]_4.patch
>
>
> The problem is that we found the multiput latency to be very high.
> In our case, we have roughly 22 regions in each RS and no flushes happened 
> during these puts.
> After investigation, we believe the root cause is the function 
> getGlobalMemStoreSize, which checks the high water mark of the memstore. 
> When we instrumented the code with some metrics, this function accounted 
> for almost 40% of the total multiput execution time.
> The actual percentage may be even higher. The execution time is spent on 
> synchronization contention.
> One solution is to keep a static variable in HRegion that tracks the global 
> MemStore size instead of recalculating it every time.
> Why use a static variable?
> Since all the HRegion objects in the same JVM share the same memory heap, 
> they need to share fate as well.
> The static variable, globalMemStoreSize, naturally reflects the total memory 
> usage in this shared heap for the JVM.
> If multiple RS instances need to run in the same JVM, they still need only 
> one globalMemStoreSize.
> If multiple RS instances run on different JVMs, everything is fine.
> After this change, in our case, the average multiput latency decreased from 
> 60ms to 10ms.
> I will submit a patch based on the current trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (HBASE-3693) isMajorCompaction() check triggers lots of listStatus DFS RPC calls from HBase

2011-03-25 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack resolved HBASE-3693.
--

  Resolution: Fixed
Hadoop Flags: [Reviewed]

Committed to TRUNK (not to branch since only an 'improvement').  Nice patch 
Liyin.  Thanks.

> isMajorCompaction() check triggers lots of listStatus DFS RPC calls from HBase
> --
>
> Key: HBASE-3693
> URL: https://issues.apache.org/jira/browse/HBASE-3693
> Project: HBase
>  Issue Type: Improvement
>Reporter: Kannan Muthukkaruppan
>Assignee: Liyin Tang
> Attachments: Hbase-3693[r1085248]_2.patch, Hbase-3693[r1085306].patch
>
>
> We noticed that there are lots of listStatus calls on the ColumnFamily 
> directories within each region, coming from this codepath:
> {code}
> compactionSelection()
>  --> isMajorCompaction 
> --> getLowestTimestamp()
>-->  FileStatus[] stats = fs.listStatus(p);
> {code}
> So on every compactionSelection() we're taking this hit. While not 
> immediately an issue, just from log inspection this accounts for quite a 
> large number of RPCs to the namenode at the moment and seems like an 
> unnecessary load to be putting on it.
> Seems like it would be easy to cache the timestamp for each opened/created 
> StoreFile, in memory, in the region server, and avoid going to DFS each time 
> for this information.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3238) HBase needs to have the CREATE permission on the parent of its ZooKeeper parent znode

2011-03-25 Thread Mathias Herberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011420#comment-13011420
 ] 

Mathias Herberts commented on HBASE-3238:
-

Catching the NoAuth would do the trick I guess, but what do we gain compared 
with the exists call when the znode has already been created?

> HBase needs to have the CREATE permission on the parent of its ZooKeeper 
> parent znode
> -
>
> Key: HBASE-3238
> URL: https://issues.apache.org/jira/browse/HBASE-3238
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.0
>Reporter: Mathias Herberts
>Assignee: Alex Newman
>Priority: Blocker
> Attachments: 1, HBASE-3238-v2.patch, HBASE-3238.patch
>
>
> Upon startup, HBase attempts to create its zookeeper.parent.znode in 
> ZooKeeper. It does so using ZKUtil.createAndFailSilent, which, as its name 
> implies, will fail silently if the znode already exists. But if HBase does 
> not have the CREATE permission on the parent of its zookeeper.parent.znode, 
> then the create attempt will fail with an 
> org.apache.zookeeper.KeeperException$NoAuthException and will terminate the 
> process.
> In a production environment where ZooKeeper has a managed namespace it is not 
> possible to give HBase CREATE permission on the parent of its parent znode.
> ZKUtil.createAndFailSilent should therefore be modified to check that the 
> znode exists using ZooKeeper.exists prior to attempting to create it.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3694) high multiput latency due to checking global mem store size in a synchronized function

2011-03-25 Thread Liyin Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011421#comment-13011421
 ] 

Liyin Tang commented on HBASE-3694:
---

Thanks Stack.
I think adding globalMemstoreSize into RegionServerMetrics makes more sense 
than adding a new class MemoryAccountingManager?

> high multiput latency due to checking global mem store size in a synchronized 
> function
> --
>
> Key: HBASE-3694
> URL: https://issues.apache.org/jira/browse/HBASE-3694
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: Hbase-3694[r1085306], Hbase-3694[r1085306]_2.patch, 
> Hbase-3694[r1085306]_3.patch, Hbase-3694[r1085508]_4.patch
>
>
> The problem is that we found the multiput latency to be very high.
> In our case, we have roughly 22 regions in each RS and no flushes happened 
> during these puts.
> After investigation, we believe the root cause is the function 
> getGlobalMemStoreSize, which checks the high water mark of the memstore. 
> When we instrumented the code with some metrics, this function accounted 
> for almost 40% of the total multiput execution time.
> The actual percentage may be even higher. The execution time is spent on 
> synchronization contention.
> One solution is to keep a static variable in HRegion that tracks the global 
> MemStore size instead of recalculating it every time.
> Why use a static variable?
> Since all the HRegion objects in the same JVM share the same memory heap, 
> they need to share fate as well.
> The static variable, globalMemStoreSize, naturally reflects the total memory 
> usage in this shared heap for the JVM.
> If multiple RS instances need to run in the same JVM, they still need only 
> one globalMemStoreSize.
> If multiple RS instances run on different JVMs, everything is fine.
> After this change, in our case, the average multiput latency decreased from 
> 60ms to 10ms.
> I will submit a patch based on the current trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3238) HBase needs to have the CREATE permission on the parent of its ZooKeeper parent znode

2011-03-25 Thread Alex Newman (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011436#comment-13011436
 ] 

Alex Newman commented on HBASE-3238:


I think the idea is to only call exists if we need to.
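For what it's worth, a rough sketch of that idea against the plain ZooKeeper client API (illustrative only; the real ZKUtil.createAndFailSilent goes through HBase's ZooKeeperWatcher wrapper rather than a raw ZooKeeper handle):

{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch of "create first, only fall back to exists() when create fails".
public class CreateAndFailSilentSketch {
  public static void createAndFailSilent(ZooKeeper zk, String znode)
      throws KeeperException, InterruptedException {
    try {
      zk.create(znode, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
      // Already created: nothing to do, and no extra exists() round trip needed.
    } catch (KeeperException.NoAuthException e) {
      // No CREATE permission on the parent. That is acceptable as long as the
      // znode itself already exists, so only now do we pay for the exists() call.
      if (zk.exists(znode, false) == null) {
        throw e;
      }
    }
  }
}
{code}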

> HBase needs to have the CREATE permission on the parent of its ZooKeeper 
> parent znode
> -
>
> Key: HBASE-3238
> URL: https://issues.apache.org/jira/browse/HBASE-3238
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.0
>Reporter: Mathias Herberts
>Assignee: Alex Newman
>Priority: Blocker
> Attachments: 1, HBASE-3238-v2.patch, HBASE-3238.patch
>
>
> Upon startup, HBase attempts to create its zookeeper.parent.znode in 
> ZooKeeper. It does so using ZKUtil.createAndFailSilent, which, as its name 
> implies, will fail silently if the znode already exists. But if HBase does 
> not have the CREATE permission on the parent of its zookeeper.parent.znode, 
> then the create attempt will fail with an 
> org.apache.zookeeper.KeeperException$NoAuthException and will terminate the 
> process.
> In a production environment where ZooKeeper has a managed namespace it is not 
> possible to give HBase CREATE permission on the parent of its parent znode.
> ZKUtil.createAndFailSilent should therefore be modified to check that the 
> znode exists using ZooKeeper.exists prior to attempting to create it.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3694) high multiput latency due to checking global mem store size in a synchronized function

2011-03-25 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011437#comment-13011437
 ] 

Todd Lipcon commented on HBASE-3694:


I don't want to conflate metrics (things that get exported for monitoring 
purposes) with internal accounting (things which are necessarily correct and 
up-to-date for proper functioning of the server).

Some internal accounting may be exposed as metrics, but the two subsystems are 
quite separate in my mind.

Does that make sense?

> high multiput latency due to checking global mem store size in a synchronized 
> function
> --
>
> Key: HBASE-3694
> URL: https://issues.apache.org/jira/browse/HBASE-3694
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: Hbase-3694[r1085306], Hbase-3694[r1085306]_2.patch, 
> Hbase-3694[r1085306]_3.patch, Hbase-3694[r1085508]_4.patch
>
>
> The problem is that we found the multiput latency to be very high.
> In our case, we have roughly 22 regions in each RS and no flushes happened 
> during these puts.
> After investigation, we believe the root cause is the function 
> getGlobalMemStoreSize, which checks the high water mark of the memstore. 
> When we instrumented the code with some metrics, this function accounted 
> for almost 40% of the total multiput execution time.
> The actual percentage may be even higher. The execution time is spent on 
> synchronization contention.
> One solution is to keep a static variable in HRegion that tracks the global 
> MemStore size instead of recalculating it every time.
> Why use a static variable?
> Since all the HRegion objects in the same JVM share the same memory heap, 
> they need to share fate as well.
> The static variable, globalMemStoreSize, naturally reflects the total memory 
> usage in this shared heap for the JVM.
> If multiple RS instances need to run in the same JVM, they still need only 
> one globalMemStoreSize.
> If multiple RS instances run on different JVMs, everything is fine.
> After this change, in our case, the average multiput latency decreased from 
> 60ms to 10ms.
> I will submit a patch based on the current trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3579) Spring cleaning in bin directory, a bunch of the scripts don't work anymore

2011-03-25 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011442#comment-13011442
 ] 

Todd Lipcon commented on HBASE-3579:


I think copytable.rb is also likely to be broken.

IMO we should remove these scripts entirely. They're not covered by test cases 
(which is why we've done several releases where they are entirely 
non-functional), they're barely easier to write than Java (harder to write if 
you don't know Ruby), and they're harder to integrate into Java-based workflows.

> Spring cleaning in bin directory, a bunch of the scripts don't work anymore
> ---
>
> Key: HBASE-3579
> URL: https://issues.apache.org/jira/browse/HBASE-3579
> Project: HBase
>  Issue Type: Bug
>Reporter: stack
>
> Need to do a review of bin contents.  A bunch of the scripts are stale now.   
> loadtable.rb needs to be removed, etc.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3694) high multiput latency due to checking global mem store size in a synchronized function

2011-03-25 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011444#comment-13011444
 ] 

Ted Yu commented on HBASE-3694:
---

memstoreSizeMB is a member of RegionServerMetrics and is updated every 
hbase.regionserver.msginterval.
See line 1162 in HRegionServer.java:
{code}
this.metrics.memstoreSizeMB.set((int) (memstoreSize / (1024 * 1024)));
{code}
memstoreSizeMB is of type MetricsIntValue, which is a subclass of MetricsBase 
and stores its value in:
{code}
  private int value;
{code}
We can create a MetricsAtomicLongValue class with the following signature:
{code}
public class MetricsAtomicLongValue extends MetricsBase{
  private AtomicLong value;  
  private boolean changed;
{code}
If we reach agreement on adding this method to RegionServerServices (which is 
available in HRegionServer and being used by MemStoreFlusher):
{code}
  /**
   * @return Region server metrics instance.
   */
  public RegionServerMetrics getMetrics() {
{code}

then we can change memstoreSizeMB to memstoreSize which is of type 
MetricsAtomicLongValue and blend Liyin's changes onto memstoreSize.
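For illustration, a slightly fuller sketch of such a class (the MetricsBase constructor and pushMetric signature below are assumed to mirror MetricsIntValue and have not been verified):

{code}
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.metrics.MetricsRecord;
import org.apache.hadoop.metrics.util.MetricsBase;

// Sketch only: an AtomicLong-backed metric value, updated lock-free by the
// write path and pushed to the metrics system on the usual update interval.
public class MetricsAtomicLongValue extends MetricsBase {
  private final AtomicLong value = new AtomicLong(0);
  private volatile boolean changed = true;

  public MetricsAtomicLongValue(String name, String description) {
    super(name, description);
  }

  public void set(long newValue) {
    value.set(newValue);
    changed = true;
  }

  public long addAndGet(long delta) {
    changed = true;
    return value.addAndGet(delta);
  }

  public long get() {
    return value.get();
  }

  @Override
  public void pushMetric(MetricsRecord mr) {
    if (changed) {
      mr.setMetric(getName(), value.get());
      changed = false;
    }
  }
}
{code}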

> high multiput latency due to checking global mem store size in a synchronized 
> function
> --
>
> Key: HBASE-3694
> URL: https://issues.apache.org/jira/browse/HBASE-3694
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: Hbase-3694[r1085306], Hbase-3694[r1085306]_2.patch, 
> Hbase-3694[r1085306]_3.patch, Hbase-3694[r1085508]_4.patch
>
>
> The problem is that we found the multiput latency to be very high.
> In our case, we have roughly 22 regions in each RS and no flushes happened 
> during these puts.
> After investigation, we believe the root cause is the function 
> getGlobalMemStoreSize, which checks the high water mark of the memstore. 
> When we instrumented the code with some metrics, this function accounted 
> for almost 40% of the total multiput execution time.
> The actual percentage may be even higher. The execution time is spent on 
> synchronization contention.
> One solution is to keep a static variable in HRegion that tracks the global 
> MemStore size instead of recalculating it every time.
> Why use a static variable?
> Since all the HRegion objects in the same JVM share the same memory heap, 
> they need to share fate as well.
> The static variable, globalMemStoreSize, naturally reflects the total memory 
> usage in this shared heap for the JVM.
> If multiple RS instances need to run in the same JVM, they still need only 
> one globalMemStoreSize.
> If multiple RS instances run on different JVMs, everything is fine.
> After this change, in our case, the average multiput latency decreased from 
> 60ms to 10ms.
> I will submit a patch based on the current trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3694) high multiput latency due to checking global mem store size in a synchronized function

2011-03-25 Thread Liyin Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011449#comment-13011449
 ] 

Liyin Tang commented on HBASE-3694:
---

The internal accounting makes sense. I just think MemoryAccountingManager is 
too specific.
We need something more general that we can reuse in the future, 
e.g. RegionServerAccountingManager.

Thoughts?
Liyin

> high multiput latency due to checking global mem store size in a synchronized 
> function
> --
>
> Key: HBASE-3694
> URL: https://issues.apache.org/jira/browse/HBASE-3694
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: Hbase-3694[r1085306], Hbase-3694[r1085306]_2.patch, 
> Hbase-3694[r1085306]_3.patch, Hbase-3694[r1085508]_4.patch
>
>
> The problem is that we found the multiput latency to be very high.
> In our case, we have roughly 22 regions in each RS and no flushes happened 
> during these puts.
> After investigation, we believe the root cause is the function 
> getGlobalMemStoreSize, which checks the high water mark of the memstore. 
> When we instrumented the code with some metrics, this function accounted 
> for almost 40% of the total multiput execution time.
> The actual percentage may be even higher. The execution time is spent on 
> synchronization contention.
> One solution is to keep a static variable in HRegion that tracks the global 
> MemStore size instead of recalculating it every time.
> Why use a static variable?
> Since all the HRegion objects in the same JVM share the same memory heap, 
> they need to share fate as well.
> The static variable, globalMemStoreSize, naturally reflects the total memory 
> usage in this shared heap for the JVM.
> If multiple RS instances need to run in the same JVM, they still need only 
> one globalMemStoreSize.
> If multiple RS instances run on different JVMs, everything is fine.
> After this change, in our case, the average multiput latency decreased from 
> 60ms to 10ms.
> I will submit a patch based on the current trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3694) high multiput latency due to checking global mem store size in a synchronized function

2011-03-25 Thread Jonathan Gray (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011452#comment-13011452
 ] 

Jonathan Gray commented on HBASE-3694:
--

Do we really want to put things like this into RegionServerMetrics?  That class 
is a mess and is currently only used for the publishing of our metrics (not 
used for internal state tracking).  And we should avoid the hadoop Metrics* 
classes like the plague... heavily synchronized and generally confusing.

My vote would be to add a new class, maybe {{RegionServerHeapManager}} or 
something like that... might be a good opportunity to clean up and centralize 
the code related to that.  But it could just hold this one AtomicLong for now.  
I agree that adding a new interface method just for the long is not ideal since 
it buys us nothing down the road.  Better to add something new that we can use 
later.
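For illustration, a minimal sketch of that kind of class (the name RegionServerHeapManager comes from the comment above; the method names and shape are assumptions):

{code}
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: a single place that owns heap-related accounting, holding just
// the global memstore counter for now, with room to absorb related code later.
public class RegionServerHeapManager {
  private final AtomicLong globalMemstoreSize = new AtomicLong(0);

  /** Adjust the global memstore size by delta bytes (delta may be negative). */
  public long addAndGetGlobalMemstoreSize(long delta) {
    return globalMemstoreSize.addAndGet(delta);
  }

  /** @return the current global memstore size in bytes. */
  public long getGlobalMemstoreSize() {
    return globalMemstoreSize.get();
  }
}
{code}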

> high multiput latency due to checking global mem store size in a synchronized 
> function
> --
>
> Key: HBASE-3694
> URL: https://issues.apache.org/jira/browse/HBASE-3694
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: Hbase-3694[r1085306], Hbase-3694[r1085306]_2.patch, 
> Hbase-3694[r1085306]_3.patch, Hbase-3694[r1085508]_4.patch
>
>
> The problem is that we found the multiput latency to be very high.
> In our case, we have roughly 22 regions in each RS and no flushes happened 
> during these puts.
> After investigation, we believe the root cause is the function 
> getGlobalMemStoreSize, which checks the high water mark of the memstore. 
> When we instrumented the code with some metrics, this function accounted 
> for almost 40% of the total multiput execution time.
> The actual percentage may be even higher. The execution time is spent on 
> synchronization contention.
> One solution is to keep a static variable in HRegion that tracks the global 
> MemStore size instead of recalculating it every time.
> Why use a static variable?
> Since all the HRegion objects in the same JVM share the same memory heap, 
> they need to share fate as well.
> The static variable, globalMemStoreSize, naturally reflects the total memory 
> usage in this shared heap for the JVM.
> If multiple RS instances need to run in the same JVM, they still need only 
> one globalMemStoreSize.
> If multiple RS instances run on different JVMs, everything is fine.
> After this change, in our case, the average multiput latency decreased from 
> 60ms to 10ms.
> I will submit a patch based on the current trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3694) high multiput latency due to checking global mem store size in a synchronized function

2011-03-25 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011457#comment-13011457
 ] 

Todd Lipcon commented on HBASE-3694:


+1 to jgray's suggestion. Please please please let's not conflate metrics and 
something that is crucial to correct operation.

In terms of overall design, I would love to see RegionServerServices evolve 
into something like an IOC container - it's just used to provide "wiring" 
between the different components that make up a running RS. That makes mocking 
easier and should help with general modularity.

> high multiput latency due to checking global mem store size in a synchronized 
> function
> --
>
> Key: HBASE-3694
> URL: https://issues.apache.org/jira/browse/HBASE-3694
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: Hbase-3694[r1085306], Hbase-3694[r1085306]_2.patch, 
> Hbase-3694[r1085306]_3.patch, Hbase-3694[r1085508]_4.patch
>
>
> The problem is that we found the multiput latency to be very high.
> In our case, we have roughly 22 regions in each RS and no flushes happened 
> during these puts.
> After investigation, we believe the root cause is the function 
> getGlobalMemStoreSize, which checks the high water mark of the memstore. 
> When we instrumented the code with some metrics, this function accounted 
> for almost 40% of the total multiput execution time.
> The actual percentage may be even higher. The execution time is spent on 
> synchronization contention.
> One solution is to keep a static variable in HRegion that tracks the global 
> MemStore size instead of recalculating it every time.
> Why use a static variable?
> Since all the HRegion objects in the same JVM share the same memory heap, 
> they need to share fate as well.
> The static variable, globalMemStoreSize, naturally reflects the total memory 
> usage in this shared heap for the JVM.
> If multiple RS instances need to run in the same JVM, they still need only 
> one globalMemStoreSize.
> If multiple RS instances run on different JVMs, everything is fine.
> After this change, in our case, the average multiput latency decreased from 
> 60ms to 10ms.
> I will submit a patch based on the current trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3694) high multiput latency due to checking global mem store size in a synchronized function

2011-03-25 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011461#comment-13011461
 ] 

stack commented on HBASE-3694:
--

bq. In terms of overall design, I would love to see RegionServerServices evolve 
into something like an IOC container

Yeah, that's the plan.  Need to keep it macro though.

Args on why this is not 'metrics' are good.  I go along.

Just say no to AtomicLong counters now that we have Cliff Click's counters in 
our CLASSPATH.

bq. The internal accounting makes sense. I just think MemoryAccountingManager 
is too specific. We need something more general that we can reuse in the 
future, e.g. RegionServerAccountingManager.

Agreed.  Should be more than just about memory accounting (and I agree w/ Jon 
that it could be a path out of our hairball HRegionServer class).

For you Liyin and this patch, I think just make a class named 
RegionServerAccounting -- drop Manager I'd say, as that might be a little 
megalomaniacal -- and put just this one counter in it (as per Jon).  Add 
getRegionServerAccounting to the RSS Interface.
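A minimal sketch of what that could look like, assuming a small RegionServerAccounting holder along the lines sketched earlier in this thread (one AtomicLong with add/get methods); none of the names or shapes here are the final patch:

{code}
import java.util.concurrent.atomic.AtomicLong;

// Assumed holder class: same shape as the heap-manager sketch earlier in the
// thread, just renamed RegionServerAccounting and kept to the single counter.
class RegionServerAccounting {
  private final AtomicLong globalMemstoreSize = new AtomicLong(0);

  long addAndGetGlobalMemstoreSize(long delta) {
    return globalMemstoreSize.addAndGet(delta);
  }

  long getGlobalMemstoreSize() {
    return globalMemstoreSize.get();
  }
}

// Sketch of the RSS addition (the rest of the interface is elided).
interface RegionServerServicesSketch {
  /** @return this region server's RegionServerAccounting instance. */
  RegionServerAccounting getRegionServerAccounting();
}

// Example caller, e.g. a flusher checking the high water mark; the names
// rss and globalMemstoreLimit are assumed to exist in the caller's scope:
//
//   long size = rss.getRegionServerAccounting().getGlobalMemstoreSize();
//   if (size > globalMemstoreLimit) {
//     // request flushes until we drop back under the low water mark ...
//   }
{code}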



> high multiput latency due to checking global mem store size in a synchronized 
> function
> --
>
> Key: HBASE-3694
> URL: https://issues.apache.org/jira/browse/HBASE-3694
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: Hbase-3694[r1085306], Hbase-3694[r1085306]_2.patch, 
> Hbase-3694[r1085306]_3.patch, Hbase-3694[r1085508]_4.patch
>
>
> The problem is that we found the multiput latency to be very high.
> In our case, we have roughly 22 regions in each RS and no flushes happened 
> during these puts.
> After investigation, we believe the root cause is the function 
> getGlobalMemStoreSize, which checks the high water mark of the memstore. 
> When we instrumented the code with some metrics, this function accounted 
> for almost 40% of the total multiput execution time.
> The actual percentage may be even higher. The execution time is spent on 
> synchronization contention.
> One solution is to keep a static variable in HRegion that tracks the global 
> MemStore size instead of recalculating it every time.
> Why use a static variable?
> Since all the HRegion objects in the same JVM share the same memory heap, 
> they need to share fate as well.
> The static variable, globalMemStoreSize, naturally reflects the total memory 
> usage in this shared heap for the JVM.
> If multiple RS instances need to run in the same JVM, they still need only 
> one globalMemStoreSize.
> If multiple RS instances run on different JVMs, everything is fine.
> After this change, in our case, the average multiput latency decreased from 
> 60ms to 10ms.
> I will submit a patch based on the current trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3694) high multiput latency due to checking global mem store size in a synchronized function

2011-03-25 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011467#comment-13011467
 ] 

Todd Lipcon commented on HBASE-3694:


Sounds good to me.

> high multiput latency due to checking global mem store size in a synchronized 
> function
> --
>
> Key: HBASE-3694
> URL: https://issues.apache.org/jira/browse/HBASE-3694
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: Hbase-3694[r1085306], Hbase-3694[r1085306]_2.patch, 
> Hbase-3694[r1085306]_3.patch, Hbase-3694[r1085508]_4.patch
>
>
> The problem is that we found the multiput latency to be very high.
> In our case, we have roughly 22 regions in each RS and no flushes happened 
> during these puts.
> After investigation, we believe the root cause is the function 
> getGlobalMemStoreSize, which checks the high water mark of the memstore. 
> When we instrumented the code with some metrics, this function accounted 
> for almost 40% of the total multiput execution time.
> The actual percentage may be even higher. The execution time is spent on 
> synchronization contention.
> One solution is to keep a static variable in HRegion that tracks the global 
> MemStore size instead of recalculating it every time.
> Why use a static variable?
> Since all the HRegion objects in the same JVM share the same memory heap, 
> they need to share fate as well.
> The static variable, globalMemStoreSize, naturally reflects the total memory 
> usage in this shared heap for the JVM.
> If multiple RS instances need to run in the same JVM, they still need only 
> one globalMemStoreSize.
> If multiple RS instances run on different JVMs, everything is fine.
> After this change, in our case, the average multiput latency decreased from 
> 60ms to 10ms.
> I will submit a patch based on the current trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3655) Revision to HBase book, more examples in data model, more metrics, more performance

2011-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011478#comment-13011478
 ] 

Hudson commented on HBASE-3655:
---

Integrated in HBase-TRUNK #1814 (See 
[https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/])


> Revision to HBase book, more examples in data model, more metrics, more 
> performance 
> 
>
> Key: HBASE-3655
> URL: https://issues.apache.org/jira/browse/HBASE-3655
> Project: HBase
>  Issue Type: Improvement
>Reporter: Doug Meil
>Assignee: Doug Meil
>Priority: Minor
> Attachments: book.xml.patch
>
>
> This is a large-ish change to the HBase book
> - moving the 'data model' section forward in the book (before schema design)
> - adding some short API example snippets in 'data model' that illustrate the 
> points discussed
> - correcting a minor formatting issue in my last patch with pre-creating 
> regions.
> - included a listing of the region-server metrics
> ** however, I would like one of the committers to comment on what some of 
> these are.
> - Expanded MapReduce example
> -- added a few short example snippets
> - Expanded the performance section
> -- some of which was simply referencing other configuration topics brought up 
> elsewhere that are certain to cause people performance problems if 
> mis-configured.
> -- a few client examples of things that can cause performance problems if not 
> used properly.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3654) Weird blocking between getOnlineRegion and createRegionLoad

2011-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011477#comment-13011477
 ] 

Hudson commented on HBASE-3654:
---

Integrated in HBase-TRUNK #1814 (See 
[https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/])
HBASE-3654 Weird blocking between getOnlineRegion and createRegionLoad; 
Added more to Subbu and Ted extras


> Weird blocking between getOnlineRegion and createRegionLoad
> ---
>
> Key: HBASE-3654
> URL: https://issues.apache.org/jira/browse/HBASE-3654
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.1
>Reporter: Jean-Daniel Cryans
>Assignee: Subbu M Iyer
>Priority: Blocker
> Fix For: 0.90.2
>
> Attachments: ConcurrentHM, ConcurrentSKLM, CopyOnWrite, 
> HBASE-3654-ConcurrentHashMap-RemoveGetSync.patch, 
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_COWAL.patch,
>  
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_COWAL1.patch,
>  
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_ConcurrentHM.patch,
>  
> HBASE-3654_Weird_blocking_getOnlineRegions_and_createServerLoad_-_ConcurrentHM1.patch,
>  TestOnlineRegions.java, hashmap
>
>
> Saw this when debugging something else:
> {code}
> "regionserver60020" prio=10 tid=0x7f538c1c nid=0x4c7 runnable 
> [0x7f53931da000]
>java.lang.Thread.State: RUNNABLE
>   at 
> org.apache.hadoop.hbase.regionserver.Store.getStorefilesIndexSize(Store.java:1380)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.createRegionLoad(HRegionServer.java:916)
>   - locked <0x000672aa0a00> (a 
> java.util.concurrent.ConcurrentSkipListMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.buildServerLoad(HRegionServer.java:767)
>   - locked <0x000656f62710> (a java.util.HashMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:722)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:591)
>   at java.lang.Thread.run(Thread.java:662)
> "IPC Reader 9 on port 60020" prio=10 tid=0x7f538c1be000 nid=0x4c6 waiting 
> for monitor entry [0x7f53932db000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getFromOnlineRegions(HRegionServer.java:2295)
>   - waiting to lock <0x000656f62710> (a java.util.HashMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getOnlineRegion(HRegionServer.java:2307)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2333)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.isMetaRegion(HRegionServer.java:379)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.apply(HRegionServer.java:422)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.apply(HRegionServer.java:361)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer.getQosLevel(HBaseServer.java:1126)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:982)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:946)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
>   - locked <0x000656e60068> (a 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>   at java.lang.Thread.run(Thread.java:662)
> ...
> "IPC Reader 0 on port 60020" prio=10 tid=0x7f538c08b000 nid=0x4bd waiting 
> for monitor entry [0x7f5393be4000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getFromOnlineRegions(HRegionServer.java:2295)
>   - waiting to lock <0x000656f62710> (a java.util.HashMap)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getOnlineRegion(HRegionServer.java:2307)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2333)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.isMetaRegion(HRegionServer.java:379)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.apply(HRegionServer.java:422)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$QosFunction.apply(HRegionServer.java:361)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer.getQosLevel(HBaseSe

[jira] [Commented] (HBASE-3669) Region in PENDING_OPEN keeps being bounced between RS and master

2011-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011481#comment-13011481
 ] 

Hudson commented on HBASE-3669:
---

Integrated in HBase-TRUNK #1814 (See 
[https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/])


> Region in PENDING_OPEN keeps being bounced between RS and master
> 
>
> Key: HBASE-3669
> URL: https://issues.apache.org/jira/browse/HBASE-3669
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.1
>Reporter: Jean-Daniel Cryans
>Priority: Critical
> Fix For: 0.90.3, 0.92.0
>
> Attachments: HBASE-3669-debug-v1.patch
>
>
> After going crazy killing region servers after HBASE-3668, most of the 
> cluster recovered except for 3 regions that kept being refused by the region 
> servers.
> On the master I would see:
> {code}
> 2011-03-17 22:23:14,828 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out:  
> supr_rss_items,ea0a3ac6c8779dab:872333599:ed1a7ad00f076fd98fcd3adcd98b62c6,1285707378709.f11849557c64c4efdbe0498f3fe97a21.
>  state=PENDING_OPEN, ts=1300400554826
> 2011-03-17 22:23:14,828 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_OPEN for too long, reassigning 
> region=supr_rss_items,ea0a3ac6c8779dab:872333599:ed1a7ad00f076fd98fcd3adcd98b62c6,1285707378709.f11849557c64c4efdbe0498f3fe97a21.
> 2011-03-17 22:23:14,828 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
> was=supr_rss_items,ea0a3ac6c8779dab:872333599:ed1a7ad00f076fd98fcd3adcd98b62c6,1285707378709.f11849557c64c4efdbe0498f3fe97a21.
>  state=PENDING_OPEN, ts=1300400554826
> 2011-03-17 22:23:14,828 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan 
> was found (or we are ignoring an existing plan) for 
> supr_rss_items,ea0a3ac6c8779dab:872333599:ed1a7ad00f076fd98fcd3adcd98b62c6,1285707378709.f11849557c64c4efdbe0498f3fe97a21.
>  so generated a random one; 
> hri=supr_rss_items,ea0a3ac6c8779dab:872333599:ed1a7ad00f076fd98fcd3adcd98b62c6,1285707378709.f11849557c64c4efdbe0498f3fe97a21.,
>  src=, dest=sv2borg171,60020,1300399357135; 17 (online=17, exclude=null) 
> available servers
> 2011-03-17 22:23:14,828 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
> supr_rss_items,ea0a3ac6c8779dab:872333599:ed1a7ad00f076fd98fcd3adcd98b62c6,1285707378709.f11849557c64c4efdbe0498f3fe97a21.
>  to sv2borg171,60020,1300399357135
> {code}
> Then on the region server:
> {code}
> 2011-03-17 22:23:14,829 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> regionserver:60020-0x22d627c142707d2 Attempting to transition node 
> f11849557c64c4efdbe0498f3fe97a21 from M_ZK_REGION_OFFLINE to 
> RS_ZK_REGION_OPENING
> 2011-03-17 22:23:14,832 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
> regionserver:60020-0x22d627c142707d2 Retrieved 166 byte(s) of data from znode 
> /hbase/unassigned/f11849557c64c4efdbe0498f3fe97a21; 
> data=region=supr_rss_items,ea0a3ac6c8779dab:872333599:ed1a7ad00f076fd98fcd3adcd98b62c6,1285707378709.f11849557c64c4efdbe0498f3fe97a21.,
>  server=sv2borg180,60020,1300384550966, state=RS_ZK_REGION_OPENING
> 2011-03-17 22:23:14,832 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> regionserver:60020-0x22d627c142707d2 Attempt to transition the unassigned 
> node for f11849557c64c4efdbe0498f3fe97a21 from M_ZK_REGION_OFFLINE to 
> RS_ZK_REGION_OPENING failed, the node existed but was in the state 
> RS_ZK_REGION_OPENING
> 2011-03-17 22:23:14,832 WARN 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed 
> transition from OFFLINE to OPENING for region=f11849557c64c4efdbe0498f3fe97a21
> {code}
> I'm not sure I fully understand what was going on... the master was supposed 
> to OFFLINE the znode, but then that's not what the region server was seeing? 
> In any case, I was able to recover by doing a force unassign for each region 
> and then assign.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3474) HFileOutputFormat to use column family's compression algorithm

2011-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011479#comment-13011479
 ] 

Hudson commented on HBASE-3474:
---

Integrated in HBase-TRUNK #1814 (See 
[https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/])


> HFileOutputFormat to use column family's compression algorithm
> --
>
> Key: HBASE-3474
> URL: https://issues.apache.org/jira/browse/HBASE-3474
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Affects Versions: 0.92.0
> Environment: All
>Reporter: Ashish Shinde
> Fix For: 0.92.0
>
> Attachments: hbase-3474.txt, hbase-3474.txt, patch3474.txt, 
> patch3474.txt, patch3474.txt
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> HFileOutputFormat currently creates HFile writers using a compression 
> algorithm set via the configuration "hbase.hregion.max.filesize", with a 
> default of no compression. The code does not take into account the 
> compression algorithm configured for the table's column family.  As a result, 
> bulk uploaded tables are not compressed until a major compaction is run on 
> them. This could be fixed by using the column family descriptors while 
> creating HFile writers.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3541) REST Multi Gets

2011-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011480#comment-13011480
 ] 

Hudson commented on HBASE-3541:
---

Integrated in HBase-TRUNK #1814 (See 
[https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/])


> REST Multi Gets
> ---
>
> Key: HBASE-3541
> URL: https://issues.apache.org/jira/browse/HBASE-3541
> Project: HBase
>  Issue Type: Improvement
>  Components: rest
>Reporter: Elliott Clark
>Assignee: Elliott Clark
>Priority: Minor
> Fix For: 0.92.0
>
> Attachments: HBASE-3541_1.patch, HBASE-3541_3.patch, 
> HBASE-3541_5.patch, multi_get_0.patch
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Users currently using the REST interface do not have a way to ask for 
> multiple rows within one http call.
> For my use case I want to get a set of rows whose keys I know beforehand.  
> They are a very small percentage of my table and may not be contiguous, so 
> the scanner is not the right fit for me.  Currently the HTTP overhead is the 
> largest percentage of my processing time.
> Ideally I'd like to create a patch that would act very similar to:
> GET /table/?row[]="rowkey"&row[]="rowkey_two"
> HTTP/1.1 200 OK
> { 
>"Rows":[ << Array of results equivalent to a single get >>]
> }
> This should be pretty backward compatible, as it's just making the row keys 
> into query string parameters.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3660) HMaster will exit when starting with stale data in cached locations such as -ROOT- or .META.

2011-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011485#comment-13011485
 ] 

Hudson commented on HBASE-3660:
---

Integrated in HBase-TRUNK #1814 (See 
[https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/])


> HMaster will exit when starting with stale data in cached locations such as 
> -ROOT- or .META.
> 
>
> Key: HBASE-3660
> URL: https://issues.apache.org/jira/browse/HBASE-3660
> Project: HBase
>  Issue Type: Bug
>  Components: master, regionserver
>Affects Versions: 0.90.1
>Reporter: Cosmin Lehene
>Priority: Critical
> Fix For: 0.90.2
>
> Attachments: 3660.txt, HBASE-3660.patch
>
>
> later edit: I've mixed up two issues here. The main problem is that a client 
> (that could be HMaster) will read stale data from -ROOT- or .META. and not 
> deal correctly with the raised exceptions. 
> I noticed this when the IP on my machine changed (it's even easier to 
> detect when LZO doesn't work).
> The master loads .META. successfully and then starts assigning regions.
> However, LZO doesn't work, so HRegionServer can't open the regions. 
> A client attempts to get data from a table, so it reads the location from 
> .META. but goes to a totally different server (the old value in .META.).
> This could happen without the LZO story too.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3681) Check the sloppiness of the region load before balancing

2011-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011491#comment-13011491
 ] 

Hudson commented on HBASE-3681:
---

Integrated in HBase-TRUNK #1814 (See 
[https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/])


> Check the sloppiness of the region load before balancing
> 
>
> Key: HBASE-3681
> URL: https://issues.apache.org/jira/browse/HBASE-3681
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 0.90.1
>Reporter: Jean-Daniel Cryans
>Assignee: Ted Yu
> Fix For: 0.90.2
>
> Attachments: hbase-3681-v2.txt, hbase-3681-v3.txt, hbase-3681.txt
>
>
> Per our discussion at the hackathon today, it seems that it would be more 
> helpful to add a sloppiness check before doing the normal balancing.
> The current situation is that the balancer always tries to get the region 
> load even, meaning that there can be some very frequent region movement.
> Setting the balancer to run less often (like every 4 hours) isn't much better 
> since the load could get out of whack easily.
> This is why running the normal balancer frequently, but first checking for 
> some sloppiness in the region load across the RS, seems like a more viable 
> option.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3497) TableMapReduceUtil.initTableReducerJob broken due to setConf method in TableOutputFormat

2011-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011483#comment-13011483
 ] 

Hudson commented on HBASE-3497:
---

Integrated in HBase-TRUNK #1814 (See 
[https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/])


> TableMapReduceUtil.initTableReducerJob broken due to setConf method in 
> TableOutputFormat
> 
>
> Key: HBASE-3497
> URL: https://issues.apache.org/jira/browse/HBASE-3497
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.0
>Reporter: Amandeep Khurana
>Assignee: Jean-Daniel Cryans
> Fix For: 0.90.2
>
> Attachments: HBASE-3497.patch
>
>
> setConf() method in TableOutputFormat gets called and it replaces the 
> hbase.zookeeper.quorum address in the job conf xml when you run a CopyTable 
> job from one cluster to another. The conf gets set to the peer.addr that is 
> specified, which makes the job read and write from/to the peer cluster 
> instead of reading from the original cluster and writing to the peer.
> Possibly caused due to the change in 
> https://issues.apache.org/jira/browse/HBASE-3111

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3686) ClientScanner skips too many rows on recovery if using scanner caching

2011-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011484#comment-13011484
 ] 

Hudson commented on HBASE-3686:
---

Integrated in HBase-TRUNK #1814 (See 
[https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/])
HBASE-3686 ClientScanner skips too many rows on recovery if using scanner 
caching


> ClientScanner skips too many rows on recovery if using scanner caching
> --
>
> Key: HBASE-3686
> URL: https://issues.apache.org/jira/browse/HBASE-3686
> Project: HBase
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.89.20100924, 0.90.1
>Reporter: Sean Sechrist
>Assignee: Sean Sechrist
>Priority: Minor
> Attachments: 3686.patch
>
>
> This can cause rows to be lost from a scan.
> See this thread where the issue was brought up: 
> http://search-hadoop.com/m/xITBQ136xGJ1
> If hbase.regionserver.lease.period is higher on the client than the server we 
> can get this series of events: 
> 1. Client is scanning along happily, and does something slow.
> 2. Scanner times out on region server
> 3. Client calls HTable.ClientScanner.next()
> 4. The region server throws an UnknownScannerException
> 5. Client catches the exception and sees that it has not been longer than 
> its hbase.regionserver.lease.period config, so it doesn't throw a 
> ScannerTimeoutException. Instead, it treats it like an NSRE.
> Right now the workaround is to make sure the configs are consistent. 
> A possible fix would be to use whatever the region server's scanner timeout 
> is, rather than the local one.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3688) Setters of class HTableDescriptor do not work properly

2011-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011487#comment-13011487
 ] 

Hudson commented on HBASE-3688:
---

Integrated in HBase-TRUNK #1814 (See 
[https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/])


> Setters of class HTableDescriptor do not work properly
> --
>
> Key: HBASE-3688
> URL: https://issues.apache.org/jira/browse/HBASE-3688
> Project: HBase
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.90.1
> Environment: Any
>Reporter: Adrián Romero
> Fix For: 0.92.0
>
> Attachments: HTableDescriptor-fix.diff, HTableDescriptorTest.java
>
>
> The setters "setName()" and "setDeferredLogFlush()" do not work properly. 
> For example, after calling setName() the internal property nameAsString is 
> not updated, so if you then call getNameAsString() you get the previous value 
> and not the new one. Something similar happens with the setter 
> "setDeferredLogFlush()".

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3658) Alert when heap is over committed

2011-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011489#comment-13011489
 ] 

Hudson commented on HBASE-3658:
---

Integrated in HBase-TRUNK #1814 (See 
[https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/])


> Alert when heap is over committed
> -
>
> Key: HBASE-3658
> URL: https://issues.apache.org/jira/browse/HBASE-3658
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 0.90.1
>Reporter: Jean-Daniel Cryans
>Assignee: Subbu M Iyer
> Fix For: 0.92.0
>
> Attachments: HBASE-3658_Alert_when_heap_is_over_committed.patch
>
>
> Something I just witnessed: the block cache setting was at 70% but the max 
> global memstore size was at the default of 40%, meaning that 110% of the heap 
> can potentially be "assigned", and then you still need more heap to do stuff 
> like flushing and compacting.
> We should run a configuration check that alerts the user when that happens 
> and maybe even refuse to start.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3622) Deadlock in HBaseServer (JVM bug?)

2011-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011482#comment-13011482
 ] 

Hudson commented on HBASE-3622:
---

Integrated in HBase-TRUNK #1814 (See 
[https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/])


> Deadlock in HBaseServer (JVM bug?)
> --
>
> Key: HBASE-3622
> URL: https://issues.apache.org/jira/browse/HBASE-3622
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.1
>Reporter: Jean-Daniel Cryans
>Priority: Critical
> Fix For: 0.92.0
>
> Attachments: HBASE-3622.patch
>
>
> On Dmitriy's cluster:
> {code}
> "IPC Reader 0 on port 60020" prio=10 tid=0x2aacb4a82800 nid=0x3a72 
> waiting on condition [0x429ba000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x2aaabf5fa6d0> (a 
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:778)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1114)
> at 
> java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:186)
> at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262)
> at 
> java.util.concurrent.LinkedBlockingQueue.signalNotEmpty(LinkedBlockingQueue.java:103)
> at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:267)
> at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:985)
> at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:946)
> at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
> at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
> - locked <0x2aaabf580fb0> (a 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> ...
> "IPC Server handler 29 on 60020" daemon prio=10 tid=0x2aacbc163800 
> nid=0x3acc waiting on condition [0x462f3000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x2aaabf5e3800> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1025)
> "IPC Server handler 28 on 60020" daemon prio=10 tid=0x2aacbc161800 
> nid=0x3acb waiting on condition [0x461f2000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x2aaabf5e3800> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1025
> ...
> {code}
> This region server stayed in this state for hours. The reader is waiting to 
> put and the handlers are waiting to take, and they wait on different lock 
> ids. It reminds me of the UseMembar issue where the JVM sometimes misses 
> notifying waiters. In any case, that RS needed to be closed in order to get 
> out of that state. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3696) No durability when running with LocalFileSystem

2011-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011488#comment-13011488
 ] 

Hudson commented on HBASE-3696:
---

Integrated in HBase-TRUNK #1814 (See 
[https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/])
HBASE-3696 isMajorCompaction() check triggers lots of listStatus DFS RPC 
calls from HBase


> No durability when running with LocalFileSystem
> ---
>
> Key: HBASE-3696
> URL: https://issues.apache.org/jira/browse/HBASE-3696
> Project: HBase
>  Issue Type: Bug
>  Components: documentation, wal
>Reporter: Todd Lipcon
>
> LocalFileSystem in Hadoop doesn't currently implement sync(), so when we're 
> running in that case, we don't have any durability. This isn't a huge deal 
> since it isn't a realistic deployment scenario, but it's probably worth 
> documenting. It caused some confusion for a user when a table disappeared 
> after killing a standalone instance that was hosting its data in the local FS.
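For illustration only, a minimal sketch of the kind of startup warning the documentation fix could point at (the check and the log message are assumptions, not the committed change): if hbase.rootdir resolves to LocalFileSystem, WAL sync() is effectively a no-op.

{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LocalFsDurabilityCheck {
  private static final Log LOG = LogFactory.getLog(LocalFsDurabilityCheck.class);

  public static void warnIfNotDurable(Configuration conf) throws java.io.IOException {
    Path rootDir = new Path(conf.get("hbase.rootdir"));
    FileSystem fs = rootDir.getFileSystem(conf);
    if (fs instanceof LocalFileSystem) {
      // LocalFileSystem does not implement sync(), so WAL edits can be lost on a crash.
      LOG.warn("hbase.rootdir " + rootDir + " is on LocalFileSystem; "
          + "edits are NOT durable across a process kill");
    }
  }

  public static void main(String[] args) throws java.io.IOException {
    warnIfNotDurable(HBaseConfiguration.create());
  }
}
{code}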

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3673) Reduce HTable Pool Contention Using Concurrent Collections

2011-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011486#comment-13011486
 ] 

Hudson commented on HBASE-3673:
---

Integrated in HBase-TRUNK #1814 (See 
[https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/])


> Reduce HTable Pool Contention Using Concurrent Collections
> --
>
> Key: HBASE-3673
> URL: https://issues.apache.org/jira/browse/HBASE-3673
> Project: HBase
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 0.90.1, 0.90.2
>Reporter: Karthick Sankarachary
>Assignee: Karthick Sankarachary
>Priority: Minor
> Fix For: 0.92.0
>
> Attachments: HBASE-3673-TESTCASE.patch, HBASE-3673.patch
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> In the case of medium-to-large sized HTable pools, the amount of time the 
> client spends blocking on the underlying map and queue data structures turns 
> out to be quite significant. Using an efficient wait-free implementation of 
> maps and queues might serve to reduce the contention on the pool. In 
> particular, I was wondering if we should replace the synchronized map with a 
> concurrent hash map, and linked list with a concurrent linked queue.
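For what it's worth, a minimal sketch of the proposed direction (class and method names are illustrative, not the actual HTablePool patch): per-key queues held in lock-free collections, so borrowing and returning never serialize on a single pool-wide lock.

{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;

public class ConcurrentPoolSketch<T> {
  private final ConcurrentMap<String, ConcurrentLinkedQueue<T>> pools =
      new ConcurrentHashMap<String, ConcurrentLinkedQueue<T>>();

  /** Borrow an instance for the given key, or null if the pool is empty. */
  public T borrow(String key) {
    ConcurrentLinkedQueue<T> q = pools.get(key);
    return q == null ? null : q.poll();   // poll() is non-blocking and thread-safe
  }

  /** Return an instance to the pool for later reuse. */
  public void giveBack(String key, T resource) {
    ConcurrentLinkedQueue<T> q = pools.get(key);
    if (q == null) {
      // putIfAbsent avoids a race where two threads create the queue at once
      ConcurrentLinkedQueue<T> fresh = new ConcurrentLinkedQueue<T>();
      q = pools.putIfAbsent(key, fresh);
      if (q == null) {
        q = fresh;
      }
    }
    q.offer(resource);
  }
}
{code}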

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3627) NPE in EventHandler when region already reassigned

2011-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011490#comment-13011490
 ] 

Hudson commented on HBASE-3627:
---

Integrated in HBase-TRUNK #1814 (See 
[https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/])


> NPE in EventHandler when region already reassigned
> --
>
> Key: HBASE-3627
> URL: https://issues.apache.org/jira/browse/HBASE-3627
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.1
>Reporter: Jean-Daniel Cryans
>Assignee: stack
>Priority: Critical
> Fix For: 0.90.2
>
> Attachments: 3627.txt
>
>
> When a region takes too long to open, it will try to update the unassigned 
> znode and will fail on an ugly NPE like this:
> {quote}
> DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> regionserver:60020-0x22dc571dde04ca7 Attempting to transition node 
> 0519dc3b62a569347526875048c37faa from RS_ZK_REGION_OPENING to 
> RS_ZK_REGION_OPENING
> DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
> regionserver:60020-0x22dc571dde04ca7 Unable to get data of znode 
> /hbase/unassigned/0519dc3b62a569347526875048c37faa because node does not 
> exist (not necessarily an error)
> ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while 
> processing event M_RS_OPEN_REGION
> java.lang.NullPointerException
>   at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
>   at 
> org.apache.hadoop.hbase.executor.RegionTransitionData.fromBytes(RegionTransitionData.java:198)
>   at 
> org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:672)
>   at 
> org.apache.hadoop.hbase.zookeeper.ZKAssign.retransitionNodeOpening(ZKAssign.java:585)
>   at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.tickleOpening(OpenRegionHandler.java:322)
>   at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:97)
>   at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> {quote}
> I think the region server in this case should be closing the region ASAP.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3683) NMapInputFormat should use a different config param for number of maps

2011-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011493#comment-13011493
 ] 

Hudson commented on HBASE-3683:
---

Integrated in HBase-TRUNK #1814 (See 
[https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/])


> NMapInputFormat should use a different config param for number of maps
> --
>
> Key: HBASE-3683
> URL: https://issues.apache.org/jira/browse/HBASE-3683
> Project: HBase
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 0.90.2
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Minor
> Fix For: 0.90.2
>
> Attachments: hbase-3683.txt
>
>
> Annoyingly, the MR local runner drops the mapred.map.tasks parameter before 
> running a job. Should use a different config parameter so we can specify it.
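A minimal sketch of what that could look like (the config key name here is an assumption, not necessarily what the attached patch uses):

{code}
import org.apache.hadoop.conf.Configuration;

public class NMapConfigSketch {
  /** Dedicated key so the local runner's handling of mapred.map.tasks is irrelevant. */
  public static final String NUM_MAPS_KEY = "nmapinputformat.num.maps";

  public static void setNumMapTasks(Configuration conf, int numMaps) {
    conf.setInt(NUM_MAPS_KEY, numMaps);
  }

  public static int getNumMapTasks(Configuration conf) {
    return conf.getInt(NUM_MAPS_KEY, 1);
  }
}
{code}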

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3052) Add ability to have multiple ZK servers in a quorum in MiniZooKeeperCluster for test writing

2011-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011494#comment-13011494
 ] 

Hudson commented on HBASE-3052:
---

Integrated in HBase-TRUNK #1814 (See 
[https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/])


> Add ability to have multiple ZK servers in a quorum in MiniZooKeeperCluster 
> for test writing
> 
>
> Key: HBASE-3052
> URL: https://issues.apache.org/jira/browse/HBASE-3052
> Project: HBase
>  Issue Type: Improvement
>  Components: test, zookeeper
>Reporter: Jonathan Gray
>Assignee: Liyin Tang
>Priority: Minor
> Attachments: HBASE_3052[r1083993].patch, HBASE_3052[r1084033].patch
>
>
> Interesting things can happen when you have a ZK quorum of multiple servers 
> and one of them dies. Testing here on clusters has turned up some bugs in 
> HBase's interaction with ZK.
> Would be good to add the ability to have multiple ZK servers in unit tests 
> and be able to kill them individually.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3666) TestScannerTimeout fails occasionally

2011-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011492#comment-13011492
 ] 

Hudson commented on HBASE-3666:
---

Integrated in HBase-TRUNK #1814 (See 
[https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/])


> TestScannerTimeout fails occasionally
> -
>
> Key: HBASE-3666
> URL: https://issues.apache.org/jira/browse/HBASE-3666
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.1
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: 0.90.2
>
> Attachments: hbase-3666.txt
>
>
> If I loop TestScannerTimeout, it eventually fails with:
> org.apache.hadoop.hbase.regionserver.LeaseException: 
> org.apache.hadoop.hbase.regionserver.LeaseException: lease 
> '-4526340287831625207' does not exist
> at 
> org.apache.hadoop.hbase.regionserver.Leases.cancelLease(Leases.java:209)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1816)
> ...
> at 
> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:83)
> at 
> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:38)
> at 
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1003)
> at 
> org.apache.hadoop.hbase.client.HTable$ClientScanner.next(HTable.java:1103)
> at 
> org.apache.hadoop.hbase.client.HTable$ClientScanner.next(HTable.java:1175)
> at 
> org.apache.hadoop.hbase.client.TestScannerTimeout.test2772(TestScannerTimeout.java:133)
> I think the issue is a race where at the top of the function, the scanner 
> does exist, but by the time it gets to cancelLease, it has timed out.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-1364) [performance] Distributed splitting of regionserver commit logs

2011-03-25 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011499#comment-13011499
 ] 

stack commented on HBASE-1364:
--

@Prakash I posted a review of about 50% of your patch over on rb.  It looks 
great so far.

> [performance] Distributed splitting of regionserver commit logs
> ---
>
> Key: HBASE-1364
> URL: https://issues.apache.org/jira/browse/HBASE-1364
> Project: HBase
>  Issue Type: Improvement
>  Components: coprocessors
>Reporter: stack
>Assignee: Alex Newman
>Priority: Critical
> Fix For: 0.92.0
>
> Attachments: HBASE-1364.patch
>
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> HBASE-1008 has some improvements to our log splitting on regionserver crash; 
> but it needs to run even faster.
> (Below is from HBASE-1008)
> In the Bigtable paper, the split is distributed. If we're going to have 1000 
> logs, we need to distribute or at least multithread the splitting.
> 1. As is, regions starting up expect to find one reconstruction log only. 
> Need to make it so they pick up a bunch of edit logs, and it should be fine 
> that the logs are elsewhere in HDFS, in an output directory written by all 
> split participants, whether multithreaded or a mapreduce-like distributed 
> process (let's write our distributed sort first as an MR job so we learn 
> what's involved; the distributed sort should, as much as possible, use MR 
> framework pieces). On startup, regions go to this directory and pick up the 
> files written by the split participants, deleting and clearing the dir when 
> all have been read in. Making it so regions can take multiple logs as input 
> also makes the split process more robust than the current tenuous process, 
> which loses all edits if it doesn't make it to the end without error.
> 2. Each column family rereads the reconstruction log to find its edits. Need 
> to fix that. The split can sort the edits by column family so each store 
> only reads its own edits.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (HBASE-2418) add support for ZooKeeper authentication

2011-03-25 Thread Eugene Koontz (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koontz reassigned HBASE-2418:


Assignee: Eugene Koontz

> add support for ZooKeeper authentication
> 
>
> Key: HBASE-2418
> URL: https://issues.apache.org/jira/browse/HBASE-2418
> Project: HBase
>  Issue Type: Improvement
>  Components: master, regionserver
>Reporter: Patrick Hunt
>Assignee: Eugene Koontz
>Priority: Critical
>
> Some users may run a ZooKeeper cluster in "multi tenant mode" meaning that 
> more than one client service would
> like to share a single ZooKeeper service instance (cluster). In this case the 
> client services typically want to protect
> their data (ZK znodes) from access by other services (tenants) on the 
> cluster. Say you are running HBase and Solr 
> and Neo4j, or multiple HBase instances, etc... having 
> authentication/authorization on the znodes is important for both 
> security and helping to ensure that services don't interact negatively (touch 
> each other's data).
> Today HBase does not have support for authentication or authorization. This 
> should be added to the HBase clients
> that are accessing the ZK cluster. In general it means calling addAuthInfo 
> once after a session is established:
> http://hadoop.apache.org/zookeeper/docs/current/api/org/apache/zookeeper/ZooKeeper.html#addAuthInfo(java.lang.String,
>  byte[])
> with a user-specific credential; often this is a shared secret or 
> certificate. You may be able to statically configure this
> in some cases (a config string or a file to read from); however, in my case 
> in particular you may need to access it programmatically,
> which adds complexity, as the end user may need to load code into HBase for 
> accessing the credential.
> Secondly you need to specify a non "world" ACL when interacting with znodes 
> (create primarily):
> http://hadoop.apache.org/zookeeper/docs/current/api/org/apache/zookeeper/data/ACL.html
> http://hadoop.apache.org/zookeeper/docs/current/api/org/apache/zookeeper/ZooDefs.html
> Feel free to ping the ZooKeeper team if you have questions. It might also be 
> good to discuss with some 
> potential end users - in particular regarding how the end user can specify 
> the credential.
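To make the two steps concrete, a minimal client-side sketch (the connect string, credential, and znode path are placeholders; this is not an HBase patch): authenticate right after the session is established, then create znodes with a creator-only ACL instead of the default world-readable one.

{code}
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkAuthSketch {
  public static void main(String[] args) throws Exception {
    final CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
      public void process(WatchedEvent event) {
        if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
          connected.countDown();
        }
      }
    });
    connected.await();

    // Shared-secret credential; "digest" is a built-in ZooKeeper auth scheme.
    zk.addAuthInfo("digest", "hbase:secret".getBytes());

    // CREATOR_ALL_ACL grants full access only to the authenticated creator.
    zk.create("/hbase-secure-demo", new byte[0],
        ZooDefs.Ids.CREATOR_ALL_ACL, CreateMode.PERSISTENT);
    zk.close();
  }
}
{code}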

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-3694) high multiput latency due to checking global mem store size in a synchronized function

2011-03-25 Thread Liyin Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liyin Tang updated HBASE-3694:
--

Attachment: Hbase-3694[r1085593]_5.patch

Agreed with stack.
Added a new class: RegionServerAccounting.


> high multiput latency due to checking global mem store size in a synchronized 
> function
> --
>
> Key: HBASE-3694
> URL: https://issues.apache.org/jira/browse/HBASE-3694
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: Hbase-3694[r1085306], Hbase-3694[r1085306]_2.patch, 
> Hbase-3694[r1085306]_3.patch, Hbase-3694[r1085508]_4.patch, 
> Hbase-3694[r1085593]_5.patch
>
>
> The problem is that we found the multiput latency to be very high.
> In our case, we have almost 22 regions in each RS and no flushes happened 
> during these puts.
> After investigation, we believe that the root cause is the function 
> getGlobalMemStoreSize, which checks the high water mark of the memstore. 
> This function takes almost 40% of the total execution time of a multiput 
> when instrumenting some metrics in the code.
> The actual percentage may be even higher. The execution time is spent on 
> synchronization contention.
> One solution is to keep a static var in HRegion holding the global MemStore 
> size instead of recalculating it every time.
> Why use a static variable?
> Since all the HRegion objects in the same JVM share the same memory heap, 
> they need to share fate as well.
> The static variable, globalMemStoreSize, naturally shows the total memory 
> usage in this shared memory heap for this JVM.
> If multiple RS need to run in the same JVM, they still need only one 
> globalMemStoreSize.
> If multiple RS run on different JVMs, everything is fine.
> After the change, in our case, the avg multiput latency decreased from 60ms 
> to 10ms.
> I will submit a patch based on the current trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-3702) Exec throws a npe while writing a method that has a null value argument

2011-03-25 Thread Himanshu Vashishtha (JIRA)
Exec throws a npe while writing a method that has a null value argument
---

 Key: HBASE-3702
 URL: https://issues.apache.org/jira/browse/HBASE-3702
 Project: HBase
  Issue Type: Bug
  Components: coprocessors
Reporter: Himanshu Vashishtha


Exec's write method invokes getClass() on its argument list to find each 
argument's class, which gives an NPE in case the argument is null. There is 
already a parameterClasses array in Invoker (its super class), which is 
populated with correct values (by method.getParameterTypes()). One can use 
this array.
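For illustration, a minimal sketch of the suggestion (the class and field names are made up here, not the real Exec code): write each argument's declared parameter type instead of calling getClass() on a possibly-null value.

{code}
import java.io.DataOutput;
import java.io.IOException;
import java.lang.reflect.Method;

public class ArgWriterSketch {
  private final Object[] values;
  private final Class<?>[] parameterClasses;   // declared types, from method.getParameterTypes()

  public ArgWriterSketch(Method method, Object[] values) {
    this.values = values;
    this.parameterClasses = method.getParameterTypes();
  }

  public void writeArgClasses(DataOutput out) throws IOException {
    for (int i = 0; i < values.length; i++) {
      // values[i].getClass().getName() would NPE when values[i] == null;
      // the declared parameter type is always available.
      out.writeUTF(parameterClasses[i].getName());
    }
  }
}
{code}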

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-3702) Exec throws a npe while writing a method that has a null value argument

2011-03-25 Thread Himanshu Vashishtha (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Himanshu Vashishtha updated HBASE-3702:
---

Description: Exec write method invokes getClass() on its arguments list for 
finding the argument's class, which gives a npe in case the argument is null. 
There is already an parameterClasses array in Invoker (its super class), which 
is populated with correct values (by method.getParameterTypes()). One can use 
this array.  (was: Exec write method invokes getClass() on its arguments list 
for finding the argument's class, which gives a npe in case the argument is 
null. There is already an parameterClasses array in Invoker (its super class), 
which is populated with correct values (by method.getParameterTypes()). One can 
use it this array.)

> Exec throws a npe while writing a method that has a null value argument
> ---
>
> Key: HBASE-3702
> URL: https://issues.apache.org/jira/browse/HBASE-3702
> Project: HBase
>  Issue Type: Bug
>  Components: coprocessors
>Reporter: Himanshu Vashishtha
>
> Exec's write method invokes getClass() on its argument list to find each 
> argument's class, which gives an NPE in case the argument is null. There is 
> already a parameterClasses array in Invoker (its super class), which is 
> populated with correct values (by method.getParameterTypes()). One can use 
> this array.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3694) high multiput latency due to checking global mem store size in a synchronized function

2011-03-25 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011561#comment-13011561
 ] 

stack commented on HBASE-3694:
--

Please do not use HBaseClusterTestCase as the basis for your test. It's been 
deprecated: '* @deprecated Use junit4 and {@link HBaseTestingUtility}'. Sorry 
about that; we should have made sure you got the memo on that one. The 
alternative, HBaseTestingUtility, has cleaner means of creating a multiregion 
table. Fix the copyright on your test -- also, the javadoc is copy/pasted from 
elsewhere -- and in your accounting class. It's 2011! RegionServerAccounting 
needs a bit of class javadoc to say what the class is for. I'd write 'private 
final AtomicLong atomicGlobalMemstoreSize = new AtomicLong(0);' rather than 
waiting to assign in the constructor (no need for a constructor then). I'd 
rename incGlobalMemstoreSize to addAndGetGlobalMemstoreSize, as in AtomicLong, 
and I'd return the current value as AtomicLong does (why not?). I'd also call 
it getAndAddMemstoreSize rather than incMemoryUsage.

Otherwise the patch looks great Liyin.  Thanks for doing this.
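Roughly what that would look like, as a sketch of the suggestion above (not the committed patch): a single AtomicLong replaces the synchronized walk over every region's memstore.

{code}
import java.util.concurrent.atomic.AtomicLong;

public class RegionServerAccounting {
  private final AtomicLong atomicGlobalMemstoreSize = new AtomicLong(0);

  /** @return the global memstore size after adding {@code memStoreSize} bytes */
  public long addAndGetGlobalMemstoreSize(long memStoreSize) {
    return atomicGlobalMemstoreSize.addAndGet(memStoreSize);
  }

  /** @return the current global memstore size for this region server */
  public long getGlobalMemstoreSize() {
    return atomicGlobalMemstoreSize.get();
  }
}
{code}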

> high multiput latency due to checking global mem store size in a synchronized 
> function
> --
>
> Key: HBASE-3694
> URL: https://issues.apache.org/jira/browse/HBASE-3694
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: Hbase-3694[r1085306], Hbase-3694[r1085306]_2.patch, 
> Hbase-3694[r1085306]_3.patch, Hbase-3694[r1085508]_4.patch, 
> Hbase-3694[r1085593]_5.patch
>
>
> The problem is that we found the multiput latency to be very high.
> In our case, we have almost 22 regions in each RS and no flushes happened 
> during these puts.
> After investigation, we believe that the root cause is the function 
> getGlobalMemStoreSize, which checks the high water mark of the memstore. 
> This function takes almost 40% of the total execution time of a multiput 
> when instrumenting some metrics in the code.
> The actual percentage may be even higher. The execution time is spent on 
> synchronization contention.
> One solution is to keep a static var in HRegion holding the global MemStore 
> size instead of recalculating it every time.
> Why use a static variable?
> Since all the HRegion objects in the same JVM share the same memory heap, 
> they need to share fate as well.
> The static variable, globalMemStoreSize, naturally shows the total memory 
> usage in this shared memory heap for this JVM.
> If multiple RS need to run in the same JVM, they still need only one 
> globalMemStoreSize.
> If multiple RS run on different JVMs, everything is fine.
> After the change, in our case, the avg multiput latency decreased from 60ms 
> to 10ms.
> I will submit a patch based on the current trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3694) high multiput latency due to checking global mem store size in a synchronized function

2011-03-25 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011570#comment-13011570
 ] 

Todd Lipcon commented on HBASE-3694:


I don't think we should return the current value from the increment call unless 
it's necessary. For striped counters and such, a "blind" increment can often be 
cheaper than an increment-and-get. Isn't this the case with the Cliff Click 
Counters?
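A toy illustration of the point (not HBase or high-scale-lib code): with a striped counter, a blind add touches only one stripe, while returning the new value forces a read of every stripe.

{code}
import java.util.concurrent.atomic.AtomicLongArray;

public class StripedCounterSketch {
  private final AtomicLongArray stripes = new AtomicLongArray(16);

  public void add(long delta) {
    // Pick a stripe from the calling thread id to spread contention.
    int i = (int) (Thread.currentThread().getId() & (stripes.length() - 1));
    stripes.addAndGet(i, delta);            // cheap: one CAS on one stripe
  }

  public long addAndGet(long delta) {
    add(delta);
    long sum = 0;                           // expensive: read every stripe
    for (int i = 0; i < stripes.length(); i++) {
      sum += stripes.get(i);                // approximate snapshot, not atomic across stripes
    }
    return sum;
  }
}
{code}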

> high multiput latency due to checking global mem store size in a synchronized 
> function
> --
>
> Key: HBASE-3694
> URL: https://issues.apache.org/jira/browse/HBASE-3694
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: Hbase-3694[r1085306], Hbase-3694[r1085306]_2.patch, 
> Hbase-3694[r1085306]_3.patch, Hbase-3694[r1085508]_4.patch, 
> Hbase-3694[r1085593]_5.patch
>
>
> The problem is that we found the multiput latency to be very high.
> In our case, we have almost 22 regions in each RS and no flushes happened 
> during these puts.
> After investigation, we believe that the root cause is the function 
> getGlobalMemStoreSize, which checks the high water mark of the memstore. 
> This function takes almost 40% of the total execution time of a multiput 
> when instrumenting some metrics in the code.
> The actual percentage may be even higher. The execution time is spent on 
> synchronization contention.
> One solution is to keep a static var in HRegion holding the global MemStore 
> size instead of recalculating it every time.
> Why use a static variable?
> Since all the HRegion objects in the same JVM share the same memory heap, 
> they need to share fate as well.
> The static variable, globalMemStoreSize, naturally shows the total memory 
> usage in this shared memory heap for this JVM.
> If multiple RS need to run in the same JVM, they still need only one 
> globalMemStoreSize.
> If multiple RS run on different JVMs, everything is fine.
> After the change, in our case, the avg multiput latency decreased from 60ms 
> to 10ms.
> I will submit a patch based on the current trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-1476) scaling compaction with multiple threads

2011-03-25 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011578#comment-13011578
 ] 

stack commented on HBASE-1476:
--

I did a first-pass review, N, over on RB.  Good stuff.

> scaling compaction with multiple threads
> 
>
> Key: HBASE-1476
> URL: https://issues.apache.org/jira/browse/HBASE-1476
> Project: HBase
>  Issue Type: Improvement
>  Components: regionserver
>Reporter: Billy Pearson
>Assignee: Nicolas Spiegelberg
>  Labels: moved_from_0_20_5
> Fix For: 0.92.0
>
>
> Was thinking we should build in support to handle more than one thread for 
> compactions; this will allow us to keep up with compactions when we get to 
> the point where we store TBs of data and many regions per node.
> Maybe a configurable setting to set how many threads a region server can use 
> for compactions.
> With compression turned on, my compactions are limited by CPU speed; with 
> multiple cores it would be nice to be able to scale compactions to 2 or more 
> cores.
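A minimal sketch of such a setting (the config key and default are assumptions, not what the eventual patch does): size a compaction thread pool from the region server's configuration.

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;

public class CompactionPoolSketch {
  /** Hypothetical config key for the number of compaction threads. */
  public static final String COMPACTION_THREADS_KEY = "hbase.regionserver.compaction.threads";

  public static ExecutorService createCompactionPool(Configuration conf) {
    int threads = conf.getInt(COMPACTION_THREADS_KEY, 1);  // default: today's single thread
    return Executors.newFixedThreadPool(threads);
  }
}
{code}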

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira