[jira] Assigned: (HBASE-3524) NPE from CompactionChecker

2011-02-10 Thread James Kennedy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Kennedy reassigned HBASE-3524:


Assignee: James Kennedy

> NPE from CompactionChecker
> --
>
> Key: HBASE-3524
> URL: https://issues.apache.org/jira/browse/HBASE-3524
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Affects Versions: 0.90.0
> Reporter: James Kennedy
> Assignee: James Kennedy
> Priority: Blocker
> Fix For: 0.90.1, 0.90.2
>
>
> I recently updated production data to use HBase 0.90.0.
> Now I'm periodically seeing:
> [10/02/11 17:23:27] 30076066 [mpactionChecker] ERROR nServer$MajorCompactionChecker  - Caught exception
> java.lang.NullPointerException
>   at org.apache.hadoop.hbase.regionserver.Store.isMajorCompaction(Store.java:832)
>   at org.apache.hadoop.hbase.regionserver.Store.isMajorCompaction(Store.java:810)
>   at org.apache.hadoop.hbase.regionserver.HRegion.isMajorCompaction(HRegion.java:2800)
>   at org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker.chore(HRegionServer.java:1047)
>   at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
> The only negative effect is that this is preventing compactions from 
> happening. But that is pretty serious, and it might be a sign of data 
> corruption?
> Maybe it's just my data, but this task should at least involve improving the 
> handling to catch the NPE and still iterate through the other onlineRegions 
> that might compact without error.  The MajorCompactionChecker.chore() method 
> only catches IOExceptions and so this NPE breaks out of that loop. 

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (HBASE-3524) NPE from CompactionChecker

2011-02-10 Thread ryan rawson (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ryan rawson updated HBASE-3524:
---

  Component/s: regionserver
 Priority: Blocker  (was: Major)
Affects Version/s: 0.90.0
Fix Version/s: 0.90.1





[jira] Commented: (HBASE-3524) NPE from CompactionChecker

2011-02-10 Thread James Kennedy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993365#comment-12993365
 ] 

James Kennedy commented on HBASE-3524:
--

> What is the creation time of your empty file? When is it from? Maybe it's old?

Let me re-reproduce these issues from scratch tomorrow morning.





[jira] Commented: (HBASE-3524) NPE from CompactionChecker

2011-02-10 Thread ryan rawson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993343#comment-12993343
 ] 

ryan rawson commented on HBASE-3524:


Compaction is "optional", meaning that if it fails no data is lost, so you
should probably be fine.

Older versions of the code did not write out time tracker data, and
that is why your older files were giving you NPEs.






[jira] Commented: (HBASE-3524) NPE from CompactionChecker

2011-02-10 Thread James Kennedy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993341#comment-12993341
 ] 

James Kennedy commented on HBASE-3524:
--

This patch stops the NPE and allows compaction checking to follow 
through.

Furthermore, I added a log output line that indicates when and which stores have 
.timeRangeTracker == null when encountered.  It seemed that 7 or 8 tables (out 
of 50) had this problem, and when I forced their major compaction from the hbase 
shell they stopped reporting the error.

So it looks like the major compactions created new stores with timeRangeTracker 
set properly.

I'm still concerned though about how this happened in the first place and I 
need to do some thorough testing of the data to ensure nothing was lost.

Ryan, in your opinion, is this data likely to have survived without 
corruption?

And thanks for your speedy help.





[jira] Commented: (HBASE-3524) NPE from CompactionChecker

2011-02-10 Thread ryan rawson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993328#comment-12993328
 ] 

ryan rawson commented on HBASE-3524:


try this patch:

diff --git a/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java b/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
index d7e3ce3..519111a 100644
--- a/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
+++ b/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
@@ -829,7 +829,10 @@ public class Store implements HeapSize {
       if (filesToCompact.size() == 1) {
         // Single file
         StoreFile sf = filesToCompact.get(0);
-        long oldest = now - sf.getReader().timeRangeTracker.minimumTimestamp;
+        long oldest =
+            (sf.getReader().timeRangeTracker == null) ?
+                Long.MIN_VALUE :
+                now - sf.getReader().timeRangeTracker.minimumTimestamp;
         if (sf.isMajorCompaction() &&
             (this.ttl == HConstants.FOREVER || oldest < this.ttl)) {
           if (LOG.isDebugEnabled()) {

no test yet! doh!
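
Since the patch above comes without a test, here is a rough sketch of how the guarded expression could be exercised in isolation. TimeRange and OldestAge are hypothetical stand-ins written for this note, not HBase classes, and the "forever" sentinel is passed in as a parameter rather than taken from HConstants. It only mirrors the two expressions visible in the diff.

```java
/** Simplified stand-in for the tracker; only the one field the diff reads. */
final class TimeRange {
    final long minimumTimestamp;

    TimeRange(long minimumTimestamp) {
        this.minimumTimestamp = minimumTimestamp;
    }
}

final class OldestAge {
    /** Mirrors the patched expression: a missing tracker yields Long.MIN_VALUE. */
    static long oldest(long now, TimeRange tracker) {
        return (tracker == null) ? Long.MIN_VALUE : now - tracker.minimumTimestamp;
    }

    /** Mirrors the TTL half of the condition in the diff; "forever" is an assumed sentinel. */
    static boolean withinTtl(long oldest, long ttl, long forever) {
        return ttl == forever || oldest < ttl;
    }
}
```

With a null tracker the computed age is Long.MIN_VALUE, so the "oldest < ttl" comparison holds for any ttl larger than Long.MIN_VALUE, and no dereference of the missing tracker takes place.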





[jira] Commented: (HBASE-3524) NPE from CompactionChecker

2011-02-10 Thread ryan rawson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993327#comment-12993327
 ] 

ryan rawson commented on HBASE-3524:


The issue is that if the hfile does not have timerangeBytes, this code doesn't 
trigger:

(StoreFile.java)
  if (timerangeBytes != null) {
    this.reader.timeRangeTracker = new TimeRangeTracker();
    Writables.copyWritable(timerangeBytes, this.reader.timeRangeTracker);
  }

And timeRangeTracker remains null.

But this code doesn't check for null:

(Store.java, line 832)
  long oldest = now - sf.getReader().timeRangeTracker.minimumTimestamp;


If timeRangeTracker is null, we should probably use Integer.MIN_VALUE for 
minimumTimestamp.

What is the creation time of your empty file? When is it from? Maybe it's old?
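
The failure mode described in this comment can be reproduced with a small model. This is only a sketch under stated assumptions: ReaderStub and TrackerStub are made-up names, the file metadata is modeled as a plain map, and the "TIMERANGE" key is used purely as an illustrative label. It shows how optional metadata leaves a field null at load time and how a later unguarded read then throws.

```java
import java.util.Map;

/** Stand-in for the tracker object; not the HBase TimeRangeTracker. */
final class TrackerStub {
    long minimumTimestamp;
}

final class ReaderStub {
    /** Stays null when the file carries no time range metadata. */
    TrackerStub timeRangeTracker;

    /** Load step: the tracker is only created when the metadata entry exists. */
    static ReaderStub open(Map<String, Long> fileInfo) {
        ReaderStub reader = new ReaderStub();
        Long stored = fileInfo.get("TIMERANGE");
        if (stored != null) {
            reader.timeRangeTracker = new TrackerStub();
            reader.timeRangeTracker.minimumTimestamp = stored;
        }
        return reader;
    }

    /** Use step without a null check: throws NullPointerException for old files. */
    long unguardedAge(long now) {
        return now - timeRangeTracker.minimumTimestamp;
    }
}
```

A file written with the metadata yields a normal age, while a file without it fails only when the age is first computed, which matches a periodic checker reporting the error long after the file was opened.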





[jira] Commented: (HBASE-3524) NPE from CompactionChecker

2011-02-10 Thread James Kennedy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993326#comment-12993326
 ] 

James Kennedy commented on HBASE-3524:
--

Thanks. I'm in a bit of a pickle. Though I tested all upgrades on QA and test 
data, this issue has only cropped up on a production deploy. Since our 
production app appeared to be running smoothly we gave it a +1 and there is 
already new user data in there. I'm wondering if I should revert to older data 
anyway (some user data loss) until this corruption is handled...

Shouldn't 0.90.0 automatically upgrade old data?





[jira] Commented: (HBASE-3524) NPE from CompactionChecker

2011-02-10 Thread James Kennedy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993320#comment-12993320
 ] 

James Kennedy commented on HBASE-3524:
--

I found this in the hbase.log:


[10/02/11 18:37:29] 44386  [1297391814420-0] WARN  adoop.hbase.regionserver.Store  - Skipping hdfs://localhost:7701/hbase/.META./1028785192/info/2685681686584745388 because its empty. HBASE-646 DATA LOSS?

So perhaps this issue is a symptom of corrupt meta data. HOW can I fix this!?





[jira] Commented: (HBASE-3524) NPE from CompactionChecker

2011-02-10 Thread ryan rawson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993321#comment-12993321
 ] 

ryan rawson commented on HBASE-3524:


Old files are causing new code to break, it seems. Good job tracking it down!






[jira] Commented: (HBASE-3524) NPE from CompactionChecker

2011-02-10 Thread James Kennedy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993318#comment-12993318
 ] 

James Kennedy commented on HBASE-3524:
--

Did some more debugging and got a little more intel:  What's null on that line 
is sf.getReader().timeRangeTracker.

It seems to be consistently null for many if not all tables.  Anyone know how 
this could happen?





[jira] Created: (HBASE-3524) NPE from CompactionChecker

2011-02-10 Thread James Kennedy (JIRA)
NPE from CompactionChecker
--

 Key: HBASE-3524
 URL: https://issues.apache.org/jira/browse/HBASE-3524
 Project: HBase
  Issue Type: Bug
Reporter: James Kennedy
 Fix For: 0.90.2


I recently updated production data to use HBase 0.90.0.
Now I'm periodically seeing:

[10/02/11 17:23:27] 30076066 [mpactionChecker] ERROR nServer$MajorCompactionChecker  - Caught exception
java.lang.NullPointerException
  at org.apache.hadoop.hbase.regionserver.Store.isMajorCompaction(Store.java:832)
  at org.apache.hadoop.hbase.regionserver.Store.isMajorCompaction(Store.java:810)
  at org.apache.hadoop.hbase.regionserver.HRegion.isMajorCompaction(HRegion.java:2800)
  at org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker.chore(HRegionServer.java:1047)
  at org.apache.hadoop.hbase.Chore.run(Chore.java:66)

The only negative effect is that this is preventing compactions from 
happening. But that is pretty serious, and it might be a sign of data 
corruption?

Maybe it's just my data, but this task should at least involve improving the 
handling to catch the NPE and still iterate through the other onlineRegions 
that might compact without error.  The MajorCompactionChecker.chore() method 
only catches IOExceptions and so this NPE breaks out of that loop. 
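
The handling improvement asked for here could look roughly like the following. This is a sketch only: CheckableRegion and ResilientChecker are hypothetical names and do not reproduce the actual MajorCompactionChecker code. The point is that the try/catch sits inside the loop and also covers unchecked exceptions, so one bad region does not stop the others from being checked.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a region; this is not the HBase HRegion class. */
interface CheckableRegion {
    String name();

    boolean isMajorCompaction() throws IOException;
}

final class ResilientChecker {
    /**
     * Returns the names of regions that report needing a major compaction.
     * A failure in one region is recorded and skipped instead of ending the loop.
     */
    static List<String> regionsNeedingCompaction(List<CheckableRegion> regions,
                                                 List<String> failures) {
        List<String> needed = new ArrayList<String>();
        for (CheckableRegion region : regions) {
            try {
                if (region.isMajorCompaction()) {
                    needed.add(region.name());
                }
            } catch (IOException | RuntimeException e) {
                // RuntimeException covers the NullPointerException case as well.
                failures.add(region.name() + ": " + e);
            }
        }
        return needed;
    }

    /** Test helper: a region that answers normally. */
    static CheckableRegion healthy(final String name, final boolean needsCompaction) {
        return new CheckableRegion() {
            public String name() { return name; }
            public boolean isMajorCompaction() { return needsCompaction; }
        };
    }

    /** Test helper: a region whose check fails with an NPE. */
    static CheckableRegion broken(final String name) {
        return new CheckableRegion() {
            public String name() { return name; }
            public boolean isMajorCompaction() {
                throw new NullPointerException("simulated missing timeRangeTracker");
            }
        };
    }
}
```

Catching broadly like this only keeps the loop going; it does not address why the value was null in the first place, so the failures list (or a log line) should still be surfaced.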






[jira] Updated: (HBASE-1502) Remove need for heartbeats in HBase

2011-02-10 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-1502:
-

Priority: Blocker  (was: Critical)
Assignee: stack

Made this a blocker.  Need this to fix up the bug where we'd see splits before 
the region had successfully opened.

> Remove need for heartbeats in HBase
> ---
>
> Key: HBASE-1502
> URL: https://issues.apache.org/jira/browse/HBASE-1502
> Project: HBase
>  Issue Type: Task
>Reporter: Nitay Joffe
>Assignee: stack
>Priority: Blocker
> Fix For: 0.92.0
>
> Attachments: 1502.txt
>
>
> HBase currently uses heartbeats between region servers and the master, 
> piggybacking information on them when it can. This issue is to investigate if 
> we can get rid of the need for those using ZooKeeper events.





[jira] Updated: (HBASE-1502) Remove need for heartbeats in HBase

2011-02-10 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-1502:
-

Attachment: 1502.txt

Here is a start -- the fun part.  Removed HMsg and then redid the 
HMasterRegionInterface to remove the heartbeating.  Was able to remove a bunch 
of other crud from Master and Regionserver.  TODO: Splits, shutdown, and load.  
The report into the master by the RS remains.  It'll be passed back the hostname 
to use.  It'll add this up into its znode (it'll only add the znode after it has 
registered w/ Master).  The znode will then be updated by the RS periodically 
with its load info.  The Master load balancer will read it from there.  Regarding 
splits, after talking w/ Jon, it's a new RIT state.  As for shutdown, I'll have to 
think about it and write something up.
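
The load-reporting flow described above (RS publishes its load into its znode, the Master balancer reads it from there) can be sketched with an in-memory stand-in. This is an assumption-laden illustration: LoadRegistry is a made-up class, a plain map takes the place of ZooKeeper, and load is reduced to a single region count. It is not the attached 1502.txt patch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for the coordination service; this is not the ZooKeeper API. */
final class LoadRegistry {
    private final Map<String, Integer> loadByServer =
        new ConcurrentHashMap<String, Integer>();

    /** Region server side: called after registering with the master, then periodically. */
    void publish(String hostname, int regionCount) {
        loadByServer.put(hostname, regionCount);
    }

    /** Master side: the balancer reads the published loads; returns null if none exist. */
    String leastLoaded() {
        String best = null;
        int bestLoad = 0;
        for (Map.Entry<String, Integer> entry : loadByServer.entrySet()) {
            if (best == null || entry.getValue() < bestLoad) {
                best = entry.getKey();
                bestLoad = entry.getValue();
            }
        }
        return best;
    }
}
```

The design point this illustrates is that the master no longer needs a message from each server to learn its load; it reads whatever each server last wrote.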





[jira] Commented: (HBASE-3523) Rewrite our client

2011-02-10 Thread Jonathan Gray (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993279#comment-12993279
 ] 

Jonathan Gray commented on HBASE-3523:
--

A binary, language agnostic underlying RPC and wire protocol.  Async, as an 
option, would be nice as well.

I'd like more visibility into and control over what is happening underneath with 
respect to connections to RegionServers and such.  I don't like all the 
staticness and voodoo magic, at least not as the only option.  The use of 
something like a hash of Configuration has always been weird to me.

A better API for how errors are returned; for example, I can never understand 
how the MultiAction stuff works without digging into code.

+1 to your suggestions.  We can already do stuff off the back of ZK for region 
movement if we wanted, but the opportunity for little hints in RPCs would be 
neat as well.

Thanks for filing this, stack.

> Rewrite our client
> --
>
> Key: HBASE-3523
> URL: https://issues.apache.org/jira/browse/HBASE-3523
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: stack
>
> Is it just me or do others sense that there is pressure building to redo the 
> client?  If just me, ignore the below... I'll just keep notes in here.  
> Otherwise, what would the requirements for a client rewrite look like?
> + Let out InterruptedException
> + Enveloping of messages or space for metadata that can be passed by client 
> to server and by server to client; e.g. the region a.b.c moved to server 
> x.y.z. or scanner is finished or timeout
> + A different RPC? One with tighter serialization.
> + More sane timeout/retry policy.
> Does it have to support async communication?  Do callbacks?
> What else?





[jira] Commented: (HBASE-3523) Rewrite our client

2011-02-10 Thread ryan rawson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993276#comment-12993276
 ] 

ryan rawson commented on HBASE-3523:


Things that are issues:

- the use of proxy means that the interfaces _must have_ InterruptedException 
on the interface, or else you get "undeclared throwable exception", but now you 
are conflating a business contract (the interfaces) and networking/execution 
realities. Furthermore, going through a proxy object isn't necessary; it's just 
more layers, since few people directly code against the interfaces.
- multiple levels of timeouts cause unnecessary confusion. Also the retry loops 
in HCM cause confusion and issues.
- client should support parallelism more directly, no more thread pools that 
just sleep!
- lots of callables make the code harder to read, either get rid of them or use 
more inner classes. Jumping around files makes for difficult comprehension.
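
The proxy point in the first item above is standard java.lang.reflect.Proxy behaviour and can be shown in a few lines. The interfaces and the ProxyDemo class below are invented for illustration and have nothing to do with the real HBase RPC interfaces. When the handler throws a checked exception that the interface method does not declare, the caller sees UndeclaredThrowableException instead.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.lang.reflect.UndeclaredThrowableException;

/** Hypothetical interface that does not declare InterruptedException. */
interface PlainService {
    String get(String row);
}

/** Hypothetical interface that does declare it. */
interface InterruptibleService {
    String get(String row) throws InterruptedException;
}

final class ProxyDemo {
    /** Builds a proxy whose handler always throws InterruptedException. */
    static <T> T interruptedProxy(Class<T> iface) {
        InvocationHandler handler = new InvocationHandler() {
            @Override
            public Object invoke(Object proxy, Method method, Object[] args)
                    throws Throwable {
                throw new InterruptedException("simulated interrupt");
            }
        };
        return iface.cast(Proxy.newProxyInstance(
            iface.getClassLoader(), new Class<?>[] { iface }, handler));
    }

    /** The interface hides the checked exception, so the proxy wraps it. */
    static String callPlain() {
        try {
            interruptedProxy(PlainService.class).get("r1");
            return "ok";
        } catch (UndeclaredThrowableException e) {
            return "undeclared: " + e.getCause().getClass().getSimpleName();
        }
    }

    /** The interface declares the exception, so it passes through unchanged. */
    static String callInterruptible() {
        try {
            interruptedProxy(InterruptibleService.class).get("r1");
            return "ok";
        } catch (InterruptedException e) {
            return "interrupted";
        }
    }
}
```

This is the trade-off the comment describes: either every interface method carries the transport-level exception in its signature, or callers have to unwrap it from a runtime exception.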

Some good things:
- the base socket handling is actually in good shape. 1 socket per client-rs 
pair is about where we want to be.
- multiplexing requests on the same socket is good, not spawning extra threads 
server side just to handle more clients is also good. since every client will 
have an open socket to at least the META region, this is very important!
- the handler pool is a natural side effect of the previous point, unbounding 
it might not be a good idea.

Other constraints:
- we will want to provide an efficient blocking API, it's what is expected.
- an async api might be nice, perhaps it can layer on or something.
- Making HTable thread agnostic might be useful. Pooling the write buffer or 
doing something else interesting there would be necessary.






[jira] Created: (HBASE-3523) Rewrite our client

2011-02-10 Thread stack (JIRA)
Rewrite our client
--

 Key: HBASE-3523
 URL: https://issues.apache.org/jira/browse/HBASE-3523
 Project: HBase
  Issue Type: Brainstorming
Reporter: stack


Is it just me or do others sense that there is pressure building to redo the 
client?  If just me, ignore the below... I'll just keep notes in here.  
Otherwise, what would the requirements for a client rewrite look like?

+ Let out InterruptedException
+ Enveloping of messages or space for metadata that can be passed by client to 
server and by server to client; e.g. the region a.b.c moved to server x.y.z. or 
scanner is finished or timeout
+ A different RPC? One with tighter serialization.
+ More sane timeout/retry policy.

Does it have to support async communication?  Do callbacks?

What else?
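
As a rough illustration of what a "more sane timeout/retry policy" could mean, 
the sketch below uses one overall time budget and one retry loop instead of 
nested timeouts. DeadlineRetry and its parameters (maxAttempts, budgetMillis, 
pauseMillis) are hypothetical names made up for this example; it also lets 
InterruptedException out, as the first requirement asks.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; DeadlineRetry and its parameters are not HBase APIs.
final class DeadlineRetry {
    private DeadlineRetry() {}

    /** Retries op until it succeeds, attempts run out, or the budget is spent. */
    static <T> T call(Callable<T> op, int maxAttempts, long budgetMillis, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(budgetMillis);
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (InterruptedException e) {
                throw e; // let interruption out instead of swallowing it
            } catch (Exception e) {
                last = e; // remember the failure and possibly retry
            }
            long remaining = deadline - System.nanoTime();
            if (attempt == maxAttempts
                    || remaining < TimeUnit.MILLISECONDS.toNanos(pauseMillis)) {
                break; // out of attempts, or the pause would overrun the budget
            }
            Thread.sleep(pauseMillis);
        }
        throw last; // the most recent failure, surfaced to the caller
    }
}
```

The caller sees a single knob for total time and a single knob for attempts, 
and the last underlying exception is rethrown unchanged.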





[jira] Resolved: (HBASE-3522) Unbundle our RPC versioning; rather than a global for all 4 Interfaces -- region, master, region to master, and coprocesssors -- instead version each individually

2011-02-10 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack resolved HBASE-3522.
--

   Resolution: Fixed
Fix Version/s: 0.92.0
 Assignee: stack
 Hadoop Flags: [Reviewed]

Applied to TRUNK (Thanks for review RR).

> Unbundle our RPC versioning; rather than a global for all 4 Interfaces -- 
> region, master, region to master, and coprocesssors -- instead version each 
> individually
> --
>
> Key: HBASE-3522
> URL: https://issues.apache.org/jira/browse/HBASE-3522
> Project: HBase
>  Issue Type: Improvement
>Reporter: stack
>Assignee: stack
> Fix For: 0.92.0
>
>
> We'd undo the global RPC version so a change in CP Interface or a change in 
> the 'private' regionserver to master Interface would not break clients who do 
> not use CPs or who don't care about the private regionserver to master 
> protocol.
> Benoît suggested this.  I want it because I want to get rid of heartbeating 
> so will want to change the regionserver to master Interface.





[jira] Commented: (HBASE-3522) Unbundle our RPC versioning; rather than a global for all 4 Interfaces -- region, master, region to master, and coprocesssors -- instead version each individually

2011-02-10 Thread ryan rawson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993199#comment-12993199
 ] 

ryan rawson commented on HBASE-3522:


+1, commit that sucka!

> Unbundle our RPC versioning; rather than a global for all 4 Interfaces -- 
> region, master, region to master, and coprocesssors -- instead version each 
> individually
> --
>
> Key: HBASE-3522
> URL: https://issues.apache.org/jira/browse/HBASE-3522
> Project: HBase
>  Issue Type: Improvement
>Reporter: stack
>
> We'd undo the global RPC version so a change in CP Interface or a change in 
> the 'private' regionserver to master Interface would not break clients who do 
> not use CPs or who don't care about the private regionserver to master 
> protocol.
> Benoît suggested this.  I want it because I want to get rid of heartbeating 
> so will want to change the regionserver to master Interface.





[jira] Commented: (HBASE-3522) Unbundle our RPC versioning; rather than a global for all 4 Interfaces -- region, master, region to master, and coprocesssors -- instead version each individually

2011-02-10 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993192#comment-12993192
 ] 

stack commented on HBASE-3522:
--

I posted a patch up on review.cloudera.org: https://review.cloudera.org/r/1561/

> Unbundle our RPC versioning; rather than a global for all 4 Interfaces -- 
> region, master, region to master, and coprocesssors -- instead version each 
> individually
> --
>
> Key: HBASE-3522
> URL: https://issues.apache.org/jira/browse/HBASE-3522
> Project: HBase
>  Issue Type: Improvement
>Reporter: stack
>
> We'd undo the global RPC version so a change in CP Interface or a change in 
> the 'private' regionserver to master Interface would not break clients who do 
> not use CPs or who don't care about the private regionserver to master 
> protocol.
> Benoît suggested this.  I want it because I want to get rid of heartbeating 
> so will want to change the regionserver to master Interface.





[jira] Commented: (HBASE-3522) Unbundle our RPC versioning; rather than a global for all 4 Interfaces -- region, master, region to master, and coprocesssors -- instead version each individually

2011-02-10 Thread ryan rawson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993191#comment-12993191
 ] 

ryan rawson commented on HBASE-3522:


This was previously required because the interface members from all our RPC 
interfaces were sorted and then sequentially assigned IDs, which were used in 
place of strings. Now that we send strings for the method names instead of IDs, 
this jira is possible.
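
To make the point above concrete, here is a minimal sketch of such a scheme: 
the method names of every interface are pooled, sorted, and numbered. It is an 
illustration only, not the actual HBase RPC code; MethodIdTable and assignIds 
are hypothetical names, and method names are assumed unique across interfaces.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not HBase's actual RPC code.
final class MethodIdTable {
    private MethodIdTable() {}

    /**
     * Pools the method names of all interfaces, sorts them, and assigns
     * IDs 0..n-1 in sorted order. Assumes names are unique across interfaces.
     */
    static Map<String, Integer> assignIds(List<List<String>> interfaces) {
        List<String> all = new ArrayList<>();
        for (List<String> methods : interfaces) {
            all.addAll(methods);
        }
        Collections.sort(all);
        Map<String, Integer> ids = new HashMap<>();
        for (int i = 0; i < all.size(); i++) {
            ids.put(all.get(i), i);
        }
        return ids;
    }
}
```

With interfaces {get, put} and {regionServerReport}, put gets ID 1. Adding a 
method named heartbeat to the second interface shifts put to ID 2, so a change 
to one interface changes the wire IDs of another. That coupling is why a single 
global version was needed while IDs were on the wire.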

> Unbundle our RPC versioning; rather than a global for all 4 Interfaces -- 
> region, master, region to master, and coprocesssors -- instead version each 
> individually
> --
>
> Key: HBASE-3522
> URL: https://issues.apache.org/jira/browse/HBASE-3522
> Project: HBase
>  Issue Type: Improvement
>Reporter: stack
>
> We'd undo the global RPC version so a change in CP Interface or a change in 
> the 'private' regionserver to master Interface would not break clients who do 
> not use CPs or who don't care about the private regionserver to master 
> protocol.
> Benoît suggested this.  I want it because I want to get rid of heartbeating 
> so will want to change the regionserver to master Interface.





[jira] Created: (HBASE-3522) Unbundle our RPC versioning; rather than a global for all 4 Interfaces -- region, master, region to master, and coprocesssors -- instead version each individually

2011-02-10 Thread stack (JIRA)
Unbundle our RPC versioning; rather than a global for all 4 Interfaces -- 
region, master, region to master, and coprocesssors -- instead version each 
individually
--

 Key: HBASE-3522
 URL: https://issues.apache.org/jira/browse/HBASE-3522
 Project: HBase
  Issue Type: Improvement
Reporter: stack


We'd undo the global RPC version so a change in CP Interface or a change in the 
'private' regionserver to master Interface would not break clients who do not 
use CPs or who don't care about the private regionserver to master protocol.
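
A minimal sketch of the idea, assuming versions are tracked per interface name 
and a caller only checks the interface it actually uses. ProtocolVersions, 
register and isCompatible are hypothetical names for illustration, not the 
API of the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the API introduced by the actual patch.
final class ProtocolVersions {
    // One version number per RPC interface instead of one global number.
    private final Map<String, Long> versions = new HashMap<>();

    /** Records (or bumps) the version a server speaks for one interface. */
    void register(String protocol, long version) {
        versions.put(protocol, version);
    }

    /** True when the server speaks this interface at the caller's version. */
    boolean isCompatible(String protocol, long callerVersion) {
        Long serverVersion = versions.get(protocol);
        return serverVersion != null && serverVersion.longValue() == callerVersion;
    }
}
```

Bumping the version of the regionserver-to-master interface then leaves a 
client that only talks the region interface unaffected.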

Benoît suggested this.  I want it because I want to get rid of heartbeating so 
will want to change the regionserver to master Interface.





[jira] Created: (HBASE-3521) region be merged with others automatically when all data in the region has expired and removed, or region gets too small.

2011-02-10 Thread zhoushuaifeng (JIRA)
region be merged with others automatically when all data in the region has 
expired and removed, or region gets too small.
-

 Key: HBASE-3521
 URL: https://issues.apache.org/jira/browse/HBASE-3521
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver, scripts
Affects Versions: 0.90.0
Reporter: zhoushuaifeng
Priority: Minor


We have tested a cluster with more than 30,000 regions, where the max size of a 
region is 512MB. In this situation the data set is no longer growing, but old 
data is removed and new data is inserted, so the number of regions keeps 
increasing, and some regions may end up very small or empty. This occupies too 
much heap, and will occupy more if regions cannot be merged, which limits how 
long HBase can keep running. 
A script that surveys for empty regions to remove, or picks out adjacent small 
regions and then does the online merge, seems like it would be useful. 
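
The survey step could be sketched roughly as below, assuming the script 
already has the regions of one table in start-key order along with their sizes. 
MergeCandidates, RegionSize and findPairs are hypothetical names; how sizes are 
obtained and how the merge itself is performed are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the survey step only; not an HBase tool.
final class MergeCandidates {
    private MergeCandidates() {}

    /** Hypothetical summary of one region: its name and size in bytes. */
    static final class RegionSize {
        final String name;
        final long bytes;

        RegionSize(String name, long bytes) {
            this.name = name;
            this.bytes = bytes;
        }
    }

    /**
     * Walks regions in start-key order and pairs up neighbours whose combined
     * size is at most maxBytes. Each region appears in at most one pair, and
     * an empty region (0 bytes) pairs with any neighbour that fits.
     */
    static List<String[]> findPairs(List<RegionSize> sortedByStartKey, long maxBytes) {
        List<String[]> pairs = new ArrayList<>();
        int i = 0;
        while (i + 1 < sortedByStartKey.size()) {
            RegionSize a = sortedByStartKey.get(i);
            RegionSize b = sortedByStartKey.get(i + 1);
            if (a.bytes + b.bytes <= maxBytes) {
                pairs.add(new String[] {a.name, b.name});
                i += 2; // both regions are consumed by this pair
            } else {
                i += 1; // try the next neighbour instead
            }
        }
        return pairs;
    }
}
```

Using the 512MB region limit as maxBytes keeps a merged region from being 
split again right away; running the survey repeatedly would fold longer runs of 
small regions together.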
