[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12994649#comment-12994649 ]

Hudson commented on HBASE-3524:

Integrated in HBase-TRUNK #1745 (See [https://hudson.apache.org/hudson/job/HBase-TRUNK/1745/])

NPE from CompactionChecker

Key: HBASE-3524
URL: https://issues.apache.org/jira/browse/HBASE-3524
Project: HBase
Issue Type: Bug
Components: regionserver
Affects Versions: 0.90.0
Reporter: James Kennedy
Assignee: ryan rawson
Priority: Blocker
Fix For: 0.90.1, 0.90.2
Attachments: 3524.txt

I recently updated production data to use HBase 0.90.0. Now I'm periodically seeing:

[10/02/11 17:23:27] 30076066 [mpactionChecker] ERROR nServer$MajorCompactionChecker - Caught exception
java.lang.NullPointerException
    at org.apache.hadoop.hbase.regionserver.Store.isMajorCompaction(Store.java:832)
    at org.apache.hadoop.hbase.regionserver.Store.isMajorCompaction(Store.java:810)
    at org.apache.hadoop.hbase.regionserver.HRegion.isMajorCompaction(HRegion.java:2800)
    at org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker.chore(HRegionServer.java:1047)
    at org.apache.hadoop.hbase.Chore.run(Chore.java:66)

The only negative effect is that this is interrupting compactions from happening. But that is pretty serious and this might be a sign of data corruption?

Maybe it's just my data, but this task should at least involve improving the handling to catch the NPE and still iterate through the other onlineRegions that might compact without error. The MajorCompactionChecker.chore() method only catches IOExceptions and so this NPE breaks out of that loop.

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
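The issue description asks that the checker catch the NPE and keep iterating over the remaining online regions. A minimal sketch of that idea follows. It is not the HBase code: `RegionLike`, `ChoreLoopSketch`, and the `region` helper are hypothetical names invented for illustration, and the only behaviour taken from the report is that catching `IOException` alone lets a `NullPointerException` escape the loop.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a region; not the HBase HRegion class. */
interface RegionLike {
    String name();
    boolean isMajorCompaction() throws IOException;
}

final class ChoreLoopSketch {
    /**
     * Checks every region and returns the names of those that report a major
     * compaction is due. A failure in one region is logged and skipped, so it
     * cannot stop the regions after it from being checked.
     */
    static List<String> regionsNeedingMajorCompaction(List<RegionLike> onlineRegions) {
        List<String> result = new ArrayList<>();
        for (RegionLike r : onlineRegions) {
            try {
                if (r.isMajorCompaction()) {
                    result.add(r.name());
                }
            } catch (IOException | RuntimeException e) {
                // Catching RuntimeException as well covers the NPE from the report.
                System.err.println("Skipping " + r.name() + ": " + e);
            }
        }
        return result;
    }

    /** Hypothetical test double: a region that answers normally or throws an NPE. */
    static RegionLike region(final String name, final boolean wantsCompaction,
                             final boolean throwsNpe) {
        return new RegionLike() {
            @Override public String name() { return name; }
            @Override public boolean isMajorCompaction() {
                if (throwsNpe) {
                    throw new NullPointerException("timeRangeTracker is null");
                }
                return wantsCompaction;
            }
        };
    }

    public static void main(String[] args) {
        List<RegionLike> regions = Arrays.asList(
            region("a", true, false),
            region("b", false, true),   // this one throws
            region("c", true, false));
        System.out.println(regionsNeedingMajorCompaction(regions)); // prints [a, c]
    }
}
```

Placing the try/catch inside the loop body, rather than around the whole loop, is what keeps one bad region from hiding the others.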
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12993653#comment-12993653 ]

James Kennedy commented on HBASE-3524:

So that .meta file with DATA LOSS is definitely old (2010-05-20). Looking back over old logs, I realized that the DATA LOSS WARN has been there for a while, so that is probably a separate issue from this CompactionChecker problem. Guess I'll just delete the file in HDFS.

So, it looks like my data is stable now after the forced compactions. I didn't have to apply the patch in production code to stop the NPEs.

I'm still concerned about how this happened to some regions and not others, since all were left up long enough to get to that NPE point, which only prevented the first full compactions after the 0.90.0 upgrade for 8 out of 50 tables. Maybe the other 42 were updated as part of the initial startup process...
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12993792#comment-12993792 ]

James Kennedy commented on HBASE-3524:

Why choose Long.MIN_VALUE? Wouldn't Long.MAX_VALUE encourage a major compaction and get pre-0.90.0 StoreFiles out of the picture sooner?
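To make the question concrete, here is a small sketch of how the two sentinels would behave in an age-versus-TTL test. It rests on an assumption: that the check compares the computed age against the TTL as `oldest < ttl`, with a true result meaning the file is treated as still within its TTL. The archived patch text does not show that operator legibly, so treat the comparison, and the names `SentinelSketch`, `oldestAge`, and `withinTtl`, as illustrative only.

```java
final class SentinelSketch {
    /**
     * Age of the oldest entry, or the given sentinel when no minimum
     * timestamp is available (the null-tracker case from this issue).
     */
    static long oldestAge(Long minimumTimestamp, long now, long sentinel) {
        return (minimumTimestamp == null) ? sentinel : now - minimumTimestamp;
    }

    /** ASSUMED form of the test: the file counts as still within its TTL. */
    static boolean withinTtl(long oldest, long ttlMillis) {
        return oldest < ttlMillis;
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        long ttl = 60_000L;

        // Missing timestamp with Long.MIN_VALUE: age looks very young.
        System.out.println(withinTtl(oldestAge(null, now, Long.MIN_VALUE), ttl)); // true

        // Missing timestamp with Long.MAX_VALUE: age looks very old.
        System.out.println(withinTtl(oldestAge(null, now, Long.MAX_VALUE), ttl)); // false

        // Known timestamp: the sentinel is not used at all.
        System.out.println(withinTtl(oldestAge(990_000L, now, Long.MIN_VALUE), ttl)); // true
    }
}
```

Under that assumption, the sentinel only decides which side of the TTL test a file with no timestamp lands on; files that do carry a timestamp are unaffected.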
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12993794#comment-12993794 ]

James Kennedy commented on HBASE-3524:

Duh, yep, I get it. Just crossed a wire somewhere.
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12993318#comment-12993318 ]

James Kennedy commented on HBASE-3524:

Did some more debugging and got a little more intel: What's null on that line is sf.getReader().timeRangeTracker. It seems to be consistently null for many if not all tables. Anyone know how this could happen?
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12993320#comment-12993320 ]

James Kennedy commented on HBASE-3524:

I found this in the hbase.log:

[10/02/11 18:37:29] 44386 [1297391814420-0] WARN adoop.hbase.regionserver.Store - Skipping hdfs://localhost:7701/hbase/.META./1028785192/info/2685681686584745388 because its empty. HBASE-646 DATA LOSS?

So perhaps this issue is a symptom of corrupt meta data. HOW can I fix this!?
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12993321#comment-12993321 ]

ryan rawson commented on HBASE-3524:

Old files causing new code to break it seems. Good job tracking it down!
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12993326#comment-12993326 ]

James Kennedy commented on HBASE-3524:

Thanks. I'm in a bit of a pickle. Though I tested all upgrades on QA and test data, this issue has only cropped up on a production deploy. Since our production app appeared to be running smoothly we gave it a +1 and there is already new user data in there. I'm wondering if I should revert to older data anyway (some user data loss) until this corruption is handled...

Shouldn't 0.90.0 automatically upgrade old data?
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12993327#comment-12993327 ]

ryan rawson commented on HBASE-3524:

The issue is that if the hfile does not have timerangeBytes, this code doesn't trigger (StoreFile.java):

    if (timerangeBytes != null) {
      this.reader.timeRangeTracker = new TimeRangeTracker();
      Writables.copyWritable(timerangeBytes, this.reader.timeRangeTracker);
    }

And timeRangeTracker remains null. But this code doesn't check for null (Store.java, line 832):

    long oldest = now - sf.getReader().timeRangeTracker.minimumTimestamp;

If timeRangeTracker is null, we should probably use Integer.MIN_VALUE for minimumTimestamp.

What is the creation time of your empty file? When is it from? Maybe it's old?
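The failure mode described in this comment can be reproduced in isolation. The sketch below is a simplified model, not HBase code: `TrackerSketch`, its nested `Tracker` class, the `Map`-based file metadata, and the `"TIMERANGE"` key are all assumptions made for illustration. It shows a loader that leaves the tracker null when the metadata entry is absent, an unguarded age computation that then throws, and a guarded one that substitutes a sentinel.

```java
import java.util.HashMap;
import java.util.Map;

final class TrackerSketch {
    /** Hypothetical simplified tracker; not HBase's TimeRangeTracker. */
    static final class Tracker {
        final long minimumTimestamp;
        Tracker(long minimumTimestamp) { this.minimumTimestamp = minimumTimestamp; }
    }

    /**
     * Models the load step: a tracker is only created when the file metadata
     * carries a time-range entry. Older files without it yield null.
     */
    static Tracker load(Map<String, Long> fileInfo) {
        Long min = fileInfo.get("TIMERANGE"); // hypothetical metadata key
        return (min == null) ? null : new Tracker(min);
    }

    /** Unguarded: throws NullPointerException when the tracker is null. */
    static long unguardedAge(Tracker t, long now) {
        return now - t.minimumTimestamp;
    }

    /** Guarded: falls back to a sentinel when the tracker is null. */
    static long guardedAge(Tracker t, long now) {
        return (t == null) ? Long.MIN_VALUE : now - t.minimumTimestamp;
    }

    public static void main(String[] args) {
        Map<String, Long> oldFile = new HashMap<>();   // no time-range entry
        Map<String, Long> newFile = new HashMap<>();
        newFile.put("TIMERANGE", 400L);

        System.out.println(guardedAge(load(newFile), 1000L)); // prints 600
        System.out.println(guardedAge(load(oldFile), 1000L)); // prints -9223372036854775808

        try {
            unguardedAge(load(oldFile), 1000L);
        } catch (NullPointerException e) {
            System.out.println("unguarded computation threw NPE");
        }
    }
}
```

The guard is placed at the point of use here; an alternative would be to have the loader always return a tracker with default values, which is a different design choice with its own trade-offs.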
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12993328#comment-12993328 ]

ryan rawson commented on HBASE-3524:

try this patch:

diff --git a/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java b/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
index d7e3ce3..519111a 100644
--- a/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
+++ b/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
@@ -829,7 +829,10 @@ public class Store implements HeapSize {
       if (filesToCompact.size() == 1) {
         // Single file
         StoreFile sf = filesToCompact.get(0);
-        long oldest = now - sf.getReader().timeRangeTracker.minimumTimestamp;
+        long oldest =
+            (sf.getReader().timeRangeTracker == null) ?
+                Long.MIN_VALUE :
+                now - sf.getReader().timeRangeTracker.minimumTimestamp;
         if (sf.isMajorCompaction() &&
             (this.ttl == HConstants.FOREVER || oldest < this.ttl)) {
           if (LOG.isDebugEnabled()) {

no test yet! doh!
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12993341#comment-12993341 ]

James Kennedy commented on HBASE-3524:

This patch obviously stops the NPE and allows compaction checking to follow through.

Furthermore, I added a log output line that reports which stores have timeRangeTracker == null when they are encountered. It seemed that 7 or 8 tables (out of 50) had this problem, and when I forced their major compaction from the hbase shell they stopped reporting the error. So it looks like the major compactions created new stores with timeRangeTracker set properly.

I'm still concerned, though, about how this happened in the first place, and I need to do some thorough testing of the data to ensure nothing was lost.

Ryan, in your opinion, is this data likely to have survived without corruption? And thanks for your speedy help.
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12993365#comment-12993365 ]

James Kennedy commented on HBASE-3524:

> What is the creation time of your empty file? When is it from? Maybe it's old?

Let me re-reproduce these issues from scratch tomorrow morning.