[jira] Assigned: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Kennedy reassigned HBASE-3524:

    Assignee: James Kennedy

> NPE from CompactionChecker
> --------------------------
>
> Key: HBASE-3524
> URL: https://issues.apache.org/jira/browse/HBASE-3524
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Affects Versions: 0.90.0
> Reporter: James Kennedy
> Assignee: James Kennedy
> Priority: Blocker
> Fix For: 0.90.1, 0.90.2
>
> I recently updated production data to use HBase 0.90.0.
> Now I'm periodically seeing:
> [10/02/11 17:23:27] 30076066 [mpactionChecker] ERROR nServer$MajorCompactionChecker - Caught exception
> java.lang.NullPointerException
>   at org.apache.hadoop.hbase.regionserver.Store.isMajorCompaction(Store.java:832)
>   at org.apache.hadoop.hbase.regionserver.Store.isMajorCompaction(Store.java:810)
>   at org.apache.hadoop.hbase.regionserver.HRegion.isMajorCompaction(HRegion.java:2800)
>   at org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker.chore(HRegionServer.java:1047)
>   at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
> The only negative effect is that this is interrupting compactions from happening. But that is pretty serious and this might be a sign of data corruption?
> Maybe it's just my data, but this task should at least involve improving the handling to catch the NPE and still iterate through the other onlineRegions that might compact without error. The MajorCompactionChecker.chore() method only catches IOExceptions and so this NPE breaks out of that loop.

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ryan rawson updated HBASE-3524: --- Component/s: regionserver Priority: Blocker (was: Major) Affects Version/s: 0.90.0 Fix Version/s: 0.90.1
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993365#comment-12993365 ] James Kennedy commented on HBASE-3524: -- > What is the creation time of your empty file? When is it from? Maybe it's old? Let me re-reproduce these issues from scratch tomorrow morning.
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993343#comment-12993343 ] ryan rawson commented on HBASE-3524: compaction is "optional", meaning if it fails no data is lost, so you should probably be fine. Older versions of the code did not write out time tracker data and that is why your older files were giving you npes.
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993341#comment-12993341 ]
James Kennedy commented on HBASE-3524:
This patch stops the NPE and allows compaction checking to follow through. I also added a log line that reports which stores have .timeRangeTracker == null when encountered. It seems that 7 or 8 tables (out of 50) had this problem, and when I forced their major compaction from the hbase shell they stopped reporting the error. So it looks like the major compactions created new stores with timeRangeTracker set properly. I'm still concerned, though, about how this happened in the first place, and I need to do some thorough testing of the data to ensure nothing was lost. Ryan, in your opinion, is this data likely to have survived without corruption? And thanks for your speedy help.
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993328#comment-12993328 ]
ryan rawson commented on HBASE-3524:
try this patch:

diff --git a/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java b/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
index d7e3ce3..519111a 100644
--- a/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
+++ b/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
@@ -829,7 +829,10 @@ public class Store implements HeapSize {
     if (filesToCompact.size() == 1) {
       // Single file
       StoreFile sf = filesToCompact.get(0);
-      long oldest = now - sf.getReader().timeRangeTracker.minimumTimestamp;
+      long oldest =
+          (sf.getReader().timeRangeTracker == null) ?
+              Long.MIN_VALUE :
+              now - sf.getReader().timeRangeTracker.minimumTimestamp;
       if (sf.isMajorCompaction() &&
           (this.ttl == HConstants.FOREVER || oldest < this.ttl)) {
         if (LOG.isDebugEnabled()) {

no test yet! doh!
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993327#comment-12993327 ]
ryan rawson commented on HBASE-3524:
the issue is that if the hfile does not have timerangeBytes, this code doesn't trigger (StoreFile.java):

    if (timerangeBytes != null) {
      this.reader.timeRangeTracker = new TimeRangeTracker();
      Writables.copyWritable(timerangeBytes, this.reader.timeRangeTracker);
    }

And timeRangeTracker remains null. But this code doesn't check for null (Store.java, line 832):

    long oldest = now - sf.getReader().timeRangeTracker.minimumTimestamp;

If timeRangeTracker is null, we should probably use Integer.MIN_VALUE for minimumTimestamp.
What is the creation time of your empty file? When is it from? Maybe it's old?
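For readers who want to try the null check outside HBase, here is a minimal, self-contained sketch. The classes `TimeRangeTrackerSketch` and `NullSafeOldest`, and the `FOREVER` sentinel value, are hypothetical stand-ins invented for illustration; they are not the HBase API. The sketch assumes that a store file without time-range metadata should be given an age of `Long.MIN_VALUE`, so that a TTL comparison of the form `oldest < ttl` still evaluates without dereferencing null.

```java
/** Hypothetical stand-in for the tracker named in the thread; not the HBase class. */
class TimeRangeTrackerSketch {
    final long minimumTimestamp;

    TimeRangeTrackerSketch(long minimumTimestamp) {
        this.minimumTimestamp = minimumTimestamp;
    }
}

class NullSafeOldest {
    /** Assumed sentinel meaning "no TTL"; the real constant may use a different value. */
    static final long FOREVER = Long.MAX_VALUE;

    /** Age of the oldest cell, or Long.MIN_VALUE when the file has no tracker. */
    static long oldestAge(long now, TimeRangeTrackerSketch tracker) {
        return (tracker == null) ? Long.MIN_VALUE : now - tracker.minimumTimestamp;
    }

    /** Mirrors the shape of the TTL condition quoted in the thread. */
    static boolean withinTtl(long oldest, long ttl) {
        return ttl == FOREVER || oldest < ttl;
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        long ttl = 500_000L;

        long tracked = oldestAge(now, new TimeRangeTrackerSketch(400_000L));
        long untracked = oldestAge(now, null); // would have thrown an NPE without the guard

        System.out.println("tracked age = " + tracked + ", within ttl = " + withinTtl(tracked, ttl));
        System.out.println("untracked age = " + untracked + ", within ttl = " + withinTtl(untracked, ttl));
    }
}
```

With `Long.MIN_VALUE` as the fallback, the comparison against the TTL is always true for files that lack the metadata. Whether that is the desired outcome depends on what the guarded branch does, which this sketch does not model.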
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993326#comment-12993326 ] James Kennedy commented on HBASE-3524: -- Thanks. I'm in a bit of a pickle. Though I tested all upgrades on QA and test data, this issue has only cropped up on a production deploy. Since our production app appeared to be running smoothly we gave it a +1 and there is already new user data in there. I'm wondering if I should revert to older data anyway (some user data loss) until this corruption is handled... Shouldn't 0.90.0 automatically upgrade old data?
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993320#comment-12993320 ] James Kennedy commented on HBASE-3524: -- I found this in the hbase.log: [10/02/11 18:37:29] 44386 [1297391814420-0] WARN adoop.hbase.regionserver.Store - Skipping hdfs://localhost:7701/hbase/.META./1028785192/info/2685681686584745388 because its empty. HBASE-646 DATA LOSS? So perhaps this issue is a symptom of corrupt meta data. HOW can I fix this!?
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993321#comment-12993321 ] ryan rawson commented on HBASE-3524: Old files causing new code to break it seems. Good job tracking it down!
[jira] Commented: (HBASE-3524) NPE from CompactionChecker
[ https://issues.apache.org/jira/browse/HBASE-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993318#comment-12993318 ] James Kennedy commented on HBASE-3524: -- Did some more debugging and got a little more intel: What's null on that line is sf.getReader().timeRangeTracker. It seems to be consistently null for many if not all tables. Anyone know how this could happen?
[jira] Created: (HBASE-3524) NPE from CompactionChecker
NPE from CompactionChecker
--------------------------

Key: HBASE-3524
URL: https://issues.apache.org/jira/browse/HBASE-3524
Project: HBase
Issue Type: Bug
Reporter: James Kennedy
Fix For: 0.90.2

I recently updated production data to use HBase 0.90.0.
Now I'm periodically seeing:
[10/02/11 17:23:27] 30076066 [mpactionChecker] ERROR nServer$MajorCompactionChecker - Caught exception
java.lang.NullPointerException
  at org.apache.hadoop.hbase.regionserver.Store.isMajorCompaction(Store.java:832)
  at org.apache.hadoop.hbase.regionserver.Store.isMajorCompaction(Store.java:810)
  at org.apache.hadoop.hbase.regionserver.HRegion.isMajorCompaction(HRegion.java:2800)
  at org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker.chore(HRegionServer.java:1047)
  at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
The only negative effect is that this is interrupting compactions from happening. But that is pretty serious and this might be a sign of data corruption?
Maybe it's just my data, but this task should at least involve improving the handling to catch the NPE and still iterate through the other onlineRegions that might compact without error. The MajorCompactionChecker.chore() method only catches IOExceptions and so this NPE breaks out of that loop.
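The report's suggestion, that one failing region should not stop the checker from visiting the remaining regions, can be sketched as follows. This is only an illustration under stated assumptions: `CompactionCheckSketch`, its `Region` interface, and the `region(...)` helper are hypothetical names made up for this example, and the real chore in HRegionServer may be structured quite differently. The point is simply that the try/catch sits inside the loop and covers runtime exceptions as well as IOException.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CompactionCheckSketch {
    /** Hypothetical stand-in for a region; not the HBase HRegion API. */
    interface Region {
        String name();
        boolean isMajorCompaction() throws IOException;
    }

    /**
     * Visits every region. A failure in one region is recorded in `failed`
     * and the loop moves on instead of aborting the whole pass.
     */
    static List<String> regionsNeedingCompaction(List<Region> onlineRegions, List<String> failed) {
        List<String> toCompact = new ArrayList<>();
        for (Region region : onlineRegions) {
            try {
                if (region.isMajorCompaction()) {
                    toCompact.add(region.name());
                }
            } catch (IOException | RuntimeException e) {
                // Catching RuntimeException here is what keeps an NPE from ending the loop.
                failed.add(region.name());
            }
        }
        return toCompact;
    }

    /** Test helper (hypothetical): a region that either answers or throws an NPE. */
    static Region region(final String name, final boolean needsCompaction, final boolean throwsNpe) {
        return new Region() {
            @Override public String name() { return name; }
            @Override public boolean isMajorCompaction() {
                if (throwsNpe) {
                    throw new NullPointerException("simulated missing timeRangeTracker");
                }
                return needsCompaction;
            }
        };
    }

    public static void main(String[] args) {
        List<String> failed = new ArrayList<>();
        List<String> toCompact = regionsNeedingCompaction(
            Arrays.asList(region("a", true, false), region("b", false, true), region("c", true, false)),
            failed);
        System.out.println("to compact: " + toCompact + ", failed: " + failed);
    }
}
```

Catching RuntimeException broadly is a trade-off: it keeps the pass going, but it can also hide bugs, so recording or logging each failure (as the `failed` list does here) matters.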
[jira] Updated: (HBASE-1502) Remove need for heartbeats in HBase
[ https://issues.apache.org/jira/browse/HBASE-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1502:

    Priority: Blocker (was: Critical)
    Assignee: stack

Made this a blocker. Need this to fix up the bug where we'd see splits before region had successfully opened.

> Remove need for heartbeats in HBase
> -----------------------------------
>
> Key: HBASE-1502
> URL: https://issues.apache.org/jira/browse/HBASE-1502
> Project: HBase
> Issue Type: Task
> Reporter: Nitay Joffe
> Assignee: stack
> Priority: Blocker
> Fix For: 0.92.0
> Attachments: 1502.txt
>
> HBase currently uses heartbeats between region servers and the master, piggybacking information on them when it can. This issue is to investigate if we can get rid of the need for those using ZooKeeper events.
[jira] Updated: (HBASE-1502) Remove need for heartbeats in HBase
[ https://issues.apache.org/jira/browse/HBASE-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack updated HBASE-1502:
Attachment: 1502.txt
Here is a start -- the fun part. Removed HMsg and then redid the HMasterRegionInterface to remove the heartbeating. Was able to remove a bunch of other crud from Master and Regionserver.
TODO: Splits, shutdown, and load.
The report into the master by RS remains. It'll be passed back the hostname to use. It'll add this up into its znode (it'll only add znode after it has registered w/ Master). The znode will then be updated by the RS on a period with its load info. Master loadbalancer will read it from there.
Regarding Splits, after talking w/ Jon, it's a new RIT state.
Shutdown, I'll have to think about it and write something up.
[jira] Commented: (HBASE-3523) Rewrite our client
[ https://issues.apache.org/jira/browse/HBASE-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993279#comment-12993279 ]
Jonathan Gray commented on HBASE-3523:
A binary, language-agnostic underlying RPC and wire protocol. Async, as an option, would be nice as well.
I'd like more visibility and control into what is happening underneath with respect to connections to RegionServers and such. I don't like all the staticness and voodoo magic, at least not as the only option. The usage of something like a hash of Configuration has always been weird to me.
A better API for how errors are returned; for example, I can never understand how the MultiAction stuff works without digging into code.
+1 to your suggestions. We can already do stuff off the back of ZK for region movement if we wanted, but the opportunity for little hints in RPCs would be neat as well.
Thanks for filing this stack.
[jira] Commented: (HBASE-3523) Rewrite our client
[ https://issues.apache.org/jira/browse/HBASE-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993276#comment-12993276 ]
ryan rawson commented on HBASE-3523:
Things that are issues:
- the use of proxy means that the interfaces _must have_ InterruptedException on the interface, or else you get "undeclared throwable exception", but now you are conflating a business contract (the interfaces) and networking/execution realities. Furthermore, going through a proxy object isn't necessary; it's just more layers, since few people directly code against the interfaces.
- multiple levels of timeouts cause unnecessary confusion. Also the retry loops in HCM cause confusion and issues.
- client should support parallelism more directly, no more thread pools that just sleep!
- lots of callables make the code harder to read, either get rid of them or use more inner classes. Jumping around files makes for difficult comprehension.
Some good things:
- the base socket handling is actually in good shape. 1 socket per client-rs pair is about where we want to be.
- multiplexing requests on the same socket is good, not spawning extra threads server side just to handle more clients is also good. Since every client will have an open socket to at least the META region, this is very important!
- the handler pool is a natural side effect of the previous point, unbounding it might not be a good idea.
Other constraints:
- we will want to provide an efficient blocking API, it's what is expected.
- an async api might be nice, perhaps it can layer on or something.
- Making HTable thread agnostic might be useful. Pooling the write buffer or doing something else interesting there would be necessary.
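The point about proxies and InterruptedException can be reproduced with plain `java.lang.reflect.Proxy`, independent of HBase. In the sketch below the interfaces `Quiet` and `Declared` and the class `ProxyInterruptSketch` are hypothetical names chosen for illustration. When an invocation handler throws a checked exception that the interface method does not declare, the proxy wraps it in `UndeclaredThrowableException`; when the method declares it, the exception passes through unchanged.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.lang.reflect.UndeclaredThrowableException;

class ProxyInterruptSketch {
    /** Hypothetical interface that does NOT declare InterruptedException. */
    interface Quiet {
        void call();
    }

    /** Same contract, but with the checked exception declared. */
    interface Declared {
        void call() throws InterruptedException;
    }

    /** Builds a proxy whose handler always throws InterruptedException. */
    static <T> T proxyThatInterrupts(Class<T> iface) {
        InvocationHandler handler = (proxy, method, args) -> {
            throw new InterruptedException("simulated interrupt");
        };
        Object instance = Proxy.newProxyInstance(
            iface.getClassLoader(), new Class<?>[] { iface }, handler);
        return iface.cast(instance);
    }

    static String describeQuiet() {
        try {
            proxyThatInterrupts(Quiet.class).call();
            return "no exception";
        } catch (UndeclaredThrowableException e) {
            // The checked exception surfaces only as the wrapped cause.
            return "wrapped: " + e.getCause().getClass().getSimpleName();
        }
    }

    static String describeDeclared() {
        try {
            proxyThatInterrupts(Declared.class).call();
            return "no exception";
        } catch (InterruptedException e) {
            return "direct: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(describeQuiet());
        System.out.println(describeDeclared());
    }
}
```

This is why, with a proxy-based client, the interface signature ends up carrying a transport-level exception: otherwise callers only see the wrapper.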
[jira] Created: (HBASE-3523) Rewrite our client
Rewrite our client
------------------

Key: HBASE-3523
URL: https://issues.apache.org/jira/browse/HBASE-3523
Project: HBase
Issue Type: Brainstorming
Reporter: stack

Is it just me or do others sense that there is pressure building to redo the client? If just me, ignore the below... I'll just keep notes in here.
Otherwise, what would the requirements for a client rewrite look like?
+ Let out InterruptedException
+ Enveloping of messages or space for metadata that can be passed by client to server and by server to client; e.g. the region a.b.c moved to server x.y.z. or scanner is finished or timeout
+ A different RPC? One with tighter serialization.
+ More sane timeout/retry policy.
Does it have to support async communication? Do callbacks?
What else?
[jira] Resolved: (HBASE-3522) Unbundle our RPC versioning; rather than a global for all 4 Interfaces -- region, master, region to master, and coprocesssors -- instead version each individually
[ https://issues.apache.org/jira/browse/HBASE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-3522.

    Resolution: Fixed
    Fix Version/s: 0.92.0
    Assignee: stack
    Hadoop Flags: [Reviewed]

Applied to TRUNK (Thanks for review RR).

> Unbundle our RPC versioning; rather than a global for all 4 Interfaces -- region, master, region to master, and coprocesssors -- instead version each individually
>
> Key: HBASE-3522
> URL: https://issues.apache.org/jira/browse/HBASE-3522
> Project: HBase
> Issue Type: Improvement
> Reporter: stack
> Assignee: stack
> Fix For: 0.92.0
>
> We'd undo the global RPC version so a change in CP Interface or a change in the 'private' regionserver to master Interface would not break clients who do not use CPs or who don't care about the private regionserver to master protocol.
> Benoît suggested this. I want it because I want to get rid of heartbeating so will want to change the regionserver to master Interface.
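To make the idea of per-interface versioning concrete, here is a small sketch. Everything in it is an assumption made for illustration: the class `ProtocolVersionsSketch`, the protocol names, and the version numbers are invented, and the actual patch may work quite differently. The sketch only shows why a version bump on one interface need not affect clients of another.

```java
import java.util.HashMap;
import java.util.Map;

class ProtocolVersionsSketch {
    /** One version number per interface instead of a single global one. */
    private final Map<String, Long> serverVersions = new HashMap<>();

    void register(String protocol, long version) {
        serverVersions.put(protocol, version);
    }

    /** A client is compatible if the one interface it uses matches; others are ignored. */
    boolean compatible(String protocol, long clientVersion) {
        Long serverVersion = serverVersions.get(protocol);
        return serverVersion != null && serverVersion == clientVersion;
    }

    /** Hypothetical setup with four independently versioned interfaces. */
    static ProtocolVersionsSketch example() {
        ProtocolVersionsSketch server = new ProtocolVersionsSketch();
        server.register("region", 1L);
        server.register("master", 1L);
        server.register("regionserver-to-master", 1L);
        server.register("coprocessor", 1L);
        return server;
    }

    public static void main(String[] args) {
        ProtocolVersionsSketch server = example();

        // Change only the private regionserver-to-master interface.
        server.register("regionserver-to-master", 2L);

        System.out.println("region client still compatible: "
            + server.compatible("region", 1L));
        System.out.println("old regionserver-to-master peer compatible: "
            + server.compatible("regionserver-to-master", 1L));
    }
}
```

With a single global version, the bump in this example would have rejected the ordinary region client as well, which is the breakage the issue describes.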
[jira] Commented: (HBASE-3522) Unbundle our RPC versioning; rather than a global for all 4 Interfaces -- region, master, region to master, and coprocessors -- instead version each individually
[ https://issues.apache.org/jira/browse/HBASE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993199#comment-12993199 ]

ryan rawson commented on HBASE-3522:
------------------------------------

+1, commit that sucka!
[jira] Commented: (HBASE-3522) Unbundle our RPC versioning; rather than a global for all 4 Interfaces -- region, master, region to master, and coprocessors -- instead version each individually
[ https://issues.apache.org/jira/browse/HBASE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993192#comment-12993192 ]

stack commented on HBASE-3522:
------------------------------

I posted a patch up on review.cloudera.org: https://review.cloudera.org/r/1561/
[jira] Commented: (HBASE-3522) Unbundle our RPC versioning; rather than a global for all 4 Interfaces -- region, master, region to master, and coprocessors -- instead version each individually
[ https://issues.apache.org/jira/browse/HBASE-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993191#comment-12993191 ]

ryan rawson commented on HBASE-3522:
------------------------------------

The global version was previously required because the members of all our RPC interfaces were sorted and then sequentially assigned IDs, which were sent in place of the method-name strings. Now that we send strings for the method names instead of IDs, this JIRA is possible.
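To make the coupling described in this comment concrete, here is a small sketch of the old scheme as the comment describes it: pool the method names of every interface, sort them, and number them in order. It is an illustration only, not the actual HBase RPC code; the class `GlobalMethodIds` and the method names (`get`, `put`, `report`, `checkIn`) are invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not HBase's real ID assignment code.
class GlobalMethodIds {
    // Pools the method names of all interfaces, sorts them, and assigns sequential IDs.
    static Map<String, Integer> assign(List<List<String>> interfaces) {
        List<String> all = new ArrayList<>();
        for (List<String> methods : interfaces) {
            all.addAll(methods);
        }
        Collections.sort(all);
        Map<String, Integer> ids = new LinkedHashMap<>();
        for (int i = 0; i < all.size(); i++) {
            ids.put(all.get(i), i);
        }
        return ids;
    }
}
```

With a client interface of `get` and `put` and a master interface of `report`, `put` gets ID 1. Adding `checkIn` to the master interface alone shifts `put` to ID 2, so a client that never touches the master interface would still send the wrong ID. Sending the method name as a string removes that dependency, which is why the interfaces can then be versioned separately.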
[jira] Created: (HBASE-3522) Unbundle our RPC versioning; rather than a global for all 4 Interfaces -- region, master, region to master, and coprocessors -- instead version each individually
Unbundle our RPC versioning; rather than a global for all 4 Interfaces -- region, master, region to master, and coprocessors -- instead version each individually
------------------------------------------------------------

    Key: HBASE-3522
    URL: https://issues.apache.org/jira/browse/HBASE-3522
    Project: HBase
    Issue Type: Improvement
    Reporter: stack

We'd undo the global RPC version so that a change in the CP Interface, or a change in the 'private' regionserver to master Interface, would not break clients who do not use CPs or who don't care about the private regionserver to master protocol.

Benoît suggested this. I want it because I want to get rid of heartbeating, so I will want to change the regionserver to master Interface.
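A minimal sketch of what per-interface versioning could look like, assuming the server keeps one version number per interface and a caller is checked only against the interface it actually uses. The class `PerInterfaceVersions`, its methods, and the interface names and version numbers are hypothetical, not taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names and numbers are invented for illustration.
class PerInterfaceVersions {
    private final Map<String, Long> serverVersions = new HashMap<>();

    // Records (or bumps) the version of a single interface.
    void register(String interfaceName, long version) {
        serverVersions.put(interfaceName, version);
    }

    // A caller is compatible if the one interface it uses matches;
    // versions of the other interfaces are not consulted.
    boolean isCompatible(String interfaceName, long clientVersion) {
        Long serverVersion = serverVersions.get(interfaceName);
        return serverVersion != null && serverVersion == clientVersion;
    }
}
```

Under this scheme, bumping the regionserver to master entry leaves a client that only speaks the region interface unaffected, which is the behavior the issue asks for.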
[jira] Created: (HBASE-3521) Regions should be merged with others automatically when all data in the region has expired and been removed, or when the region gets too small
Regions should be merged with others automatically when all data in the region has expired and been removed, or when the region gets too small
------------------------------------------------------------

    Key: HBASE-3521
    URL: https://issues.apache.org/jira/browse/HBASE-3521
    Project: HBase
    Issue Type: Improvement
    Components: master, regionserver, scripts
    Affects Versions: 0.90.0
    Reporter: zhoushuaifeng
    Priority: Minor

We have tested a cluster that has more than 30,000 regions, with a maximum region size of 512MB. In this situation the total data is no longer growing, but old data is removed and new data is inserted, so the number of regions keeps increasing, and some regions may be very small or empty. These regions occupy too much heap, and the problem gets worse as long as regions cannot be merged. This limits how long HBase can keep running.

A script that surveys the cluster to remove empty regions, or that picks out adjacent small regions and then merges them online, seems like it would be useful.
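The survey step suggested above could be sketched roughly as follows, assuming region sizes are already available in key order so that neighbors in the array are adjacent regions. This is only an illustration of the selection logic: `MergeCandidates`, `pick`, and the `smallBytes` threshold are invented names, and the sketch does not perform any merge or call any HBase API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of choosing merge candidates; not an HBase tool.
class MergeCandidates {
    // sizesInKeyOrder[i] is the size of the i-th region in key order.
    // Returns index pairs {i, i + 1} where both neighbors are at or below smallBytes.
    static List<int[]> pick(long[] sizesInKeyOrder, long smallBytes) {
        List<int[]> pairs = new ArrayList<>();
        int i = 0;
        while (i + 1 < sizesInKeyOrder.length) {
            boolean bothSmall = sizesInKeyOrder[i] <= smallBytes
                    && sizesInKeyOrder[i + 1] <= smallBytes;
            if (bothSmall) {
                pairs.add(new int[] {i, i + 1});
                i += 2; // each region takes part in at most one merge per pass
            } else {
                i += 1;
            }
        }
        return pairs;
    }
}
```

Empty regions are covered by the same rule, since a size of zero is always below the threshold. Repeating the pass after each round of merges would keep shrinking the region count until no adjacent small pair remains.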