[jira] [Updated] (HBASE-3892) Table can't disable
[ https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-3892: -- Attachment: Hmaster_0.90.patch > Table can't disable > --- > > Key: HBASE-3892 > URL: https://issues.apache.org/jira/browse/HBASE-3892 > Project: HBase > Issue Type: Bug >Affects Versions: 0.90.3 >Reporter: gaojinchao > Fix For: 0.90.4 > > Attachments: Hmaster_0.90.patch > > > In TimeoutMonitor : > if node exists and node state is RS_ZK_REGION_CLOSED > We should send a zk message again when close region is timeout. > in this case, It may be loss some message. > I See. It seems like a bug. This is my analysis. > // disable table and master sent Close message to region server, Region state > was set PENDING_CLOSE > 2011-05-08 17:44:25,745 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to > serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, > usedHeap=4097, maxHeap=8175) for region > ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. > 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: > Received REGION_SPLIT: > ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: > Daughters; > ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., > > ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. > from C4C4.site,60020,1304820199467 > 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: > Received REGION_SPLIT: > ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: > Daughters; > ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., > > ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. > from C4C4.site,60020,1304820199467 > // received splitting message and cleared Region state (PENDING_CLOSE) > 2011-05-08 17:46:45,303 WARN > org.apache.hadoop.hbase.master.AssignmentManager: Overwriting > 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, > load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175) > 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: > Received REGION_SPLIT: > ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: > Daughters; > ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., > > ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. > from C4C4.site,60020,1304820199467 > 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: > Received REGION_SPLIT: > ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: > Daughters; > ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., > > ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. > from C4C4.site,60020,1304820199467 > 2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: > Received REGION_SPLIT: > ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: > Daughters; > ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., > > ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. > from C4C4.site,60020,1304820199467 > 2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: > Received REGION_SPLIT: > ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: > Daughters; > ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., > > ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. > from C4C4.site,60020,1304820199467 > 2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: > Received REGION_SPLIT: > ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: > Daughters; > ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., > > ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. > from C4C4.site,60020,1304820199467 > 2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: > Received REGION_SPLIT: > ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: > Daughters; > ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., > > ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. > from C4C4.site,60020,1304820199467 > 2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: > Received REGION_SPLIT: > ufdr,20110
[jira] [Created] (HBASE-3892) Table can't disable
Table can't disable --- Key: HBASE-3892 URL: https://issues.apache.org/jira/browse/HBASE-3892 Project: HBase Issue Type: Bug Affects Versions: 0.90.3 Reporter: gaojinchao Fix For: 0.90.4 In TimeoutMonitor : if node exists and node state is RS_ZK_REGION_CLOSED We should send a zk message again when close region is timeout. in this case, It may be loss some message. I See. It seems like a bug. This is my analysis. // disable table and master sent Close message to region server, Region state was set PENDING_CLOSE 2011-05-08 17:44:25,745 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175) for region ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 // received splitting message and cleared Region state (PENDING_CLOSE) 2011-05-08 17:46:45,303 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175) 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 2011-05-08 17:52:47,309 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/una
[jira] [Updated] (HBASE-3889) NPE in Distributed Log Splitting
[ https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-3889: - Priority: Blocker (was: Major) Assignee: stack (was: Lars George) Marking critical. I'll pick up where Lars left off (with some help from Prakash). Thanks Lars. > NPE in Distributed Log Splitting > > > Key: HBASE-3889 > URL: https://issues.apache.org/jira/browse/HBASE-3889 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 0.92.0 > Environment: Pseudo-distributed on MacOS >Reporter: Lars George >Assignee: stack >Priority: Blocker > Fix For: 0.92.0 > > Attachments: HBASE-3889.patch > > > There is an issue with the log splitting under the specific condition of > edits belonging to a non existing region (which went away after a split for > example). The HLogSplitter fails to check the condition, which is handled on > a lower level, logging manifests it as > {noformat} > 2011-05-16 13:56:10,300 INFO > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's > directory doesn't exist: > hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is > very likely that it was already split so it's safe to discard those edits. > {noformat} > The code returns a null reference which is not check in > HLogSplitter.splitLogFileToTemp(): > {code} > ... > WriterAndPath wap = (WriterAndPath)o; > if (wap == null) { > wap = createWAP(region, entry, rootDir, tmpname, fs, conf); > if (wap == null) { > logWriters.put(region, BAD_WRITER); > } else { > logWriters.put(region, wap); > } > } > wap.w.append(entry); > ... > {code} > The createWAP does return "null" when the above message is logged based on > the obsolete region reference in the edit. > What made this difficult to detect is that the error (and others) are > silently ignored in SplitLogWorker.grabTask(). I added a catch and error > logging to see the NPE that was caused by the above. > {code} > ... > break; > } > } catch (Exception e) { > LOG.error("An error occurred.", e); > } finally { > if (t > 0) { > ... > {code} > As a side note, there are other errors/asserts triggered that this > try/finally not handles. For example > {noformat} > 2011-05-16 13:58:30,647 WARN > org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to > assert ownership for > /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 > org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = > BadVersion for > /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:106) > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:42) > at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164) > at java.lang.Thread.run(Thread.java:680) > {noformat} > This should probably be handled - or at least documented - in another issue? > The NPE made the log split end and the SplitLogManager add an endless amount > of RESCAN entries as this never came to an end. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-1316) ZooKeeper: use native threads to avoid GC stalls (JNI integration)
[ https://issues.apache.org/jira/browse/HBASE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034298#comment-13034298 ] stack commented on HBASE-1316: -- Be our guest. Thanks Joey. > ZooKeeper: use native threads to avoid GC stalls (JNI integration) > -- > > Key: HBASE-1316 > URL: https://issues.apache.org/jira/browse/HBASE-1316 > Project: HBase > Issue Type: Improvement >Affects Versions: 0.20.0 >Reporter: Andrew Purtell >Assignee: Berk D. Demir > Attachments: zk_wrapper.tar.gz > > > From Joey Echeverria up on hbase-users@: > We've used zookeeper in a write-heavy project we've been working on and > experienced issues similar to what you described. After several days of > debugging, we discovered that our issue was garbage collection. There was no > way to guarantee we wouldn't have long pauses especially since our > environment was the worst case for garbage collection, millions of tiny, > short lived objects. I suspect HBase sees similar work loads frequently, if > it's not constantly. With anything shorter than a 30 second session time out, > we got session expiration events extremely frequently. We needed to use 60 > seconds for any real confidence that an ephemeral node disappearing meant > something was unavailable. > We really wanted quick recovery so we ended up writing a light-weight wrapper > around the C API and used swig to auto-generate a JNI interface. It's not > perfect, but since we switched to this method we've never seen a session > expiration event and ephemeral nodes only disappear when there are network > issues or a machine/process goes down. > I don't know if it's worth doing the same kind of thing for HBase as it adds > some "unnecessary" native code, but it's a solution that I found works. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-1316) ZooKeeper: use native threads to avoid GC stalls (JNI integration)
[ https://issues.apache.org/jira/browse/HBASE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034293#comment-13034293 ] Joey Echeverria commented on HBASE-1316: +1 on Todd et. al.'s design. I think it would be useful to use the JNI thread for (2) for the following reasons: 1) If the region server process goes down hard, but the machine is still up you won't see that failure for the long timeout (5 minutes) if the other session is in a separate JVM. 2) It's probably easier to manage a single JVM during startup and shutdown of a regionserver. If nobody minds, I'd like to take a stab it generating a patch for this. > ZooKeeper: use native threads to avoid GC stalls (JNI integration) > -- > > Key: HBASE-1316 > URL: https://issues.apache.org/jira/browse/HBASE-1316 > Project: HBase > Issue Type: Improvement >Affects Versions: 0.20.0 >Reporter: Andrew Purtell >Assignee: Berk D. Demir > Attachments: zk_wrapper.tar.gz > > > From Joey Echeverria up on hbase-users@: > We've used zookeeper in a write-heavy project we've been working on and > experienced issues similar to what you described. After several days of > debugging, we discovered that our issue was garbage collection. There was no > way to guarantee we wouldn't have long pauses especially since our > environment was the worst case for garbage collection, millions of tiny, > short lived objects. I suspect HBase sees similar work loads frequently, if > it's not constantly. With anything shorter than a 30 second session time out, > we got session expiration events extremely frequently. We needed to use 60 > seconds for any real confidence that an ephemeral node disappearing meant > something was unavailable. > We really wanted quick recovery so we ended up writing a light-weight wrapper > around the C API and used swig to auto-generate a JNI interface. It's not > perfect, but since we switched to this method we've never seen a session > expiration event and ephemeral nodes only disappear when there are network > issues or a machine/process goes down. > I don't know if it's worth doing the same kind of thing for HBase as it adds > some "unnecessary" native code, but it's a solution that I found works. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK
[ https://issues.apache.org/jira/browse/HBASE-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034291#comment-13034291 ] Lars George commented on HBASE-3890: Hi Prakash, thanks for the input! I think though this is unrelated, as this happened after the patch and restart. The messages should all have been the replay of the recovered logs. They do not show up since the first few will make them drop from the TaskMonitor because of the reuse. I am not sure where and how this errs, and if it does at all. But I got those "leaked" log hints and the UI did not show any running tasks as it should have. So something is amiss, but I still need to check what is wrong. > Scheduled tasks in distributed log splitting not in sync with ZK > > > Key: HBASE-3890 > URL: https://issues.apache.org/jira/browse/HBASE-3890 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 0.92.0 >Reporter: Lars George > Fix For: 0.92.0 > > > This is in continuation to HBASE-3889: > Note that there must be more slightly off here. Although the splitlogs znode > is now empty the master is still stuck here: > {noformat} > Doing distributed log split in > hdfs://localhost:8020/hbase/.logs/10.0.0.65,60020,1305406356765 > - Waiting for distributed tasks to finish. scheduled=2 done=1 error=0 4380s > Master startup > - Splitting logs after master startup 4388s > {noformat} > There seems to be an issue with what is in ZK and what the TaskBatch holds. > In my case it could be related to the fact that the task was already in ZK > after many faulty restarts because of the NPE. Maybe it was added once (since > that is keyed by path, and that is unique on my machine), but the reference > count upped twice? Now that the real one is done, the done counter has been > increased, but will never match the scheduled. > The code could also check if ZK is actually depleted, and therefore treat the > scheduled task as bogus? This of course only treats the symptom, not the root > cause of this condition. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3889) NPE in Distributed Log Splitting
[ https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034222#comment-13034222 ] Lars George commented on HBASE-3889: Hi Prakash, the issue I had was a second RegionServer process running and still giving me this error. I only had the extra "continue", but it wasn't picked up. I forgot to remove the second "if". I agree on the error handling, I just added the log to get some info. I guess I shouldn't have made the patch available through JIRA but simply attach it for someone to pick up. I am on a tight deadline right now and won't be able to work on a unit test - although I would love to help out, once I get my current assignment completed. > NPE in Distributed Log Splitting > > > Key: HBASE-3889 > URL: https://issues.apache.org/jira/browse/HBASE-3889 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 0.92.0 > Environment: Pseudo-distributed on MacOS >Reporter: Lars George >Assignee: Lars George > Fix For: 0.92.0 > > Attachments: HBASE-3889.patch > > > There is an issue with the log splitting under the specific condition of > edits belonging to a non existing region (which went away after a split for > example). The HLogSplitter fails to check the condition, which is handled on > a lower level, logging manifests it as > {noformat} > 2011-05-16 13:56:10,300 INFO > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's > directory doesn't exist: > hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is > very likely that it was already split so it's safe to discard those edits. > {noformat} > The code returns a null reference which is not check in > HLogSplitter.splitLogFileToTemp(): > {code} > ... > WriterAndPath wap = (WriterAndPath)o; > if (wap == null) { > wap = createWAP(region, entry, rootDir, tmpname, fs, conf); > if (wap == null) { > logWriters.put(region, BAD_WRITER); > } else { > logWriters.put(region, wap); > } > } > wap.w.append(entry); > ... > {code} > The createWAP does return "null" when the above message is logged based on > the obsolete region reference in the edit. > What made this difficult to detect is that the error (and others) are > silently ignored in SplitLogWorker.grabTask(). I added a catch and error > logging to see the NPE that was caused by the above. > {code} > ... > break; > } > } catch (Exception e) { > LOG.error("An error occurred.", e); > } finally { > if (t > 0) { > ... > {code} > As a side note, there are other errors/asserts triggered that this > try/finally not handles. For example > {noformat} > 2011-05-16 13:58:30,647 WARN > org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to > assert ownership for > /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 > org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = > BadVersion for > /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:106) > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:42) > at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164) > at java.lang.Thread.run(Thread.java:680) > {noformat} > This should probably be handled - or at least documented - in another issue? > The NPE made the log split end and the SplitLogManager add an endless amount > of RESCAN entries as this never came to an end. -- This message is automatically generated by
[jira] [Updated] (HBASE-3889) NPE in Distributed Log Splitting
[ https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars George updated HBASE-3889: --- Status: Open (was: Patch Available) Patch is too simplistic and needs little cleanup and better error handling, plus should have a unit test. > NPE in Distributed Log Splitting > > > Key: HBASE-3889 > URL: https://issues.apache.org/jira/browse/HBASE-3889 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 0.92.0 > Environment: Pseudo-distributed on MacOS >Reporter: Lars George >Assignee: Lars George > Fix For: 0.92.0 > > Attachments: HBASE-3889.patch > > > There is an issue with the log splitting under the specific condition of > edits belonging to a non existing region (which went away after a split for > example). The HLogSplitter fails to check the condition, which is handled on > a lower level, logging manifests it as > {noformat} > 2011-05-16 13:56:10,300 INFO > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's > directory doesn't exist: > hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is > very likely that it was already split so it's safe to discard those edits. > {noformat} > The code returns a null reference which is not check in > HLogSplitter.splitLogFileToTemp(): > {code} > ... > WriterAndPath wap = (WriterAndPath)o; > if (wap == null) { > wap = createWAP(region, entry, rootDir, tmpname, fs, conf); > if (wap == null) { > logWriters.put(region, BAD_WRITER); > } else { > logWriters.put(region, wap); > } > } > wap.w.append(entry); > ... > {code} > The createWAP does return "null" when the above message is logged based on > the obsolete region reference in the edit. > What made this difficult to detect is that the error (and others) are > silently ignored in SplitLogWorker.grabTask(). I added a catch and error > logging to see the NPE that was caused by the above. > {code} > ... > break; > } > } catch (Exception e) { > LOG.error("An error occurred.", e); > } finally { > if (t > 0) { > ... > {code} > As a side note, there are other errors/asserts triggered that this > try/finally not handles. For example > {noformat} > 2011-05-16 13:58:30,647 WARN > org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to > assert ownership for > /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 > org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = > BadVersion for > /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:106) > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:42) > at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164) > at java.lang.Thread.run(Thread.java:680) > {noformat} > This should probably be handled - or at least documented - in another issue? > The NPE made the log split end and the SplitLogManager add an endless amount > of RESCAN entries as this never came to an end. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK
[ https://issues.apache.org/jira/browse/HBASE-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034149#comment-13034149 ] Prakash Khemani commented on HBASE-3890: With the bug you identified in HBASE-3889 this behavior is expected. The SplitLogManager will put up a task, a SplitLogWorker will pick it up and will never complete it because of the bug. Manager will resubmit the task and another worker will pick it up to never complete it. The Manager resubmits at most hbase.splitlog.max.resubmit (default = 3) times after which the task hangs. > Scheduled tasks in distributed log splitting not in sync with ZK > > > Key: HBASE-3890 > URL: https://issues.apache.org/jira/browse/HBASE-3890 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 0.92.0 >Reporter: Lars George > Fix For: 0.92.0 > > > This is in continuation to HBASE-3889: > Note that there must be more slightly off here. Although the splitlogs znode > is now empty the master is still stuck here: > {noformat} > Doing distributed log split in > hdfs://localhost:8020/hbase/.logs/10.0.0.65,60020,1305406356765 > - Waiting for distributed tasks to finish. scheduled=2 done=1 error=0 4380s > Master startup > - Splitting logs after master startup 4388s > {noformat} > There seems to be an issue with what is in ZK and what the TaskBatch holds. > In my case it could be related to the fact that the task was already in ZK > after many faulty restarts because of the NPE. Maybe it was added once (since > that is keyed by path, and that is unique on my machine), but the reference > count upped twice? Now that the real one is done, the done counter has been > increased, but will never match the scheduled. > The code could also check if ZK is actually depleted, and therefore treat the > scheduled task as bogus? This of course only treats the symptom, not the root > cause of this condition. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3889) NPE in Distributed Log Splitting
[ https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034139#comment-13034139 ] Prakash Khemani commented on HBASE-3889: Thanks Lars for finding and providing a fix for this issue. I have a few minor comments on the patch ... The following isn't really needed, the earlier check you put in should be good enough. {code} +if (wap == null) { + continue; +} {code} It might be better to catch Throwable in SplitLogWorker.run() and print the Unexpected Error message there. It might not be a good thing to ignore an unexpected exception in SplitLogWorker.grabTask() and continue. {nofrmat} +++ src/main/java/org/apache/hadoop/hbase/regionserver/SplitLogWorker.java (working copy) @@ -297,6 +297,8 @@ } break; } +} catch (Exception e) { + LOG.error("An error occurred.", e); } finally { if (t > 0) { LOG.info("worker " + serverName + " done with task " + path + {noformat} > NPE in Distributed Log Splitting > > > Key: HBASE-3889 > URL: https://issues.apache.org/jira/browse/HBASE-3889 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 0.92.0 > Environment: Pseudo-distributed on MacOS >Reporter: Lars George >Assignee: Lars George > Fix For: 0.92.0 > > Attachments: HBASE-3889.patch > > > There is an issue with the log splitting under the specific condition of > edits belonging to a non existing region (which went away after a split for > example). The HLogSplitter fails to check the condition, which is handled on > a lower level, logging manifests it as > {noformat} > 2011-05-16 13:56:10,300 INFO > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's > directory doesn't exist: > hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is > very likely that it was already split so it's safe to discard those edits. > {noformat} > The code returns a null reference which is not check in > HLogSplitter.splitLogFileToTemp(): > {code} > ... > WriterAndPath wap = (WriterAndPath)o; > if (wap == null) { > wap = createWAP(region, entry, rootDir, tmpname, fs, conf); > if (wap == null) { > logWriters.put(region, BAD_WRITER); > } else { > logWriters.put(region, wap); > } > } > wap.w.append(entry); > ... > {code} > The createWAP does return "null" when the above message is logged based on > the obsolete region reference in the edit. > What made this difficult to detect is that the error (and others) are > silently ignored in SplitLogWorker.grabTask(). I added a catch and error > logging to see the NPE that was caused by the above. > {code} > ... > break; > } > } catch (Exception e) { > LOG.error("An error occurred.", e); > } finally { > if (t > 0) { > ... > {code} > As a side note, there are other errors/asserts triggered that this > try/finally not handles. For example > {noformat} > 2011-05-16 13:58:30,647 WARN > org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to > assert ownership for > /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 > org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = > BadVersion for > /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:106) > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:42) > at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164) >
[jira] [Commented] (HBASE-2647) TestLogRolling flaky in trunk
[ https://issues.apache.org/jira/browse/HBASE-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034116#comment-13034116 ] Alex Newman commented on HBASE-2647: Could this be related to http://ria101.wordpress.com/2010/01/28/setup-hbase-in-pseudo-distributed-mode-and-connect-java-client/. Did the server launch on the external interface? > TestLogRolling flaky in trunk > - > > Key: HBASE-2647 > URL: https://issues.apache.org/jira/browse/HBASE-2647 > Project: HBase > Issue Type: Bug > Components: test, wal >Affects Versions: 0.90.0 >Reporter: Todd Lipcon >Priority: Critical > > Saw this failure on my Hudson: > org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying > to locate root region because: Failed setting up proxy to > /192.168.42.22:44708 after attempts=1 > at > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:1023) > at > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:630) > at > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:606) > at > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:676) > at > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:635) > at > org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:606) > at org.apache.hadoop.hbase.client.HTable.(HTable.java:132) > at > org.apache.hadoop.hbase.HBaseClusterTestCase.hBaseClusterSetup(HBaseClusterTestCase.java:110) > at > org.apache.hadoop.hbase.HBaseClusterTestCase.setUp(HBaseClusterTestCase.java:147) > at > org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.setUp(TestLogRolling.java:131) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3889) NPE in Distributed Log Splitting
[ https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034032#comment-13034032 ] Todd Lipcon commented on HBASE-3889: Unit test? > NPE in Distributed Log Splitting > > > Key: HBASE-3889 > URL: https://issues.apache.org/jira/browse/HBASE-3889 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 0.92.0 > Environment: Pseudo-distributed on MacOS >Reporter: Lars George >Assignee: Lars George > Fix For: 0.92.0 > > Attachments: HBASE-3889.patch > > > There is an issue with the log splitting under the specific condition of > edits belonging to a non existing region (which went away after a split for > example). The HLogSplitter fails to check the condition, which is handled on > a lower level, logging manifests it as > {noformat} > 2011-05-16 13:56:10,300 INFO > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's > directory doesn't exist: > hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is > very likely that it was already split so it's safe to discard those edits. > {noformat} > The code returns a null reference which is not check in > HLogSplitter.splitLogFileToTemp(): > {code} > ... > WriterAndPath wap = (WriterAndPath)o; > if (wap == null) { > wap = createWAP(region, entry, rootDir, tmpname, fs, conf); > if (wap == null) { > logWriters.put(region, BAD_WRITER); > } else { > logWriters.put(region, wap); > } > } > wap.w.append(entry); > ... > {code} > The createWAP does return "null" when the above message is logged based on > the obsolete region reference in the edit. > What made this difficult to detect is that the error (and others) are > silently ignored in SplitLogWorker.grabTask(). I added a catch and error > logging to see the NPE that was caused by the above. > {code} > ... > break; > } > } catch (Exception e) { > LOG.error("An error occurred.", e); > } finally { > if (t > 0) { > ... > {code} > As a side note, there are other errors/asserts triggered that this > try/finally not handles. For example > {noformat} > 2011-05-16 13:58:30,647 WARN > org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to > assert ownership for > /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 > org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = > BadVersion for > /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:106) > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:42) > at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164) > at java.lang.Thread.run(Thread.java:680) > {noformat} > This should probably be handled - or at least documented - in another issue? > The NPE made the log split end and the SplitLogManager add an endless amount > of RESCAN entries as this never came to an end. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK
[ https://issues.apache.org/jira/browse/HBASE-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034015#comment-13034015 ] Lars George commented on HBASE-3890: I restarted the cluster (the pseudo distributed local instance) and then it kicked into gear with the replaying of the logs, taking up from where it was stuck before in limbo. > Scheduled tasks in distributed log splitting not in sync with ZK > > > Key: HBASE-3890 > URL: https://issues.apache.org/jira/browse/HBASE-3890 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 0.92.0 >Reporter: Lars George > Fix For: 0.92.0 > > > This is in continuation to HBASE-3889: > Note that there must be more slightly off here. Although the splitlogs znode > is now empty the master is still stuck here: > {noformat} > Doing distributed log split in > hdfs://localhost:8020/hbase/.logs/10.0.0.65,60020,1305406356765 > - Waiting for distributed tasks to finish. scheduled=2 done=1 error=0 4380s > Master startup > - Splitting logs after master startup 4388s > {noformat} > There seems to be an issue with what is in ZK and what the TaskBatch holds. > In my case it could be related to the fact that the task was already in ZK > after many faulty restarts because of the NPE. Maybe it was added once (since > that is keyed by path, and that is unique on my machine), but the reference > count upped twice? Now that the real one is done, the done counter has been > increased, but will never match the scheduled. > The code could also check if ZK is actually depleted, and therefore treat the > scheduled task as bogus? This of course only treats the symptom, not the root > cause of this condition. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-3891) TaskMonitor is used wrong in some places
TaskMonitor is used wrong in some places Key: HBASE-3891 URL: https://issues.apache.org/jira/browse/HBASE-3891 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.92.0 Reporter: Lars George Fix For: 0.92.0 I have a long running log replay in progress but none of the updates show. This is caused by reusing the MonitorTask references wrong, and manifests itself like this in the logs: {noformat} 2011-05-16 15:22:18,127 WARN org.apache.hadoop.hbase.monitoring.TaskMonitor: Status org.apache.hadoop.hbase.monitoring.MonitoredTaskImpl@51bfa303 appears to have been leaked 2011-05-16 15:22:18,128 DEBUG org.apache.hadoop.hbase.monitoring.MonitoredTask: cleanup. {noformat} The cleanup sets the completion timestamp and causes the task to be purged from the list. After that the UI for example does not show any further running tasks, although from the logs I can see (with my log additions): {noformat} 2011-05-16 15:29:52,296 DEBUG org.apache.hadoop.hbase.monitoring.MonitoredTask: setStatus: Compaction complete: 103.1m in 18542ms 2011-05-16 15:29:52,296 DEBUG org.apache.hadoop.hbase.monitoring.MonitoredTask: setStatus: Running coprocessor post-compact hooks 2011-05-16 15:29:52,296 DEBUG org.apache.hadoop.hbase.monitoring.MonitoredTask: setStatus: Compaction complete 2011-05-16 15:29:52,297 DEBUG org.apache.hadoop.hbase.monitoring.MonitoredTask: markComplete: Compaction complete {noformat} They are silently ignored as the TaskMonitor has dropped their reference. We need to figure out why a supposedly completed task monitor was reused. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK
Scheduled tasks in distributed log splitting not in sync with ZK Key: HBASE-3890 URL: https://issues.apache.org/jira/browse/HBASE-3890 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.92.0 Reporter: Lars George Fix For: 0.92.0 This is in continuation to HBASE-3889: Note that there must be more slightly off here. Although the splitlogs znode is now empty the master is still stuck here: {noformat} Doing distributed log split in hdfs://localhost:8020/hbase/.logs/10.0.0.65,60020,1305406356765 - Waiting for distributed tasks to finish. scheduled=2 done=1 error=0 4380s Master startup - Splitting logs after master startup 4388s {noformat} There seems to be an issue with what is in ZK and what the TaskBatch holds. In my case it could be related to the fact that the task was already in ZK after many faulty restarts because of the NPE. Maybe it was added once (since that is keyed by path, and that is unique on my machine), but the reference count upped twice? Now that the real one is done, the done counter has been increased, but will never match the scheduled. The code could also check if ZK is actually depleted, and therefore treat the scheduled task as bogus? This of course only treats the symptom, not the root cause of this condition. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-3889) NPE in Distributed Log Splitting
[ https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars George updated HBASE-3889: --- Status: Patch Available (was: Open) > NPE in Distributed Log Splitting > > > Key: HBASE-3889 > URL: https://issues.apache.org/jira/browse/HBASE-3889 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 0.92.0 > Environment: Pseudo-distributed on MacOS >Reporter: Lars George >Assignee: Lars George > Fix For: 0.92.0 > > Attachments: HBASE-3889.patch > > > There is an issue with the log splitting under the specific condition of > edits belonging to a non existing region (which went away after a split for > example). The HLogSplitter fails to check the condition, which is handled on > a lower level, logging manifests it as > {noformat} > 2011-05-16 13:56:10,300 INFO > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's > directory doesn't exist: > hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is > very likely that it was already split so it's safe to discard those edits. > {noformat} > The code returns a null reference which is not check in > HLogSplitter.splitLogFileToTemp(): > {code} > ... > WriterAndPath wap = (WriterAndPath)o; > if (wap == null) { > wap = createWAP(region, entry, rootDir, tmpname, fs, conf); > if (wap == null) { > logWriters.put(region, BAD_WRITER); > } else { > logWriters.put(region, wap); > } > } > wap.w.append(entry); > ... > {code} > The createWAP does return "null" when the above message is logged based on > the obsolete region reference in the edit. > What made this difficult to detect is that the error (and others) are > silently ignored in SplitLogWorker.grabTask(). I added a catch and error > logging to see the NPE that was caused by the above. > {code} > ... > break; > } > } catch (Exception e) { > LOG.error("An error occurred.", e); > } finally { > if (t > 0) { > ... > {code} > As a side note, there are other errors/asserts triggered that this > try/finally not handles. For example > {noformat} > 2011-05-16 13:58:30,647 WARN > org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to > assert ownership for > /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 > org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = > BadVersion for > /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:106) > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:42) > at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164) > at java.lang.Thread.run(Thread.java:680) > {noformat} > This should probably be handled - or at least documented - in another issue? > The NPE made the log split end and the SplitLogManager add an endless amount > of RESCAN entries as this never came to an end. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3889) NPE in Distributed Log Splitting
[ https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033994#comment-13033994 ] Lars George commented on HBASE-3889: After adding the patch I got eventually this {noformat} ... 2011-05-16 14:15:31,428 DEBUG org.apache.hadoop.hbase.monitoring.MonitoredTask: markComplete: processed 50012 edits across 47 regions threw away edits for 13 regions log file = hdfs://localhost/hbase/.logs/10.0.0.65,60020,1305406356765/10.0.0.65%2C60020%2C1305406356765.1305409968389 is corrupted = false 2011-05-16 14:15:31,428 DEBUG org.apache.hadoop.hbase.regionserver.SplitLogWorker: After Exec Status: DONE 2011-05-16 14:15:31,429 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: successfully transitioned task /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 to final state done 2011-05-16 14:15:31,429 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker 10.0.0.64,60020,1305546944326 done with task /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 in 879381ms ... {noformat} This was a 15GB edits that lingered since I had a crashed YSCB bulk load. The extra messages above are from log statements I added everywhere to see what is going on (might make sense to add a debug level output of MonitoredTaskImpl.setStatus(), was helpful here to see where it got lost). > NPE in Distributed Log Splitting > > > Key: HBASE-3889 > URL: https://issues.apache.org/jira/browse/HBASE-3889 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 0.92.0 > Environment: Pseudo-distributed on MacOS >Reporter: Lars George >Assignee: Lars George > Fix For: 0.92.0 > > Attachments: HBASE-3889.patch > > > There is an issue with the log splitting under the specific condition of > edits belonging to a non existing region (which went away after a split for > example). The HLogSplitter fails to check the condition, which is handled on > a lower level, logging manifests it as > {noformat} > 2011-05-16 13:56:10,300 INFO > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's > directory doesn't exist: > hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is > very likely that it was already split so it's safe to discard those edits. > {noformat} > The code returns a null reference which is not check in > HLogSplitter.splitLogFileToTemp(): > {code} > ... > WriterAndPath wap = (WriterAndPath)o; > if (wap == null) { > wap = createWAP(region, entry, rootDir, tmpname, fs, conf); > if (wap == null) { > logWriters.put(region, BAD_WRITER); > } else { > logWriters.put(region, wap); > } > } > wap.w.append(entry); > ... > {code} > The createWAP does return "null" when the above message is logged based on > the obsolete region reference in the edit. > What made this difficult to detect is that the error (and others) are > silently ignored in SplitLogWorker.grabTask(). I added a catch and error > logging to see the NPE that was caused by the above. > {code} > ... > break; > } > } catch (Exception e) { > LOG.error("An error occurred.", e); > } finally { > if (t > 0) { > ... > {code} > As a side note, there are other errors/asserts triggered that this > try/finally not handles. For example > {noformat} > 2011-05-16 13:58:30,647 WARN > org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to > assert ownership for > /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 > org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = > BadVersion for > /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:106) > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:42) > at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432) >
[jira] [Assigned] (HBASE-3889) NPE in Distributed Log Splitting
[ https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars George reassigned HBASE-3889: -- Assignee: Lars George > NPE in Distributed Log Splitting > > > Key: HBASE-3889 > URL: https://issues.apache.org/jira/browse/HBASE-3889 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 0.92.0 > Environment: Pseudo-distributed on MacOS >Reporter: Lars George >Assignee: Lars George > Fix For: 0.92.0 > > Attachments: HBASE-3889.patch > > > There is an issue with the log splitting under the specific condition of > edits belonging to a non existing region (which went away after a split for > example). The HLogSplitter fails to check the condition, which is handled on > a lower level, logging manifests it as > {noformat} > 2011-05-16 13:56:10,300 INFO > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's > directory doesn't exist: > hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is > very likely that it was already split so it's safe to discard those edits. > {noformat} > The code returns a null reference which is not check in > HLogSplitter.splitLogFileToTemp(): > {code} > ... > WriterAndPath wap = (WriterAndPath)o; > if (wap == null) { > wap = createWAP(region, entry, rootDir, tmpname, fs, conf); > if (wap == null) { > logWriters.put(region, BAD_WRITER); > } else { > logWriters.put(region, wap); > } > } > wap.w.append(entry); > ... > {code} > The createWAP does return "null" when the above message is logged based on > the obsolete region reference in the edit. > What made this difficult to detect is that the error (and others) are > silently ignored in SplitLogWorker.grabTask(). I added a catch and error > logging to see the NPE that was caused by the above. > {code} > ... > break; > } > } catch (Exception e) { > LOG.error("An error occurred.", e); > } finally { > if (t > 0) { > ... > {code} > As a side note, there are other errors/asserts triggered that this > try/finally not handles. For example > {noformat} > 2011-05-16 13:58:30,647 WARN > org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to > assert ownership for > /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 > org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = > BadVersion for > /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:106) > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:42) > at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164) > at java.lang.Thread.run(Thread.java:680) > {noformat} > This should probably be handled - or at least documented - in another issue? > The NPE made the log split end and the SplitLogManager add an endless amount > of RESCAN entries as this never came to an end. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-3889) NPE in Distributed Log Splitting
[ https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars George updated HBASE-3889: --- Attachment: HBASE-3889.patch Trivial patch, skips null reference, discarding the obsolete edit. > NPE in Distributed Log Splitting > > > Key: HBASE-3889 > URL: https://issues.apache.org/jira/browse/HBASE-3889 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 0.92.0 > Environment: Pseudo-distributed on MacOS >Reporter: Lars George > Fix For: 0.92.0 > > Attachments: HBASE-3889.patch > > > There is an issue with the log splitting under the specific condition of > edits belonging to a non existing region (which went away after a split for > example). The HLogSplitter fails to check the condition, which is handled on > a lower level, logging manifests it as > {noformat} > 2011-05-16 13:56:10,300 INFO > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's > directory doesn't exist: > hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is > very likely that it was already split so it's safe to discard those edits. > {noformat} > The code returns a null reference which is not check in > HLogSplitter.splitLogFileToTemp(): > {code} > ... > WriterAndPath wap = (WriterAndPath)o; > if (wap == null) { > wap = createWAP(region, entry, rootDir, tmpname, fs, conf); > if (wap == null) { > logWriters.put(region, BAD_WRITER); > } else { > logWriters.put(region, wap); > } > } > wap.w.append(entry); > ... > {code} > The createWAP does return "null" when the above message is logged based on > the obsolete region reference in the edit. > What made this difficult to detect is that the error (and others) are > silently ignored in SplitLogWorker.grabTask(). I added a catch and error > logging to see the NPE that was caused by the above. > {code} > ... > break; > } > } catch (Exception e) { > LOG.error("An error occurred.", e); > } finally { > if (t > 0) { > ... > {code} > As a side note, there are other errors/asserts triggered that this > try/finally not handles. For example > {noformat} > 2011-05-16 13:58:30,647 WARN > org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to > assert ownership for > /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 > org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = > BadVersion for > /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:106) > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:42) > at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164) > at java.lang.Thread.run(Thread.java:680) > {noformat} > This should probably be handled - or at least documented - in another issue? > The NPE made the log split end and the SplitLogManager add an endless amount > of RESCAN entries as this never came to an end. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-3889) NPE in Distributed Log Splitting
NPE in Distributed Log Splitting Key: HBASE-3889 URL: https://issues.apache.org/jira/browse/HBASE-3889 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.92.0 Environment: Pseudo-distributed on MacOS Reporter: Lars George Fix For: 0.92.0 There is an issue with the log splitting under the specific condition of edits belonging to a non existing region (which went away after a split for example). The HLogSplitter fails to check the condition, which is handled on a lower level, logging manifests it as {noformat} 2011-05-16 13:56:10,300 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's directory doesn't exist: hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is very likely that it was already split so it's safe to discard those edits. {noformat} The code returns a null reference which is not check in HLogSplitter.splitLogFileToTemp(): {code} ... WriterAndPath wap = (WriterAndPath)o; if (wap == null) { wap = createWAP(region, entry, rootDir, tmpname, fs, conf); if (wap == null) { logWriters.put(region, BAD_WRITER); } else { logWriters.put(region, wap); } } wap.w.append(entry); ... {code} The createWAP does return "null" when the above message is logged based on the obsolete region reference in the edit. What made this difficult to detect is that the error (and others) are silently ignored in SplitLogWorker.grabTask(). I added a catch and error logging to see the NPE that was caused by the above. {code} ... break; } } catch (Exception e) { LOG.error("An error occurred.", e); } finally { if (t > 0) { ... {code} As a side note, there are other errors/asserts triggered that this try/finally not handles. For example {noformat} 2011-05-16 13:58:30,647 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to assert ownership for /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389 at org.apache.zookeeper.KeeperException.create(KeeperException.java:106) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68) at org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354) at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164) at java.lang.Thread.run(Thread.java:680) {noformat} This should probably be handled - or at least documented - in another issue? The NPE made the log split end and the SplitLogManager add an endless amount of RESCAN entries as this never came to an end. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira