[jira] [Created] (HBASE-3889) NPE in Distributed Log Splitting

2011-05-16 Thread Lars George (JIRA)
NPE in Distributed Log Splitting


 Key: HBASE-3889
 URL: https://issues.apache.org/jira/browse/HBASE-3889
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0
 Environment: Pseudo-distributed on MacOS
Reporter: Lars George
 Fix For: 0.92.0


There is an issue with the log splitting under the specific condition of edits 
belonging to a non existing region (which went away after a split for example). 
The HLogSplitter fails to check the condition, which is handled on a lower 
level, logging manifests it as 

{noformat}
2011-05-16 13:56:10,300 INFO 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's directory 
doesn't exist: 
hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is 
very likely that it was already split so it's safe to discard those edits.
{noformat}

The code returns a null reference which is not check in 
HLogSplitter.splitLogFileToTemp():

{code}
...
WriterAndPath wap = (WriterAndPath)o;
if (wap == null) {
  wap = createWAP(region, entry, rootDir, tmpname, fs, conf);
  if (wap == null) {
logWriters.put(region, BAD_WRITER);
  } else {
logWriters.put(region, wap);
  }
}
wap.w.append(entry);
...
{code}

The createWAP does return null when the above message is logged based on the 
obsolete region reference in the edit.

What made this difficult to detect is that the error (and others) are silently 
ignored in SplitLogWorker.grabTask(). I added a catch and error logging to see 
the NPE that was caused by the above.

{code}
...
  break;
  }
} catch (Exception e) {
  LOG.error(An error occurred., e);
} finally {
  if (t  0) {
...
{code}

As a side note, there are other errors/asserts triggered that this try/finally 
not handles. For example

{noformat}
2011-05-16 13:58:30,647 WARN 
org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to 
assert ownership for 
/hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
BadVersion for 
/hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
at 
org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329)
at 
org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68)
at 
org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265)
at 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432)
at 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354)
at 
org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
at 
org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260)
at 
org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191)
at 
org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164)
at java.lang.Thread.run(Thread.java:680)
{noformat}

This should probably be handled - or at least documented - in another issue?

The NPE made the log split end and the SplitLogManager add an endless amount of 
RESCAN entries as this never came to an end.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-3889) NPE in Distributed Log Splitting

2011-05-16 Thread Lars George (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars George updated HBASE-3889:
---

Attachment: HBASE-3889.patch

Trivial patch, skips null reference, discarding the obsolete edit.

 NPE in Distributed Log Splitting
 

 Key: HBASE-3889
 URL: https://issues.apache.org/jira/browse/HBASE-3889
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0
 Environment: Pseudo-distributed on MacOS
Reporter: Lars George
 Fix For: 0.92.0

 Attachments: HBASE-3889.patch


 There is an issue with the log splitting under the specific condition of 
 edits belonging to a non existing region (which went away after a split for 
 example). The HLogSplitter fails to check the condition, which is handled on 
 a lower level, logging manifests it as 
 {noformat}
 2011-05-16 13:56:10,300 INFO 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's 
 directory doesn't exist: 
 hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is 
 very likely that it was already split so it's safe to discard those edits.
 {noformat}
 The code returns a null reference which is not check in 
 HLogSplitter.splitLogFileToTemp():
 {code}
 ...
 WriterAndPath wap = (WriterAndPath)o;
 if (wap == null) {
   wap = createWAP(region, entry, rootDir, tmpname, fs, conf);
   if (wap == null) {
 logWriters.put(region, BAD_WRITER);
   } else {
 logWriters.put(region, wap);
   }
 }
 wap.w.append(entry);
 ...
 {code}
 The createWAP does return null when the above message is logged based on 
 the obsolete region reference in the edit.
 What made this difficult to detect is that the error (and others) are 
 silently ignored in SplitLogWorker.grabTask(). I added a catch and error 
 logging to see the NPE that was caused by the above.
 {code}
 ...
   break;
   }
 } catch (Exception e) {
   LOG.error(An error occurred., e);
 } finally {
   if (t  0) {
 ...
 {code}
 As a side note, there are other errors/asserts triggered that this 
 try/finally not handles. For example
 {noformat}
 2011-05-16 13:58:30,647 WARN 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to 
 assert ownership for 
 /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
 BadVersion for 
 /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
 at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164)
 at java.lang.Thread.run(Thread.java:680)
 {noformat}
 This should probably be handled - or at least documented - in another issue?
 The NPE made the log split end and the SplitLogManager add an endless amount 
 of RESCAN entries as this never came to an end.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (HBASE-3889) NPE in Distributed Log Splitting

2011-05-16 Thread Lars George (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars George reassigned HBASE-3889:
--

Assignee: Lars George

 NPE in Distributed Log Splitting
 

 Key: HBASE-3889
 URL: https://issues.apache.org/jira/browse/HBASE-3889
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0
 Environment: Pseudo-distributed on MacOS
Reporter: Lars George
Assignee: Lars George
 Fix For: 0.92.0

 Attachments: HBASE-3889.patch


 There is an issue with the log splitting under the specific condition of 
 edits belonging to a non existing region (which went away after a split for 
 example). The HLogSplitter fails to check the condition, which is handled on 
 a lower level, logging manifests it as 
 {noformat}
 2011-05-16 13:56:10,300 INFO 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's 
 directory doesn't exist: 
 hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is 
 very likely that it was already split so it's safe to discard those edits.
 {noformat}
 The code returns a null reference which is not check in 
 HLogSplitter.splitLogFileToTemp():
 {code}
 ...
 WriterAndPath wap = (WriterAndPath)o;
 if (wap == null) {
   wap = createWAP(region, entry, rootDir, tmpname, fs, conf);
   if (wap == null) {
 logWriters.put(region, BAD_WRITER);
   } else {
 logWriters.put(region, wap);
   }
 }
 wap.w.append(entry);
 ...
 {code}
 The createWAP does return null when the above message is logged based on 
 the obsolete region reference in the edit.
 What made this difficult to detect is that the error (and others) are 
 silently ignored in SplitLogWorker.grabTask(). I added a catch and error 
 logging to see the NPE that was caused by the above.
 {code}
 ...
   break;
   }
 } catch (Exception e) {
   LOG.error(An error occurred., e);
 } finally {
   if (t  0) {
 ...
 {code}
 As a side note, there are other errors/asserts triggered that this 
 try/finally not handles. For example
 {noformat}
 2011-05-16 13:58:30,647 WARN 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to 
 assert ownership for 
 /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
 BadVersion for 
 /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
 at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164)
 at java.lang.Thread.run(Thread.java:680)
 {noformat}
 This should probably be handled - or at least documented - in another issue?
 The NPE made the log split end and the SplitLogManager add an endless amount 
 of RESCAN entries as this never came to an end.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3889) NPE in Distributed Log Splitting

2011-05-16 Thread Lars George (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033994#comment-13033994
 ] 

Lars George commented on HBASE-3889:


After adding the patch I got eventually this

{noformat}
...
2011-05-16 14:15:31,428 DEBUG org.apache.hadoop.hbase.monitoring.MonitoredTask: 
markComplete: processed 50012 edits across 47 regions threw away edits for 13 
regions log file = 
hdfs://localhost/hbase/.logs/10.0.0.65,60020,1305406356765/10.0.0.65%2C60020%2C1305406356765.1305409968389
 is corrupted = false
2011-05-16 14:15:31,428 DEBUG 
org.apache.hadoop.hbase.regionserver.SplitLogWorker: After Exec Status: DONE
2011-05-16 14:15:31,429 INFO 
org.apache.hadoop.hbase.regionserver.SplitLogWorker: successfully transitioned 
task 
/hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 to final state done
2011-05-16 14:15:31,429 INFO 
org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker 
10.0.0.64,60020,1305546944326 done with task 
/hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 in 879381ms
...
{noformat}

This was a 15GB edits that lingered since I had a crashed YSCB bulk load. The 
extra messages above are from log statements I added everywhere to see what is 
going on (might make sense to add a debug level output of 
MonitoredTaskImpl.setStatus(), was helpful here to see where it got lost).


 NPE in Distributed Log Splitting
 

 Key: HBASE-3889
 URL: https://issues.apache.org/jira/browse/HBASE-3889
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0
 Environment: Pseudo-distributed on MacOS
Reporter: Lars George
Assignee: Lars George
 Fix For: 0.92.0

 Attachments: HBASE-3889.patch


 There is an issue with the log splitting under the specific condition of 
 edits belonging to a non existing region (which went away after a split for 
 example). The HLogSplitter fails to check the condition, which is handled on 
 a lower level, logging manifests it as 
 {noformat}
 2011-05-16 13:56:10,300 INFO 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's 
 directory doesn't exist: 
 hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is 
 very likely that it was already split so it's safe to discard those edits.
 {noformat}
 The code returns a null reference which is not check in 
 HLogSplitter.splitLogFileToTemp():
 {code}
 ...
 WriterAndPath wap = (WriterAndPath)o;
 if (wap == null) {
   wap = createWAP(region, entry, rootDir, tmpname, fs, conf);
   if (wap == null) {
 logWriters.put(region, BAD_WRITER);
   } else {
 logWriters.put(region, wap);
   }
 }
 wap.w.append(entry);
 ...
 {code}
 The createWAP does return null when the above message is logged based on 
 the obsolete region reference in the edit.
 What made this difficult to detect is that the error (and others) are 
 silently ignored in SplitLogWorker.grabTask(). I added a catch and error 
 logging to see the NPE that was caused by the above.
 {code}
 ...
   break;
   }
 } catch (Exception e) {
   LOG.error(An error occurred., e);
 } finally {
   if (t  0) {
 ...
 {code}
 As a side note, there are other errors/asserts triggered that this 
 try/finally not handles. For example
 {noformat}
 2011-05-16 13:58:30,647 WARN 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to 
 assert ownership for 
 /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
 BadVersion for 
 /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
 at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432)
 at 
 

[jira] [Updated] (HBASE-3889) NPE in Distributed Log Splitting

2011-05-16 Thread Lars George (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars George updated HBASE-3889:
---

Status: Patch Available  (was: Open)

 NPE in Distributed Log Splitting
 

 Key: HBASE-3889
 URL: https://issues.apache.org/jira/browse/HBASE-3889
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0
 Environment: Pseudo-distributed on MacOS
Reporter: Lars George
Assignee: Lars George
 Fix For: 0.92.0

 Attachments: HBASE-3889.patch


 There is an issue with the log splitting under the specific condition of 
 edits belonging to a non existing region (which went away after a split for 
 example). The HLogSplitter fails to check the condition, which is handled on 
 a lower level, logging manifests it as 
 {noformat}
 2011-05-16 13:56:10,300 INFO 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's 
 directory doesn't exist: 
 hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is 
 very likely that it was already split so it's safe to discard those edits.
 {noformat}
 The code returns a null reference which is not check in 
 HLogSplitter.splitLogFileToTemp():
 {code}
 ...
 WriterAndPath wap = (WriterAndPath)o;
 if (wap == null) {
   wap = createWAP(region, entry, rootDir, tmpname, fs, conf);
   if (wap == null) {
 logWriters.put(region, BAD_WRITER);
   } else {
 logWriters.put(region, wap);
   }
 }
 wap.w.append(entry);
 ...
 {code}
 The createWAP does return null when the above message is logged based on 
 the obsolete region reference in the edit.
 What made this difficult to detect is that the error (and others) are 
 silently ignored in SplitLogWorker.grabTask(). I added a catch and error 
 logging to see the NPE that was caused by the above.
 {code}
 ...
   break;
   }
 } catch (Exception e) {
   LOG.error(An error occurred., e);
 } finally {
   if (t  0) {
 ...
 {code}
 As a side note, there are other errors/asserts triggered that this 
 try/finally not handles. For example
 {noformat}
 2011-05-16 13:58:30,647 WARN 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to 
 assert ownership for 
 /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
 BadVersion for 
 /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
 at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164)
 at java.lang.Thread.run(Thread.java:680)
 {noformat}
 This should probably be handled - or at least documented - in another issue?
 The NPE made the log split end and the SplitLogManager add an endless amount 
 of RESCAN entries as this never came to an end.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK

2011-05-16 Thread Lars George (JIRA)
Scheduled tasks in distributed log splitting not in sync with ZK


 Key: HBASE-3890
 URL: https://issues.apache.org/jira/browse/HBASE-3890
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0
Reporter: Lars George
 Fix For: 0.92.0


This is in continuation to HBASE-3889:

Note that there must be more slightly off here. Although the splitlogs znode is 
now empty the master is still stuck here:

{noformat}
Doing distributed log split in 
hdfs://localhost:8020/hbase/.logs/10.0.0.65,60020,1305406356765  
- Waiting for distributed tasks to finish. scheduled=2 done=1 error=0   4380s

Master startup  
- Splitting logs after master startup   4388s
{noformat}

There seems to be an issue with what is in ZK and what the TaskBatch holds. In 
my case it could be related to the fact that the task was already in ZK after 
many faulty restarts because of the NPE. Maybe it was added once (since that is 
keyed by path, and that is unique on my machine), but the reference count upped 
twice? Now that the real one is done, the done counter has been increased, but 
will never match the scheduled.

The code could also check if ZK is actually depleted, and therefore treat the 
scheduled task as bogus? This of course only treats the symptom, not the root 
cause of this condition. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-3891) TaskMonitor is used wrong in some places

2011-05-16 Thread Lars George (JIRA)
TaskMonitor is used wrong in some places


 Key: HBASE-3891
 URL: https://issues.apache.org/jira/browse/HBASE-3891
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0
Reporter: Lars George
 Fix For: 0.92.0


I have a long running log replay in progress but none of the updates show. This 
is caused by reusing the MonitorTask references wrong, and manifests itself 
like this in the logs:

{noformat}
2011-05-16 15:22:18,127 WARN org.apache.hadoop.hbase.monitoring.TaskMonitor: 
Status org.apache.hadoop.hbase.monitoring.MonitoredTaskImpl@51bfa303 appears to 
have been leaked
2011-05-16 15:22:18,128 DEBUG org.apache.hadoop.hbase.monitoring.MonitoredTask: 
cleanup.
{noformat}

The cleanup sets the completion timestamp and causes the task to be purged from 
the list. After that the UI for example does not show any further running 
tasks, although from the logs I can see (with my log additions):

{noformat}
2011-05-16 15:29:52,296 DEBUG org.apache.hadoop.hbase.monitoring.MonitoredTask: 
setStatus: Compaction complete: 103.1m in 18542ms
2011-05-16 15:29:52,296 DEBUG org.apache.hadoop.hbase.monitoring.MonitoredTask: 
setStatus: Running coprocessor post-compact hooks
2011-05-16 15:29:52,296 DEBUG org.apache.hadoop.hbase.monitoring.MonitoredTask: 
setStatus: Compaction complete
2011-05-16 15:29:52,297 DEBUG org.apache.hadoop.hbase.monitoring.MonitoredTask: 
markComplete: Compaction complete
{noformat}

They are silently ignored as the TaskMonitor has dropped their reference. We 
need to figure out why a supposedly completed task monitor was reused.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK

2011-05-16 Thread Lars George (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034015#comment-13034015
 ] 

Lars George commented on HBASE-3890:


I restarted the cluster (the pseudo distributed local instance) and then it 
kicked into gear with the replaying of the logs, taking up from where it was 
stuck before in limbo.

 Scheduled tasks in distributed log splitting not in sync with ZK
 

 Key: HBASE-3890
 URL: https://issues.apache.org/jira/browse/HBASE-3890
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0
Reporter: Lars George
 Fix For: 0.92.0


 This is in continuation to HBASE-3889:
 Note that there must be more slightly off here. Although the splitlogs znode 
 is now empty the master is still stuck here:
 {noformat}
 Doing distributed log split in 
 hdfs://localhost:8020/hbase/.logs/10.0.0.65,60020,1305406356765
 - Waiting for distributed tasks to finish. scheduled=2 done=1 error=0   4380s
 Master startup
 - Splitting logs after master startup   4388s
 {noformat}
 There seems to be an issue with what is in ZK and what the TaskBatch holds. 
 In my case it could be related to the fact that the task was already in ZK 
 after many faulty restarts because of the NPE. Maybe it was added once (since 
 that is keyed by path, and that is unique on my machine), but the reference 
 count upped twice? Now that the real one is done, the done counter has been 
 increased, but will never match the scheduled.
 The code could also check if ZK is actually depleted, and therefore treat the 
 scheduled task as bogus? This of course only treats the symptom, not the root 
 cause of this condition. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3889) NPE in Distributed Log Splitting

2011-05-16 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034032#comment-13034032
 ] 

Todd Lipcon commented on HBASE-3889:


Unit test?

 NPE in Distributed Log Splitting
 

 Key: HBASE-3889
 URL: https://issues.apache.org/jira/browse/HBASE-3889
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0
 Environment: Pseudo-distributed on MacOS
Reporter: Lars George
Assignee: Lars George
 Fix For: 0.92.0

 Attachments: HBASE-3889.patch


 There is an issue with the log splitting under the specific condition of 
 edits belonging to a non existing region (which went away after a split for 
 example). The HLogSplitter fails to check the condition, which is handled on 
 a lower level, logging manifests it as 
 {noformat}
 2011-05-16 13:56:10,300 INFO 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's 
 directory doesn't exist: 
 hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is 
 very likely that it was already split so it's safe to discard those edits.
 {noformat}
 The code returns a null reference which is not check in 
 HLogSplitter.splitLogFileToTemp():
 {code}
 ...
 WriterAndPath wap = (WriterAndPath)o;
 if (wap == null) {
   wap = createWAP(region, entry, rootDir, tmpname, fs, conf);
   if (wap == null) {
 logWriters.put(region, BAD_WRITER);
   } else {
 logWriters.put(region, wap);
   }
 }
 wap.w.append(entry);
 ...
 {code}
 The createWAP does return null when the above message is logged based on 
 the obsolete region reference in the edit.
 What made this difficult to detect is that the error (and others) are 
 silently ignored in SplitLogWorker.grabTask(). I added a catch and error 
 logging to see the NPE that was caused by the above.
 {code}
 ...
   break;
   }
 } catch (Exception e) {
   LOG.error(An error occurred., e);
 } finally {
   if (t  0) {
 ...
 {code}
 As a side note, there are other errors/asserts triggered that this 
 try/finally not handles. For example
 {noformat}
 2011-05-16 13:58:30,647 WARN 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to 
 assert ownership for 
 /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
 BadVersion for 
 /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
 at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164)
 at java.lang.Thread.run(Thread.java:680)
 {noformat}
 This should probably be handled - or at least documented - in another issue?
 The NPE made the log split end and the SplitLogManager add an endless amount 
 of RESCAN entries as this never came to an end.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3889) NPE in Distributed Log Splitting

2011-05-16 Thread Prakash Khemani (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034139#comment-13034139
 ] 

Prakash Khemani commented on HBASE-3889:


Thanks Lars for finding and providing a fix for this issue. I have a few minor 
comments on the patch ...

The following isn't really needed, the earlier check you put in should be good 
enough.
{code}
+if (wap == null) {
+  continue;
+}
{code}


It might be better to catch Throwable in SplitLogWorker.run() and print the 
Unexpected Error message there. It might not be a good thing to ignore an 
unexpected exception in SplitLogWorker.grabTask() and continue.
{nofrmat}
+++ src/main/java/org/apache/hadoop/hbase/regionserver/SplitLogWorker.java  
(working copy)
@@ -297,6 +297,8 @@
   }
   break;
   }
+} catch (Exception e) {
+  LOG.error(An error occurred., e);
 } finally {
   if (t  0) {
 LOG.info(worker  + serverName +  done with task  + path +
{noformat}





 NPE in Distributed Log Splitting
 

 Key: HBASE-3889
 URL: https://issues.apache.org/jira/browse/HBASE-3889
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0
 Environment: Pseudo-distributed on MacOS
Reporter: Lars George
Assignee: Lars George
 Fix For: 0.92.0

 Attachments: HBASE-3889.patch


 There is an issue with the log splitting under the specific condition of 
 edits belonging to a non existing region (which went away after a split for 
 example). The HLogSplitter fails to check the condition, which is handled on 
 a lower level, logging manifests it as 
 {noformat}
 2011-05-16 13:56:10,300 INFO 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's 
 directory doesn't exist: 
 hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is 
 very likely that it was already split so it's safe to discard those edits.
 {noformat}
 The code returns a null reference which is not check in 
 HLogSplitter.splitLogFileToTemp():
 {code}
 ...
 WriterAndPath wap = (WriterAndPath)o;
 if (wap == null) {
   wap = createWAP(region, entry, rootDir, tmpname, fs, conf);
   if (wap == null) {
 logWriters.put(region, BAD_WRITER);
   } else {
 logWriters.put(region, wap);
   }
 }
 wap.w.append(entry);
 ...
 {code}
 The createWAP does return null when the above message is logged based on 
 the obsolete region reference in the edit.
 What made this difficult to detect is that the error (and others) are 
 silently ignored in SplitLogWorker.grabTask(). I added a catch and error 
 logging to see the NPE that was caused by the above.
 {code}
 ...
   break;
   }
 } catch (Exception e) {
   LOG.error(An error occurred., e);
 } finally {
   if (t  0) {
 ...
 {code}
 As a side note, there are other errors/asserts triggered that this 
 try/finally not handles. For example
 {noformat}
 2011-05-16 13:58:30,647 WARN 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to 
 assert ownership for 
 /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
 BadVersion for 
 /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
 at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164)
 at java.lang.Thread.run(Thread.java:680)
 {noformat}
 This should probably be handled - or at 

[jira] [Commented] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK

2011-05-16 Thread Prakash Khemani (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034149#comment-13034149
 ] 

Prakash Khemani commented on HBASE-3890:


With the bug you identified in HBASE-3889 this behavior is expected. The 
SplitLogManager will put up a task, a SplitLogWorker will pick it up and will 
never complete it because of the bug. Manager will resubmit the task and 
another worker will pick it up to never complete it. The Manager resubmits at 
most hbase.splitlog.max.resubmit (default = 3) times after which the task hangs.



 Scheduled tasks in distributed log splitting not in sync with ZK
 

 Key: HBASE-3890
 URL: https://issues.apache.org/jira/browse/HBASE-3890
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0
Reporter: Lars George
 Fix For: 0.92.0


 This is in continuation to HBASE-3889:
 Note that there must be more slightly off here. Although the splitlogs znode 
 is now empty the master is still stuck here:
 {noformat}
 Doing distributed log split in 
 hdfs://localhost:8020/hbase/.logs/10.0.0.65,60020,1305406356765
 - Waiting for distributed tasks to finish. scheduled=2 done=1 error=0   4380s
 Master startup
 - Splitting logs after master startup   4388s
 {noformat}
 There seems to be an issue with what is in ZK and what the TaskBatch holds. 
 In my case it could be related to the fact that the task was already in ZK 
 after many faulty restarts because of the NPE. Maybe it was added once (since 
 that is keyed by path, and that is unique on my machine), but the reference 
 count upped twice? Now that the real one is done, the done counter has been 
 increased, but will never match the scheduled.
 The code could also check if ZK is actually depleted, and therefore treat the 
 scheduled task as bogus? This of course only treats the symptom, not the root 
 cause of this condition. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-3889) NPE in Distributed Log Splitting

2011-05-16 Thread Lars George (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars George updated HBASE-3889:
---

Status: Open  (was: Patch Available)

Patch is too simplistic and needs little cleanup and better error handling, 
plus should have a unit test.

 NPE in Distributed Log Splitting
 

 Key: HBASE-3889
 URL: https://issues.apache.org/jira/browse/HBASE-3889
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0
 Environment: Pseudo-distributed on MacOS
Reporter: Lars George
Assignee: Lars George
 Fix For: 0.92.0

 Attachments: HBASE-3889.patch


 There is an issue with the log splitting under the specific condition of 
 edits belonging to a non existing region (which went away after a split for 
 example). The HLogSplitter fails to check the condition, which is handled on 
 a lower level, logging manifests it as 
 {noformat}
 2011-05-16 13:56:10,300 INFO 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's 
 directory doesn't exist: 
 hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is 
 very likely that it was already split so it's safe to discard those edits.
 {noformat}
 The code returns a null reference which is not check in 
 HLogSplitter.splitLogFileToTemp():
 {code}
 ...
 WriterAndPath wap = (WriterAndPath)o;
 if (wap == null) {
   wap = createWAP(region, entry, rootDir, tmpname, fs, conf);
   if (wap == null) {
 logWriters.put(region, BAD_WRITER);
   } else {
 logWriters.put(region, wap);
   }
 }
 wap.w.append(entry);
 ...
 {code}
 The createWAP does return null when the above message is logged based on 
 the obsolete region reference in the edit.
 What made this difficult to detect is that the error (and others) are 
 silently ignored in SplitLogWorker.grabTask(). I added a catch and error 
 logging to see the NPE that was caused by the above.
 {code}
 ...
   break;
   }
 } catch (Exception e) {
   LOG.error(An error occurred., e);
 } finally {
   if (t  0) {
 ...
 {code}
 As a side note, there are other errors/asserts triggered that this 
 try/finally not handles. For example
 {noformat}
 2011-05-16 13:58:30,647 WARN 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to 
 assert ownership for 
 /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
 BadVersion for 
 /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
 at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164)
 at java.lang.Thread.run(Thread.java:680)
 {noformat}
 This should probably be handled - or at least documented - in another issue?
 The NPE made the log split end and the SplitLogManager add an endless amount 
 of RESCAN entries as this never came to an end.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK

2011-05-16 Thread Lars George (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034291#comment-13034291
 ] 

Lars George commented on HBASE-3890:


Hi Prakash, thanks for the input! I think though this is unrelated, as this 
happened after the patch and restart. The messages should all have been the 
replay of the recovered logs. They do not show up since the first few will make 
them drop from the TaskMonitor because of the reuse.

I am not sure where and how this errs, and if it does at all. But I got those 
leaked log hints and the UI did not show any running tasks as it should have. 
So something is amiss, but I still need to check what is wrong.

 Scheduled tasks in distributed log splitting not in sync with ZK
 

 Key: HBASE-3890
 URL: https://issues.apache.org/jira/browse/HBASE-3890
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0
Reporter: Lars George
 Fix For: 0.92.0


 This is in continuation to HBASE-3889:
 Note that there must be more slightly off here. Although the splitlogs znode 
 is now empty the master is still stuck here:
 {noformat}
 Doing distributed log split in 
 hdfs://localhost:8020/hbase/.logs/10.0.0.65,60020,1305406356765
 - Waiting for distributed tasks to finish. scheduled=2 done=1 error=0   4380s
 Master startup
 - Splitting logs after master startup   4388s
 {noformat}
 There seems to be an issue with what is in ZK and what the TaskBatch holds. 
 In my case it could be related to the fact that the task was already in ZK 
 after many faulty restarts because of the NPE. Maybe it was added once (since 
 that is keyed by path, and that is unique on my machine), but the reference 
 count upped twice? Now that the real one is done, the done counter has been 
 increased, but will never match the scheduled.
 The code could also check if ZK is actually depleted, and therefore treat the 
 scheduled task as bogus? This of course only treats the symptom, not the root 
 cause of this condition. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-1316) ZooKeeper: use native threads to avoid GC stalls (JNI integration)

2011-05-16 Thread Joey Echeverria (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034293#comment-13034293
 ] 

Joey Echeverria commented on HBASE-1316:


+1 on Todd et. al.'s design. I think it would be useful to use the JNI thread 
for (2) for the following reasons:

1) If the region server process goes down hard, but the machine is still up you 
won't see that failure for the long timeout (5 minutes) if the other session is 
in a separate JVM.
2) It's probably easier to manage a single JVM during startup and shutdown of a 
regionserver.

If nobody minds, I'd like to take a stab it generating a patch for this.

 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
 --

 Key: HBASE-1316
 URL: https://issues.apache.org/jira/browse/HBASE-1316
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.20.0
Reporter: Andrew Purtell
Assignee: Berk D. Demir
 Attachments: zk_wrapper.tar.gz


 From Joey Echeverria up on hbase-users@:
 We've used zookeeper in a write-heavy project we've been working on and 
 experienced issues similar to what you described. After several days of 
 debugging, we discovered that our issue was garbage collection. There was no 
 way to guarantee we wouldn't have long pauses especially since our 
 environment was the worst case for garbage collection, millions of tiny, 
 short lived objects. I suspect HBase sees similar work loads frequently, if 
 it's not constantly. With anything shorter than a 30 second session time out, 
 we got session expiration events extremely frequently. We needed to use 60 
 seconds for any real confidence that an ephemeral node disappearing meant 
 something was unavailable.
 We really wanted quick recovery so we ended up writing a light-weight wrapper 
 around the C API and used swig to auto-generate a JNI interface. It's not 
 perfect, but since we switched to this method we've never seen a session 
 expiration event and ephemeral nodes only disappear when there are network 
 issues or a machine/process goes down.
 I don't know if it's worth doing the same kind of thing for HBase as it adds 
 some unnecessary native code, but it's a solution that I found works.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-1316) ZooKeeper: use native threads to avoid GC stalls (JNI integration)

2011-05-16 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034298#comment-13034298
 ] 

stack commented on HBASE-1316:
--

Be our guest.  Thanks Joey.

 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
 --

 Key: HBASE-1316
 URL: https://issues.apache.org/jira/browse/HBASE-1316
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.20.0
Reporter: Andrew Purtell
Assignee: Berk D. Demir
 Attachments: zk_wrapper.tar.gz


 From Joey Echeverria up on hbase-users@:
 We've used zookeeper in a write-heavy project we've been working on and 
 experienced issues similar to what you described. After several days of 
 debugging, we discovered that our issue was garbage collection. There was no 
 way to guarantee we wouldn't have long pauses especially since our 
 environment was the worst case for garbage collection, millions of tiny, 
 short lived objects. I suspect HBase sees similar work loads frequently, if 
 it's not constantly. With anything shorter than a 30 second session time out, 
 we got session expiration events extremely frequently. We needed to use 60 
 seconds for any real confidence that an ephemeral node disappearing meant 
 something was unavailable.
 We really wanted quick recovery so we ended up writing a light-weight wrapper 
 around the C API and used swig to auto-generate a JNI interface. It's not 
 perfect, but since we switched to this method we've never seen a session 
 expiration event and ephemeral nodes only disappear when there are network 
 issues or a machine/process goes down.
 I don't know if it's worth doing the same kind of thing for HBase as it adds 
 some unnecessary native code, but it's a solution that I found works.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-3889) NPE in Distributed Log Splitting

2011-05-16 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-3889:
-

Priority: Blocker  (was: Major)
Assignee: stack  (was: Lars George)

Marking critical.  I'll pick up where Lars left off (with some help from 
Prakash).  Thanks Lars.

 NPE in Distributed Log Splitting
 

 Key: HBASE-3889
 URL: https://issues.apache.org/jira/browse/HBASE-3889
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0
 Environment: Pseudo-distributed on MacOS
Reporter: Lars George
Assignee: stack
Priority: Blocker
 Fix For: 0.92.0

 Attachments: HBASE-3889.patch


 There is an issue with the log splitting under the specific condition of 
 edits belonging to a non existing region (which went away after a split for 
 example). The HLogSplitter fails to check the condition, which is handled on 
 a lower level, logging manifests it as 
 {noformat}
 2011-05-16 13:56:10,300 INFO 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's 
 directory doesn't exist: 
 hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is 
 very likely that it was already split so it's safe to discard those edits.
 {noformat}
 The code returns a null reference which is not check in 
 HLogSplitter.splitLogFileToTemp():
 {code}
 ...
 WriterAndPath wap = (WriterAndPath)o;
 if (wap == null) {
   wap = createWAP(region, entry, rootDir, tmpname, fs, conf);
   if (wap == null) {
 logWriters.put(region, BAD_WRITER);
   } else {
 logWriters.put(region, wap);
   }
 }
 wap.w.append(entry);
 ...
 {code}
 The createWAP does return null when the above message is logged based on 
 the obsolete region reference in the edit.
 What made this difficult to detect is that the error (and others) are 
 silently ignored in SplitLogWorker.grabTask(). I added a catch and error 
 logging to see the NPE that was caused by the above.
 {code}
 ...
   break;
   }
 } catch (Exception e) {
   LOG.error(An error occurred., e);
 } finally {
   if (t  0) {
 ...
 {code}
 As a side note, there are other errors/asserts triggered that this 
 try/finally not handles. For example
 {noformat}
 2011-05-16 13:58:30,647 WARN 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to 
 assert ownership for 
 /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
 BadVersion for 
 /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
 at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
 at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191)
 at 
 org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164)
 at java.lang.Thread.run(Thread.java:680)
 {noformat}
 This should probably be handled - or at least documented - in another issue?
 The NPE made the log split end and the SplitLogManager add an endless amount 
 of RESCAN entries as this never came to an end.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-3892) Table can't disable

2011-05-16 Thread gaojinchao (JIRA)
Table can't disable
---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4


In TimeoutMonitor : 
if node exists and node state is RS_ZK_REGION_CLOSED
We should send a zk message again when close region is timeout.
in this case, It may be loss some message.


I See. It seems like a bug. This is my analysis.

// disable table and master sent Close message to region server, Region state 
was set PENDING_CLOSE

2011-05-08 17:44:25,745 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Sent CLOSE to serverName=C4C4.site,60020,1304820199467, load=(requests=0, 
regions=123, usedHeap=4097, maxHeap=8175) for region 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467

// received splitting message and cleared Region state (PENDING_CLOSE)

2011-05-08 17:46:45,303 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Overwriting 4418fb197685a21f77e151e401cf8b66 on 
serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
usedHeap=4097, maxHeap=8175)
2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:52:47,309 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:6-0x22fcd582836003d Retrieved 125 byte(s) of data from znode 

[jira] [Updated] (HBASE-3892) Table can't disable

2011-05-16 Thread gaojinchao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-3892:
--

Attachment: Hmaster_0.90.patch

 Table can't disable
 ---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4

 Attachments: Hmaster_0.90.patch


 In TimeoutMonitor : 
 if node exists and node state is RS_ZK_REGION_CLOSED
 We should send a zk message again when close region is timeout.
 in this case, It may be loss some message.
 I See. It seems like a bug. This is my analysis.
 // disable table and master sent Close message to region server, Region state 
 was set PENDING_CLOSE
 2011-05-08 17:44:25,745 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
 usedHeap=4097, maxHeap=8175) for region 
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 // received splitting message and cleared Region state (PENDING_CLOSE)
 2011-05-08 17:46:45,303 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, 
 load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters; 
 ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
  
 ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
  from C4C4.site,60020,1304820199467
 2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: 
 Received REGION_SPLIT: 
 ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
  Daughters;