date:20110516

[jira] [Updated] (HBASE-3892) Table can't disable

2011-05-16 Thread gaojinchao (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-3892:
--

Attachment: Hmaster_0.90.patch

> Table can't disable
> ---
>
> Key: HBASE-3892
> URL: https://issues.apache.org/jira/browse/HBASE-3892
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.3
>Reporter: gaojinchao
> Fix For: 0.90.4
>
> Attachments: Hmaster_0.90.patch
>
>
> In TimeoutMonitor : 
> if node exists and node state is RS_ZK_REGION_CLOSED
> We should send a zk message again when close region is timeout.
> in this case, It may be loss some message.
> I See. It seems like a bug. This is my analysis.
> // disable table and master sent Close message to region server, Region state 
> was set PENDING_CLOSE
> 2011-05-08 17:44:25,745 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
> serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
> usedHeap=4097, maxHeap=8175) for region 
> ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
> 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Received REGION_SPLIT: 
> ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
>  Daughters; 
> ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
>  
> ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
>  from C4C4.site,60020,1304820199467
> 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Received REGION_SPLIT: 
> ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
>  Daughters; 
> ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
>  
> ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
>  from C4C4.site,60020,1304820199467
> // received splitting message and cleared Region state (PENDING_CLOSE)
> 2011-05-08 17:46:45,303 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 
> 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, 
> load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
> 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Received REGION_SPLIT: 
> ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
>  Daughters; 
> ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
>  
> ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
>  from C4C4.site,60020,1304820199467
> 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Received REGION_SPLIT: 
> ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
>  Daughters; 
> ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
>  
> ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
>  from C4C4.site,60020,1304820199467
> 2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Received REGION_SPLIT: 
> ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
>  Daughters; 
> ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
>  
> ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
>  from C4C4.site,60020,1304820199467
> 2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Received REGION_SPLIT: 
> ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
>  Daughters; 
> ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
>  
> ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
>  from C4C4.site,60020,1304820199467
> 2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Received REGION_SPLIT: 
> ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
>  Daughters; 
> ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
>  
> ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
>  from C4C4.site,60020,1304820199467
> 2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Received REGION_SPLIT: 
> ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
>  Daughters; 
> ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
>  
> ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
>  from C4C4.site,60020,1304820199467
> 2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Received REGION_SPLIT: 
> ufdr,20110

[jira] [Created] (HBASE-3892) Table can't disable

2011-05-16 Thread gaojinchao (JIRA)

Table can't disable
---

 Key: HBASE-3892
 URL: https://issues.apache.org/jira/browse/HBASE-3892
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.3
Reporter: gaojinchao
 Fix For: 0.90.4


In TimeoutMonitor : 
if node exists and node state is RS_ZK_REGION_CLOSED
We should send a zk message again when close region is timeout.
in this case, It may be loss some message.


I See. It seems like a bug. This is my analysis.

// disable table and master sent Close message to region server, Region state 
was set PENDING_CLOSE

2011-05-08 17:44:25,745 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Sent CLOSE to serverName=C4C4.site,60020,1304820199467, load=(requests=0, 
regions=123, usedHeap=4097, maxHeap=8175) for region 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467

// received splitting message and cleared Region state (PENDING_CLOSE)

2011-05-08 17:46:45,303 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Overwriting 4418fb197685a21f77e151e401cf8b66 on 
serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, 
usedHeap=4097, maxHeap=8175)
2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: 
Received REGION_SPLIT: 
ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.:
 Daughters; 
ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62.,
 
ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
 from C4C4.site,60020,1304820199467
2011-05-08 17:52:47,309 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:6-0x22fcd582836003d Retrieved 125 byte(s) of data from znode 
/hbase/una

[jira] [Updated] (HBASE-3889) NPE in Distributed Log Splitting

2011-05-16 Thread stack (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-3889:
-

Priority: Blocker  (was: Major)
Assignee: stack  (was: Lars George)

Marking critical.  I'll pick up where Lars left off (with some help from 
Prakash).  Thanks Lars.

> NPE in Distributed Log Splitting
> 
>
> Key: HBASE-3889
> URL: https://issues.apache.org/jira/browse/HBASE-3889
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.92.0
> Environment: Pseudo-distributed on MacOS
>Reporter: Lars George
>Assignee: stack
>Priority: Blocker
> Fix For: 0.92.0
>
> Attachments: HBASE-3889.patch
>
>
> There is an issue with the log splitting under the specific condition of 
> edits belonging to a non existing region (which went away after a split for 
> example). The HLogSplitter fails to check the condition, which is handled on 
> a lower level, logging manifests it as 
> {noformat}
> 2011-05-16 13:56:10,300 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's 
> directory doesn't exist: 
> hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is 
> very likely that it was already split so it's safe to discard those edits.
> {noformat}
> The code returns a null reference which is not check in 
> HLogSplitter.splitLogFileToTemp():
> {code}
> ...
> WriterAndPath wap = (WriterAndPath)o;
> if (wap == null) {
>   wap = createWAP(region, entry, rootDir, tmpname, fs, conf);
>   if (wap == null) {
> logWriters.put(region, BAD_WRITER);
>   } else {
> logWriters.put(region, wap);
>   }
> }
> wap.w.append(entry);
> ...
> {code}
> The createWAP does return "null" when the above message is logged based on 
> the obsolete region reference in the edit.
> What made this difficult to detect is that the error (and others) are 
> silently ignored in SplitLogWorker.grabTask(). I added a catch and error 
> logging to see the NPE that was caused by the above.
> {code}
> ...
>   break;
>   }
> } catch (Exception e) {
>   LOG.error("An error occurred.", e);
> } finally {
>   if (t > 0) {
> ...
> {code}
> As a side note, there are other errors/asserts triggered that this 
> try/finally not handles. For example
> {noformat}
> 2011-05-16 13:58:30,647 WARN 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to 
> assert ownership for 
> /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
> BadVersion for 
> /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265)
> at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432)
> at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164)
> at java.lang.Thread.run(Thread.java:680)
> {noformat}
> This should probably be handled - or at least documented - in another issue?
> The NPE made the log split end and the SplitLogManager add an endless amount 
> of RESCAN entries as this never came to an end.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-1316) ZooKeeper: use native threads to avoid GC stalls (JNI integration)

2011-05-16 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034298#comment-13034298
 ] 

stack commented on HBASE-1316:
--

Be our guest.  Thanks Joey.

> ZooKeeper: use native threads to avoid GC stalls (JNI integration)
> --
>
> Key: HBASE-1316
> URL: https://issues.apache.org/jira/browse/HBASE-1316
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 0.20.0
>Reporter: Andrew Purtell
>Assignee: Berk D. Demir
> Attachments: zk_wrapper.tar.gz
>
>
> From Joey Echeverria up on hbase-users@:
> We've used zookeeper in a write-heavy project we've been working on and 
> experienced issues similar to what you described. After several days of 
> debugging, we discovered that our issue was garbage collection. There was no 
> way to guarantee we wouldn't have long pauses especially since our 
> environment was the worst case for garbage collection, millions of tiny, 
> short lived objects. I suspect HBase sees similar work loads frequently, if 
> it's not constantly. With anything shorter than a 30 second session time out, 
> we got session expiration events extremely frequently. We needed to use 60 
> seconds for any real confidence that an ephemeral node disappearing meant 
> something was unavailable.
> We really wanted quick recovery so we ended up writing a light-weight wrapper 
> around the C API and used swig to auto-generate a JNI interface. It's not 
> perfect, but since we switched to this method we've never seen a session 
> expiration event and ephemeral nodes only disappear when there are network 
> issues or a machine/process goes down.
> I don't know if it's worth doing the same kind of thing for HBase as it adds 
> some "unnecessary" native code, but it's a solution that I found works.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-1316) ZooKeeper: use native threads to avoid GC stalls (JNI integration)

2011-05-16 Thread Joey Echeverria (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034293#comment-13034293
 ] 

Joey Echeverria commented on HBASE-1316:


+1 on Todd et. al.'s design. I think it would be useful to use the JNI thread 
for (2) for the following reasons:

1) If the region server process goes down hard, but the machine is still up you 
won't see that failure for the long timeout (5 minutes) if the other session is 
in a separate JVM.
2) It's probably easier to manage a single JVM during startup and shutdown of a 
regionserver.

If nobody minds, I'd like to take a stab it generating a patch for this.

> ZooKeeper: use native threads to avoid GC stalls (JNI integration)
> --
>
> Key: HBASE-1316
> URL: https://issues.apache.org/jira/browse/HBASE-1316
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 0.20.0
>Reporter: Andrew Purtell
>Assignee: Berk D. Demir
> Attachments: zk_wrapper.tar.gz
>
>
> From Joey Echeverria up on hbase-users@:
> We've used zookeeper in a write-heavy project we've been working on and 
> experienced issues similar to what you described. After several days of 
> debugging, we discovered that our issue was garbage collection. There was no 
> way to guarantee we wouldn't have long pauses especially since our 
> environment was the worst case for garbage collection, millions of tiny, 
> short lived objects. I suspect HBase sees similar work loads frequently, if 
> it's not constantly. With anything shorter than a 30 second session time out, 
> we got session expiration events extremely frequently. We needed to use 60 
> seconds for any real confidence that an ephemeral node disappearing meant 
> something was unavailable.
> We really wanted quick recovery so we ended up writing a light-weight wrapper 
> around the C API and used swig to auto-generate a JNI interface. It's not 
> perfect, but since we switched to this method we've never seen a session 
> expiration event and ephemeral nodes only disappear when there are network 
> issues or a machine/process goes down.
> I don't know if it's worth doing the same kind of thing for HBase as it adds 
> some "unnecessary" native code, but it's a solution that I found works.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK

2011-05-16 Thread Lars George (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034291#comment-13034291
 ] 

Lars George commented on HBASE-3890:


Hi Prakash, thanks for the input! I think though this is unrelated, as this 
happened after the patch and restart. The messages should all have been the 
replay of the recovered logs. They do not show up since the first few will make 
them drop from the TaskMonitor because of the reuse.

I am not sure where and how this errs, and if it does at all. But I got those 
"leaked" log hints and the UI did not show any running tasks as it should have. 
So something is amiss, but I still need to check what is wrong.

> Scheduled tasks in distributed log splitting not in sync with ZK
> 
>
> Key: HBASE-3890
> URL: https://issues.apache.org/jira/browse/HBASE-3890
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.92.0
>Reporter: Lars George
> Fix For: 0.92.0
>
>
> This is in continuation to HBASE-3889:
> Note that there must be more slightly off here. Although the splitlogs znode 
> is now empty the master is still stuck here:
> {noformat}
> Doing distributed log split in 
> hdfs://localhost:8020/hbase/.logs/10.0.0.65,60020,1305406356765
> - Waiting for distributed tasks to finish. scheduled=2 done=1 error=0   4380s
> Master startup
> - Splitting logs after master startup   4388s
> {noformat}
> There seems to be an issue with what is in ZK and what the TaskBatch holds. 
> In my case it could be related to the fact that the task was already in ZK 
> after many faulty restarts because of the NPE. Maybe it was added once (since 
> that is keyed by path, and that is unique on my machine), but the reference 
> count upped twice? Now that the real one is done, the done counter has been 
> increased, but will never match the scheduled.
> The code could also check if ZK is actually depleted, and therefore treat the 
> scheduled task as bogus? This of course only treats the symptom, not the root 
> cause of this condition. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3889) NPE in Distributed Log Splitting

2011-05-16 Thread Lars George (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034222#comment-13034222
 ] 

Lars George commented on HBASE-3889:


Hi Prakash, the issue I had was a second RegionServer process running and still 
giving me this error. I only had the extra "continue", but it wasn't picked up. 
I forgot to remove the second "if". I agree on the error handling, I just added 
the log to get some info. 

I guess I shouldn't have made the patch available through JIRA but simply 
attach it for someone to pick up. I am on a tight deadline right now and won't 
be able to work on a unit test - although I would love to help out, once I get 
my current assignment completed.

> NPE in Distributed Log Splitting
> 
>
> Key: HBASE-3889
> URL: https://issues.apache.org/jira/browse/HBASE-3889
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.92.0
> Environment: Pseudo-distributed on MacOS
>Reporter: Lars George
>Assignee: Lars George
> Fix For: 0.92.0
>
> Attachments: HBASE-3889.patch
>
>
> There is an issue with the log splitting under the specific condition of 
> edits belonging to a non existing region (which went away after a split for 
> example). The HLogSplitter fails to check the condition, which is handled on 
> a lower level, logging manifests it as 
> {noformat}
> 2011-05-16 13:56:10,300 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's 
> directory doesn't exist: 
> hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is 
> very likely that it was already split so it's safe to discard those edits.
> {noformat}
> The code returns a null reference which is not check in 
> HLogSplitter.splitLogFileToTemp():
> {code}
> ...
> WriterAndPath wap = (WriterAndPath)o;
> if (wap == null) {
>   wap = createWAP(region, entry, rootDir, tmpname, fs, conf);
>   if (wap == null) {
> logWriters.put(region, BAD_WRITER);
>   } else {
> logWriters.put(region, wap);
>   }
> }
> wap.w.append(entry);
> ...
> {code}
> The createWAP does return "null" when the above message is logged based on 
> the obsolete region reference in the edit.
> What made this difficult to detect is that the error (and others) are 
> silently ignored in SplitLogWorker.grabTask(). I added a catch and error 
> logging to see the NPE that was caused by the above.
> {code}
> ...
>   break;
>   }
> } catch (Exception e) {
>   LOG.error("An error occurred.", e);
> } finally {
>   if (t > 0) {
> ...
> {code}
> As a side note, there are other errors/asserts triggered that this 
> try/finally not handles. For example
> {noformat}
> 2011-05-16 13:58:30,647 WARN 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to 
> assert ownership for 
> /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
> BadVersion for 
> /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265)
> at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432)
> at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164)
> at java.lang.Thread.run(Thread.java:680)
> {noformat}
> This should probably be handled - or at least documented - in another issue?
> The NPE made the log split end and the SplitLogManager add an endless amount 
> of RESCAN entries as this never came to an end.

--
This message is automatically generated by

[jira] [Updated] (HBASE-3889) NPE in Distributed Log Splitting

2011-05-16 Thread Lars George (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars George updated HBASE-3889:
---

Status: Open  (was: Patch Available)

Patch is too simplistic and needs little cleanup and better error handling, 
plus should have a unit test.

> NPE in Distributed Log Splitting
> 
>
> Key: HBASE-3889
> URL: https://issues.apache.org/jira/browse/HBASE-3889
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.92.0
> Environment: Pseudo-distributed on MacOS
>Reporter: Lars George
>Assignee: Lars George
> Fix For: 0.92.0
>
> Attachments: HBASE-3889.patch
>
>
> There is an issue with the log splitting under the specific condition of 
> edits belonging to a non existing region (which went away after a split for 
> example). The HLogSplitter fails to check the condition, which is handled on 
> a lower level, logging manifests it as 
> {noformat}
> 2011-05-16 13:56:10,300 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's 
> directory doesn't exist: 
> hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is 
> very likely that it was already split so it's safe to discard those edits.
> {noformat}
> The code returns a null reference which is not check in 
> HLogSplitter.splitLogFileToTemp():
> {code}
> ...
> WriterAndPath wap = (WriterAndPath)o;
> if (wap == null) {
>   wap = createWAP(region, entry, rootDir, tmpname, fs, conf);
>   if (wap == null) {
> logWriters.put(region, BAD_WRITER);
>   } else {
> logWriters.put(region, wap);
>   }
> }
> wap.w.append(entry);
> ...
> {code}
> The createWAP does return "null" when the above message is logged based on 
> the obsolete region reference in the edit.
> What made this difficult to detect is that the error (and others) are 
> silently ignored in SplitLogWorker.grabTask(). I added a catch and error 
> logging to see the NPE that was caused by the above.
> {code}
> ...
>   break;
>   }
> } catch (Exception e) {
>   LOG.error("An error occurred.", e);
> } finally {
>   if (t > 0) {
> ...
> {code}
> As a side note, there are other errors/asserts triggered that this 
> try/finally not handles. For example
> {noformat}
> 2011-05-16 13:58:30,647 WARN 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to 
> assert ownership for 
> /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
> BadVersion for 
> /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265)
> at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432)
> at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164)
> at java.lang.Thread.run(Thread.java:680)
> {noformat}
> This should probably be handled - or at least documented - in another issue?
> The NPE made the log split end and the SplitLogManager add an endless amount 
> of RESCAN entries as this never came to an end.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK

2011-05-16 Thread Prakash Khemani (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034149#comment-13034149
 ] 

Prakash Khemani commented on HBASE-3890:


With the bug you identified in HBASE-3889 this behavior is expected. The 
SplitLogManager will put up a task, a SplitLogWorker will pick it up and will 
never complete it because of the bug. Manager will resubmit the task and 
another worker will pick it up to never complete it. The Manager resubmits at 
most hbase.splitlog.max.resubmit (default = 3) times after which the task hangs.



> Scheduled tasks in distributed log splitting not in sync with ZK
> 
>
> Key: HBASE-3890
> URL: https://issues.apache.org/jira/browse/HBASE-3890
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.92.0
>Reporter: Lars George
> Fix For: 0.92.0
>
>
> This is in continuation to HBASE-3889:
> Note that there must be more slightly off here. Although the splitlogs znode 
> is now empty the master is still stuck here:
> {noformat}
> Doing distributed log split in 
> hdfs://localhost:8020/hbase/.logs/10.0.0.65,60020,1305406356765
> - Waiting for distributed tasks to finish. scheduled=2 done=1 error=0   4380s
> Master startup
> - Splitting logs after master startup   4388s
> {noformat}
> There seems to be an issue with what is in ZK and what the TaskBatch holds. 
> In my case it could be related to the fact that the task was already in ZK 
> after many faulty restarts because of the NPE. Maybe it was added once (since 
> that is keyed by path, and that is unique on my machine), but the reference 
> count upped twice? Now that the real one is done, the done counter has been 
> increased, but will never match the scheduled.
> The code could also check if ZK is actually depleted, and therefore treat the 
> scheduled task as bogus? This of course only treats the symptom, not the root 
> cause of this condition. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3889) NPE in Distributed Log Splitting

2011-05-16 Thread Prakash Khemani (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034139#comment-13034139
 ] 

Prakash Khemani commented on HBASE-3889:


Thanks Lars for finding and providing a fix for this issue. I have a few minor 
comments on the patch ...

The following isn't really needed, the earlier check you put in should be good 
enough.
{code}
+if (wap == null) {
+  continue;
+}
{code}


It might be better to catch Throwable in SplitLogWorker.run() and print the 
Unexpected Error message there. It might not be a good thing to ignore an 
unexpected exception in SplitLogWorker.grabTask() and continue.
{nofrmat}
+++ src/main/java/org/apache/hadoop/hbase/regionserver/SplitLogWorker.java  
(working copy)
@@ -297,6 +297,8 @@
   }
   break;
   }
+} catch (Exception e) {
+  LOG.error("An error occurred.", e);
 } finally {
   if (t > 0) {
 LOG.info("worker " + serverName + " done with task " + path +
{noformat}





> NPE in Distributed Log Splitting
> 
>
> Key: HBASE-3889
> URL: https://issues.apache.org/jira/browse/HBASE-3889
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.92.0
> Environment: Pseudo-distributed on MacOS
>Reporter: Lars George
>Assignee: Lars George
> Fix For: 0.92.0
>
> Attachments: HBASE-3889.patch
>
>
> There is an issue with the log splitting under the specific condition of 
> edits belonging to a non existing region (which went away after a split for 
> example). The HLogSplitter fails to check the condition, which is handled on 
> a lower level, logging manifests it as 
> {noformat}
> 2011-05-16 13:56:10,300 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's 
> directory doesn't exist: 
> hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is 
> very likely that it was already split so it's safe to discard those edits.
> {noformat}
> The code returns a null reference which is not check in 
> HLogSplitter.splitLogFileToTemp():
> {code}
> ...
> WriterAndPath wap = (WriterAndPath)o;
> if (wap == null) {
>   wap = createWAP(region, entry, rootDir, tmpname, fs, conf);
>   if (wap == null) {
> logWriters.put(region, BAD_WRITER);
>   } else {
> logWriters.put(region, wap);
>   }
> }
> wap.w.append(entry);
> ...
> {code}
> The createWAP does return "null" when the above message is logged based on 
> the obsolete region reference in the edit.
> What made this difficult to detect is that the error (and others) are 
> silently ignored in SplitLogWorker.grabTask(). I added a catch and error 
> logging to see the NPE that was caused by the above.
> {code}
> ...
>   break;
>   }
> } catch (Exception e) {
>   LOG.error("An error occurred.", e);
> } finally {
>   if (t > 0) {
> ...
> {code}
> As a side note, there are other errors/asserts triggered that this 
> try/finally not handles. For example
> {noformat}
> 2011-05-16 13:58:30,647 WARN 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to 
> assert ownership for 
> /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
> BadVersion for 
> /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265)
> at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432)
> at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164)
>

[jira] [Commented] (HBASE-2647) TestLogRolling flaky in trunk

2011-05-16 Thread Alex Newman (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034116#comment-13034116
 ] 

Alex Newman commented on HBASE-2647:


Could this be related to 
http://ria101.wordpress.com/2010/01/28/setup-hbase-in-pseudo-distributed-mode-and-connect-java-client/.
 Did the server launch on the external interface?

> TestLogRolling flaky in trunk
> -
>
> Key: HBASE-2647
> URL: https://issues.apache.org/jira/browse/HBASE-2647
> Project: HBase
>  Issue Type: Bug
>  Components: test, wal
>Affects Versions: 0.90.0
>Reporter: Todd Lipcon
>Priority: Critical
>
> Saw this failure on my Hudson:
> org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying 
> to locate root region because: Failed setting up proxy to 
> /192.168.42.22:44708 after attempts=1
>   at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:1023)
>   at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:630)
>   at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:606)
>   at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:676)
>   at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:635)
>   at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:606)
>   at org.apache.hadoop.hbase.client.HTable.(HTable.java:132)
>   at 
> org.apache.hadoop.hbase.HBaseClusterTestCase.hBaseClusterSetup(HBaseClusterTestCase.java:110)
>   at 
> org.apache.hadoop.hbase.HBaseClusterTestCase.setUp(HBaseClusterTestCase.java:147)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.setUp(TestLogRolling.java:131)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3889) NPE in Distributed Log Splitting

2011-05-16 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034032#comment-13034032
 ] 

Todd Lipcon commented on HBASE-3889:


Unit test?

> NPE in Distributed Log Splitting
> 
>
> Key: HBASE-3889
> URL: https://issues.apache.org/jira/browse/HBASE-3889
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.92.0
> Environment: Pseudo-distributed on MacOS
>Reporter: Lars George
>Assignee: Lars George
> Fix For: 0.92.0
>
> Attachments: HBASE-3889.patch
>
>
> There is an issue with the log splitting under the specific condition of 
> edits belonging to a non existing region (which went away after a split for 
> example). The HLogSplitter fails to check the condition, which is handled on 
> a lower level, logging manifests it as 
> {noformat}
> 2011-05-16 13:56:10,300 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's 
> directory doesn't exist: 
> hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is 
> very likely that it was already split so it's safe to discard those edits.
> {noformat}
> The code returns a null reference which is not check in 
> HLogSplitter.splitLogFileToTemp():
> {code}
> ...
> WriterAndPath wap = (WriterAndPath)o;
> if (wap == null) {
>   wap = createWAP(region, entry, rootDir, tmpname, fs, conf);
>   if (wap == null) {
> logWriters.put(region, BAD_WRITER);
>   } else {
> logWriters.put(region, wap);
>   }
> }
> wap.w.append(entry);
> ...
> {code}
> The createWAP does return "null" when the above message is logged based on 
> the obsolete region reference in the edit.
> What made this difficult to detect is that the error (and others) are 
> silently ignored in SplitLogWorker.grabTask(). I added a catch and error 
> logging to see the NPE that was caused by the above.
> {code}
> ...
>   break;
>   }
> } catch (Exception e) {
>   LOG.error("An error occurred.", e);
> } finally {
>   if (t > 0) {
> ...
> {code}
> As a side note, there are other errors/asserts triggered that this 
> try/finally not handles. For example
> {noformat}
> 2011-05-16 13:58:30,647 WARN 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to 
> assert ownership for 
> /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
> BadVersion for 
> /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265)
> at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432)
> at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164)
> at java.lang.Thread.run(Thread.java:680)
> {noformat}
> This should probably be handled - or at least documented - in another issue?
> The NPE made the log split end and the SplitLogManager add an endless amount 
> of RESCAN entries as this never came to an end.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK

2011-05-16 Thread Lars George (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034015#comment-13034015
 ] 

Lars George commented on HBASE-3890:


I restarted the cluster (the pseudo distributed local instance) and then it 
kicked into gear with the replaying of the logs, taking up from where it was 
stuck before in limbo.

> Scheduled tasks in distributed log splitting not in sync with ZK
> 
>
> Key: HBASE-3890
> URL: https://issues.apache.org/jira/browse/HBASE-3890
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.92.0
>Reporter: Lars George
> Fix For: 0.92.0
>
>
> This is in continuation to HBASE-3889:
> Note that there must be more slightly off here. Although the splitlogs znode 
> is now empty the master is still stuck here:
> {noformat}
> Doing distributed log split in 
> hdfs://localhost:8020/hbase/.logs/10.0.0.65,60020,1305406356765
> - Waiting for distributed tasks to finish. scheduled=2 done=1 error=0   4380s
> Master startup
> - Splitting logs after master startup   4388s
> {noformat}
> There seems to be an issue with what is in ZK and what the TaskBatch holds. 
> In my case it could be related to the fact that the task was already in ZK 
> after many faulty restarts because of the NPE. Maybe it was added once (since 
> that is keyed by path, and that is unique on my machine), but the reference 
> count upped twice? Now that the real one is done, the done counter has been 
> increased, but will never match the scheduled.
> The code could also check if ZK is actually depleted, and therefore treat the 
> scheduled task as bogus? This of course only treats the symptom, not the root 
> cause of this condition. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Created] (HBASE-3891) TaskMonitor is used wrong in some places

2011-05-16 Thread Lars George (JIRA)

TaskMonitor is used wrong in some places


 Key: HBASE-3891
 URL: https://issues.apache.org/jira/browse/HBASE-3891
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0
Reporter: Lars George
 Fix For: 0.92.0


I have a long running log replay in progress but none of the updates show. This 
is caused by reusing the MonitorTask references wrong, and manifests itself 
like this in the logs:

{noformat}
2011-05-16 15:22:18,127 WARN org.apache.hadoop.hbase.monitoring.TaskMonitor: 
Status org.apache.hadoop.hbase.monitoring.MonitoredTaskImpl@51bfa303 appears to 
have been leaked
2011-05-16 15:22:18,128 DEBUG org.apache.hadoop.hbase.monitoring.MonitoredTask: 
cleanup.
{noformat}

The cleanup sets the completion timestamp and causes the task to be purged from 
the list. After that the UI for example does not show any further running 
tasks, although from the logs I can see (with my log additions):

{noformat}
2011-05-16 15:29:52,296 DEBUG org.apache.hadoop.hbase.monitoring.MonitoredTask: 
setStatus: Compaction complete: 103.1m in 18542ms
2011-05-16 15:29:52,296 DEBUG org.apache.hadoop.hbase.monitoring.MonitoredTask: 
setStatus: Running coprocessor post-compact hooks
2011-05-16 15:29:52,296 DEBUG org.apache.hadoop.hbase.monitoring.MonitoredTask: 
setStatus: Compaction complete
2011-05-16 15:29:52,297 DEBUG org.apache.hadoop.hbase.monitoring.MonitoredTask: 
markComplete: Compaction complete
{noformat}

They are silently ignored as the TaskMonitor has dropped their reference. We 
need to figure out why a supposedly completed task monitor was reused.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Created] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK

2011-05-16 Thread Lars George (JIRA)

Scheduled tasks in distributed log splitting not in sync with ZK


 Key: HBASE-3890
 URL: https://issues.apache.org/jira/browse/HBASE-3890
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0
Reporter: Lars George
 Fix For: 0.92.0


This is in continuation to HBASE-3889:

Note that there must be more slightly off here. Although the splitlogs znode is 
now empty the master is still stuck here:

{noformat}
Doing distributed log split in 
hdfs://localhost:8020/hbase/.logs/10.0.0.65,60020,1305406356765  
- Waiting for distributed tasks to finish. scheduled=2 done=1 error=0   4380s

Master startup  
- Splitting logs after master startup   4388s
{noformat}

There seems to be an issue with what is in ZK and what the TaskBatch holds. In 
my case it could be related to the fact that the task was already in ZK after 
many faulty restarts because of the NPE. Maybe it was added once (since that is 
keyed by path, and that is unique on my machine), but the reference count upped 
twice? Now that the real one is done, the done counter has been increased, but 
will never match the scheduled.

The code could also check if ZK is actually depleted, and therefore treat the 
scheduled task as bogus? This of course only treats the symptom, not the root 
cause of this condition. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-3889) NPE in Distributed Log Splitting

2011-05-16 Thread Lars George (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars George updated HBASE-3889:
---

Status: Patch Available  (was: Open)

> NPE in Distributed Log Splitting
> 
>
> Key: HBASE-3889
> URL: https://issues.apache.org/jira/browse/HBASE-3889
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.92.0
> Environment: Pseudo-distributed on MacOS
>Reporter: Lars George
>Assignee: Lars George
> Fix For: 0.92.0
>
> Attachments: HBASE-3889.patch
>
>
> There is an issue with the log splitting under the specific condition of 
> edits belonging to a non existing region (which went away after a split for 
> example). The HLogSplitter fails to check the condition, which is handled on 
> a lower level, logging manifests it as 
> {noformat}
> 2011-05-16 13:56:10,300 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's 
> directory doesn't exist: 
> hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is 
> very likely that it was already split so it's safe to discard those edits.
> {noformat}
> The code returns a null reference which is not check in 
> HLogSplitter.splitLogFileToTemp():
> {code}
> ...
> WriterAndPath wap = (WriterAndPath)o;
> if (wap == null) {
>   wap = createWAP(region, entry, rootDir, tmpname, fs, conf);
>   if (wap == null) {
> logWriters.put(region, BAD_WRITER);
>   } else {
> logWriters.put(region, wap);
>   }
> }
> wap.w.append(entry);
> ...
> {code}
> The createWAP does return "null" when the above message is logged based on 
> the obsolete region reference in the edit.
> What made this difficult to detect is that the error (and others) are 
> silently ignored in SplitLogWorker.grabTask(). I added a catch and error 
> logging to see the NPE that was caused by the above.
> {code}
> ...
>   break;
>   }
> } catch (Exception e) {
>   LOG.error("An error occurred.", e);
> } finally {
>   if (t > 0) {
> ...
> {code}
> As a side note, there are other errors/asserts triggered that this 
> try/finally not handles. For example
> {noformat}
> 2011-05-16 13:58:30,647 WARN 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to 
> assert ownership for 
> /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
> BadVersion for 
> /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265)
> at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432)
> at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164)
> at java.lang.Thread.run(Thread.java:680)
> {noformat}
> This should probably be handled - or at least documented - in another issue?
> The NPE made the log split end and the SplitLogManager add an endless amount 
> of RESCAN entries as this never came to an end.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3889) NPE in Distributed Log Splitting

2011-05-16 Thread Lars George (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033994#comment-13033994
 ] 

Lars George commented on HBASE-3889:


After adding the patch I got eventually this

{noformat}
...
2011-05-16 14:15:31,428 DEBUG org.apache.hadoop.hbase.monitoring.MonitoredTask: 
markComplete: processed 50012 edits across 47 regions threw away edits for 13 
regions log file = 
hdfs://localhost/hbase/.logs/10.0.0.65,60020,1305406356765/10.0.0.65%2C60020%2C1305406356765.1305409968389
 is corrupted = false
2011-05-16 14:15:31,428 DEBUG 
org.apache.hadoop.hbase.regionserver.SplitLogWorker: After Exec Status: DONE
2011-05-16 14:15:31,429 INFO 
org.apache.hadoop.hbase.regionserver.SplitLogWorker: successfully transitioned 
task 
/hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 to final state done
2011-05-16 14:15:31,429 INFO 
org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker 
10.0.0.64,60020,1305546944326 done with task 
/hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
 in 879381ms
...
{noformat}

This was a 15GB edits that lingered since I had a crashed YSCB bulk load. The 
extra messages above are from log statements I added everywhere to see what is 
going on (might make sense to add a debug level output of 
MonitoredTaskImpl.setStatus(), was helpful here to see where it got lost).


> NPE in Distributed Log Splitting
> 
>
> Key: HBASE-3889
> URL: https://issues.apache.org/jira/browse/HBASE-3889
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.92.0
> Environment: Pseudo-distributed on MacOS
>Reporter: Lars George
>Assignee: Lars George
> Fix For: 0.92.0
>
> Attachments: HBASE-3889.patch
>
>
> There is an issue with the log splitting under the specific condition of 
> edits belonging to a non existing region (which went away after a split for 
> example). The HLogSplitter fails to check the condition, which is handled on 
> a lower level, logging manifests it as 
> {noformat}
> 2011-05-16 13:56:10,300 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's 
> directory doesn't exist: 
> hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is 
> very likely that it was already split so it's safe to discard those edits.
> {noformat}
> The code returns a null reference which is not check in 
> HLogSplitter.splitLogFileToTemp():
> {code}
> ...
> WriterAndPath wap = (WriterAndPath)o;
> if (wap == null) {
>   wap = createWAP(region, entry, rootDir, tmpname, fs, conf);
>   if (wap == null) {
> logWriters.put(region, BAD_WRITER);
>   } else {
> logWriters.put(region, wap);
>   }
> }
> wap.w.append(entry);
> ...
> {code}
> The createWAP does return "null" when the above message is logged based on 
> the obsolete region reference in the edit.
> What made this difficult to detect is that the error (and others) are 
> silently ignored in SplitLogWorker.grabTask(). I added a catch and error 
> logging to see the NPE that was caused by the above.
> {code}
> ...
>   break;
>   }
> } catch (Exception e) {
>   LOG.error("An error occurred.", e);
> } finally {
>   if (t > 0) {
> ...
> {code}
> As a side note, there are other errors/asserts triggered that this 
> try/finally not handles. For example
> {noformat}
> 2011-05-16 13:58:30,647 WARN 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to 
> assert ownership for 
> /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
> BadVersion for 
> /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265)
> at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432)
>

[jira] [Assigned] (HBASE-3889) NPE in Distributed Log Splitting

2011-05-16 Thread Lars George (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars George reassigned HBASE-3889:
--

Assignee: Lars George

> NPE in Distributed Log Splitting
> 
>
> Key: HBASE-3889
> URL: https://issues.apache.org/jira/browse/HBASE-3889
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.92.0
> Environment: Pseudo-distributed on MacOS
>Reporter: Lars George
>Assignee: Lars George
> Fix For: 0.92.0
>
> Attachments: HBASE-3889.patch
>
>
> There is an issue with the log splitting under the specific condition of 
> edits belonging to a non existing region (which went away after a split for 
> example). The HLogSplitter fails to check the condition, which is handled on 
> a lower level, logging manifests it as 
> {noformat}
> 2011-05-16 13:56:10,300 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's 
> directory doesn't exist: 
> hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is 
> very likely that it was already split so it's safe to discard those edits.
> {noformat}
> The code returns a null reference which is not check in 
> HLogSplitter.splitLogFileToTemp():
> {code}
> ...
> WriterAndPath wap = (WriterAndPath)o;
> if (wap == null) {
>   wap = createWAP(region, entry, rootDir, tmpname, fs, conf);
>   if (wap == null) {
> logWriters.put(region, BAD_WRITER);
>   } else {
> logWriters.put(region, wap);
>   }
> }
> wap.w.append(entry);
> ...
> {code}
> The createWAP does return "null" when the above message is logged based on 
> the obsolete region reference in the edit.
> What made this difficult to detect is that the error (and others) are 
> silently ignored in SplitLogWorker.grabTask(). I added a catch and error 
> logging to see the NPE that was caused by the above.
> {code}
> ...
>   break;
>   }
> } catch (Exception e) {
>   LOG.error("An error occurred.", e);
> } finally {
>   if (t > 0) {
> ...
> {code}
> As a side note, there are other errors/asserts triggered that this 
> try/finally not handles. For example
> {noformat}
> 2011-05-16 13:58:30,647 WARN 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to 
> assert ownership for 
> /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
> BadVersion for 
> /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265)
> at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432)
> at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164)
> at java.lang.Thread.run(Thread.java:680)
> {noformat}
> This should probably be handled - or at least documented - in another issue?
> The NPE made the log split end and the SplitLogManager add an endless amount 
> of RESCAN entries as this never came to an end.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-3889) NPE in Distributed Log Splitting

2011-05-16 Thread Lars George (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars George updated HBASE-3889:
---

Attachment: HBASE-3889.patch

Trivial patch, skips null reference, discarding the obsolete edit.

> NPE in Distributed Log Splitting
> 
>
> Key: HBASE-3889
> URL: https://issues.apache.org/jira/browse/HBASE-3889
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.92.0
> Environment: Pseudo-distributed on MacOS
>Reporter: Lars George
> Fix For: 0.92.0
>
> Attachments: HBASE-3889.patch
>
>
> There is an issue with the log splitting under the specific condition of 
> edits belonging to a non existing region (which went away after a split for 
> example). The HLogSplitter fails to check the condition, which is handled on 
> a lower level, logging manifests it as 
> {noformat}
> 2011-05-16 13:56:10,300 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's 
> directory doesn't exist: 
> hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is 
> very likely that it was already split so it's safe to discard those edits.
> {noformat}
> The code returns a null reference which is not check in 
> HLogSplitter.splitLogFileToTemp():
> {code}
> ...
> WriterAndPath wap = (WriterAndPath)o;
> if (wap == null) {
>   wap = createWAP(region, entry, rootDir, tmpname, fs, conf);
>   if (wap == null) {
> logWriters.put(region, BAD_WRITER);
>   } else {
> logWriters.put(region, wap);
>   }
> }
> wap.w.append(entry);
> ...
> {code}
> The createWAP does return "null" when the above message is logged based on 
> the obsolete region reference in the edit.
> What made this difficult to detect is that the error (and others) are 
> silently ignored in SplitLogWorker.grabTask(). I added a catch and error 
> logging to see the NPE that was caused by the above.
> {code}
> ...
>   break;
>   }
> } catch (Exception e) {
>   LOG.error("An error occurred.", e);
> } finally {
>   if (t > 0) {
> ...
> {code}
> As a side note, there are other errors/asserts triggered that this 
> try/finally not handles. For example
> {noformat}
> 2011-05-16 13:58:30,647 WARN 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to 
> assert ownership for 
> /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
> BadVersion for 
> /hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265)
> at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432)
> at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191)
> at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164)
> at java.lang.Thread.run(Thread.java:680)
> {noformat}
> This should probably be handled - or at least documented - in another issue?
> The NPE made the log split end and the SplitLogManager add an endless amount 
> of RESCAN entries as this never came to an end.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Created] (HBASE-3889) NPE in Distributed Log Splitting

2011-05-16 Thread Lars George (JIRA)

NPE in Distributed Log Splitting


 Key: HBASE-3889
 URL: https://issues.apache.org/jira/browse/HBASE-3889
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.0
 Environment: Pseudo-distributed on MacOS
Reporter: Lars George
 Fix For: 0.92.0


There is an issue with the log splitting under the specific condition of edits 
belonging to a non existing region (which went away after a split for example). 
The HLogSplitter fails to check the condition, which is handled on a lower 
level, logging manifests it as 

{noformat}
2011-05-16 13:56:10,300 INFO 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's directory 
doesn't exist: 
hdfs://localhost:8020/hbase/usertable/30c4d0a47703214845d0676d0c7b36f0. It is 
very likely that it was already split so it's safe to discard those edits.
{noformat}

The code returns a null reference which is not check in 
HLogSplitter.splitLogFileToTemp():

{code}
...
WriterAndPath wap = (WriterAndPath)o;
if (wap == null) {
  wap = createWAP(region, entry, rootDir, tmpname, fs, conf);
  if (wap == null) {
logWriters.put(region, BAD_WRITER);
  } else {
logWriters.put(region, wap);
  }
}
wap.w.append(entry);
...
{code}

The createWAP does return "null" when the above message is logged based on the 
obsolete region reference in the edit.

What made this difficult to detect is that the error (and others) are silently 
ignored in SplitLogWorker.grabTask(). I added a catch and error logging to see 
the NPE that was caused by the above.

{code}
...
  break;
  }
} catch (Exception e) {
  LOG.error("An error occurred.", e);
} finally {
  if (t > 0) {
...
{code}

As a side note, there are other errors/asserts triggered that this try/finally 
not handles. For example

{noformat}
2011-05-16 13:58:30,647 WARN 
org.apache.hadoop.hbase.regionserver.SplitLogWorker: BADVERSION failed to 
assert ownership for 
/hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
BadVersion for 
/hbase/splitlog/hdfs%3A%2F%2Flocalhost%2Fhbase%2F.logs%2F10.0.0.65%2C60020%2C1305406356765%2F10.0.0.65%252C60020%252C1305406356765.1305409968389
at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
at 
org.apache.hadoop.hbase.regionserver.SplitLogWorker.ownTask(SplitLogWorker.java:329)
at 
org.apache.hadoop.hbase.regionserver.SplitLogWorker.access$100(SplitLogWorker.java:68)
at 
org.apache.hadoop.hbase.regionserver.SplitLogWorker$2.progress(SplitLogWorker.java:265)
at 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:432)
at 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:354)
at 
org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
at 
org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:260)
at 
org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:191)
at 
org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:164)
at java.lang.Thread.run(Thread.java:680)
{noformat}

This should probably be handled - or at least documented - in another issue?

The NPE made the log split end and the SplitLogManager add an endless amount of 
RESCAN entries as this never came to an end.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-3892) Table can't disable

[jira] [Created] (HBASE-3892) Table can't disable

[jira] [Updated] (HBASE-3889) NPE in Distributed Log Splitting

[jira] [Commented] (HBASE-1316) ZooKeeper: use native threads to avoid GC stalls (JNI integration)

[jira] [Commented] (HBASE-1316) ZooKeeper: use native threads to avoid GC stalls (JNI integration)

[jira] [Commented] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK

[jira] [Commented] (HBASE-3889) NPE in Distributed Log Splitting

[jira] [Updated] (HBASE-3889) NPE in Distributed Log Splitting

[jira] [Commented] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK

[jira] [Commented] (HBASE-3889) NPE in Distributed Log Splitting

[jira] [Commented] (HBASE-2647) TestLogRolling flaky in trunk

[jira] [Commented] (HBASE-3889) NPE in Distributed Log Splitting

[jira] [Commented] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK

[jira] [Created] (HBASE-3891) TaskMonitor is used wrong in some places

[jira] [Created] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK

[jira] [Updated] (HBASE-3889) NPE in Distributed Log Splitting

[jira] [Commented] (HBASE-3889) NPE in Distributed Log Splitting

[jira] [Assigned] (HBASE-3889) NPE in Distributed Log Splitting

[jira] [Updated] (HBASE-3889) NPE in Distributed Log Splitting

[jira] [Created] (HBASE-3889) NPE in Distributed Log Splitting

20 matches

Site Navigation

Mail list logo

Footer information